bcgsc / NanoSim

Nanopore sequence read simulator

How to simulate amplicon-seq data? #221

Open capoony opened 2 months ago

capoony commented 2 months ago

Hi all,

apologies for yet another request! I want to simulate ONT amplicon-seq reads with NanoSim, but the simulation step never finishes (at least not within hours).

I have a reference sequence based on Sanger sequencing of the amplicon (Stor1_cox1.fa). In addition, I have ONT data of the same amplicon (COX1.fastq), which I could use for model training.

Following your suggestion in issue 112, I am using the "transcriptome" method.

conda activate nanosim

read_analysis.py transcriptome \
    -i ${wd}Syrphid/results/demo_ext/data/demultiplexed/Stor-1/COX1.fastq \
    -rg ${wd}simulations/data/Stor1_cox1.fa \
    -rt ${wd}simulations/data/Stor1_cox1.fa \
    -o ${wd}simulations/data/COX1_training \
    --no_intron_retention \
    -t 100

This finishes without error. However, when I try to use the model for simulation, the script gets stuck, even when simulating only 100 reads.

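# Tab-separated expression table with the columns target_id, est_counts, tpm.
# Note the double "t" in "\ttpm": "\tpm" would expand to a tab followed by "pm".
printf 'target_id\test_counts\ttpm\nENSStor-1\t1000\t1000\n' > ${wd}simulations/data/Stor1_cox1.exp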

simulator.py transcriptome \
    -rt ${wd}simulations/data/Stor1_cox1.fa \
    -c ${wd}simulations/data/COX1_training \
    -o ${wd}simulations/data/Stor1_cox1_sim \
    -e ${wd}simulations/data/Stor1_cox1.exp \
    -n 100 \
    --no_model_ir \
    -t 4

Can you help me with this?
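
One thing that can be checked independently of NanoSim is whether every target_id in the expression file has a matching sequence name in the reference FASTA. Below is a minimal sketch, assuming the first tab-separated column of the expression file presumably has to correspond to the FASTA name up to the first whitespace. The function name missing_ids is hypothetical and not part of NanoSim, and I do not know whether a mismatch is actually what makes the simulator hang.

```shell
# Hypothetical helper (not part of NanoSim): print every target_id from an
# expression table that has no matching sequence name in a FASTA file.
# Usage: missing_ids <expression.tsv> <reference.fa>
missing_ids() {
    exp_file="$1"
    fasta_file="$2"
    awk -F'\t' '
        # First file (FASTA): remember each name, i.e. the text after ">"
        # up to the first space or tab.
        NR == FNR {
            if (/^>/) {
                name = substr($0, 2)
                sub(/[ \t].*/, "", name)
                seen[name] = 1
            }
            next
        }
        # Second file (expression table): skip the header line and report
        # IDs in column 1 that were not seen in the FASTA.
        FNR > 1 && $1 != "" && !($1 in seen) { print $1 }
    ' "$fasta_file" "$exp_file"
}
```

With the files from this post, the call would be missing_ids ${wd}simulations/data/Stor1_cox1.exp ${wd}simulations/data/Stor1_cox1.fa; empty output means every ID was found.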

Moreover, I am wondering whether this model can also be used for other, longer amplicons. I fear not, if I understand the logic correctly, since the model was trained on reads from this one amplicon. What would you recommend when no amplicon-specific training data are available?
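
To make the length concern concrete, the read lengths in the training FASTQ can be summarised and compared with the length of the amplicon to be simulated. This is a minimal sketch, assuming a standard FASTQ with exactly four lines per record; the function name fastq_length_summary is hypothetical and not part of NanoSim.

```shell
# Hypothetical helper (not part of NanoSim): summarise the read lengths of a
# FASTQ file with four lines per record as count, minimum, mean and maximum.
# Usage: fastq_length_summary <reads.fastq>
fastq_length_summary() {
    awk '
        # The sequence is the second line of every four-line record.
        NR % 4 == 2 {
            len = length($0)
            n++
            sum += len
            if (n == 1 || len < min) min = len
            if (n == 1 || len > max) max = len
        }
        END {
            if (n == 0) { print "n=0"; exit }
            printf "n=%d min=%d mean=%.1f max=%d\n", n, min, sum / n, max
        }
    ' "$1"
}
```

For the training data in this post, the call would be fastq_length_summary ${wd}Syrphid/results/demo_ext/data/demultiplexed/Stor-1/COX1.fastq.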

Thanks a lot,

Martin

Attachment: Testdata.zip