bcgsc / NanoSim

Nanopore sequence read simulator
Other
217 stars 51 forks source link

Simulation stops and hangs #112

Open molleraj opened 3 years ago

molleraj commented 3 years ago

Hi there,

I am trying to simulate nanopore amplicon reads (~6.5 kbp longest amplicon, ~37 kb total sequence) but the simulation part seems to hang and never proceed. I was able to train an error profile:

~/NanoSim/src/read_analysis.py genome -i ~/VISA_amplicon_bonito/DNA_mix/N384_parent_amplicon.fasta -rg WT_multiplex.fa -o parent_read_training -t 16 &

But when I run the simulation, I get preliminary files made and then the simulation just hangs for hours. How long should I expect a simulation of ~300,000 reads to take?

This is the command I am using for simulation. I get no errors when running, it just stops doing anything after the log given.

(base) agmolle@emergent:~/nanopore_sim3$ ~/NanoSim/src/simulator.py genome -rg WT_multiplex.fa -c parent_read_training -o N384_parent2 -n 300000 -t 16 -dna_type linear &
[1] 25355
(base) agmolle@emergent:~/nanopore_sim3$
running the code with following parameters:

ref_g WT_multiplex.fa
model_prefix parent_read_training
out N384_parent2
number [300000]
perfect False
kmer_bias None
basecaller None
dna_type linear
strandness None
sd_len None
median_len None
max_len inf
min_len 50
fastq False
chimeric False
num_threads 16
2021-03-23 11:41:23: /home/agmolle/NanoSim/src/simulator.py genome -rg WT_multiplex.fa -c parent_read_training -o N384_parent2 -n 300000 -t 16 -dna_type linear
2021-03-23 11:41:23: Read in reference
2021-03-23 11:41:23: Read error profile
2021-03-23 11:41:23: Read KDF of unaligned reads
2021-03-23 11:41:23: Read KDF of aligned reads
2021-03-23 11:41:23: Read chimeric simulation information
2021-03-23 11:41:23: Start simulation of aligned reads
2021-03-23 11:41:23: Number of reads simulated >> 200
(base) agmolle@emergent:~/nanopore_sim3$

Any help would be greatly appreciated.

Jon

molleraj commented 3 years ago

I just tried the same with a full genome (2.8 Mb) as a template and it works great. That genome is also normalized. I wonder if I need to adjust the maximum read length to less than the smallest contig.

SaberHQ commented 3 years ago

Hi @molleraj Thanks for using NanoSim.

So when you say you used the full genome as a template, do you mean you used it for training? Basically, how NanoSim works is that it learns the read length distribution in the training phase and uses Kernel Density Distribution (KDE) to sample from that distribution in the simulation phase. Can you elaborate a bit more on what you changed in your second try that made it work.

I am also pinging @cheny19 to take a look at this.

agshumate commented 3 years ago

I also experience the issue of simulatory.py hanging and not actually simulating any reads. Here is my command and parameters

ref_g GRCh38.fa ref_t hg38c_protein_and_lncRNA_sorted.fa exp hg38_stranded_spacer.tpm model_prefix training out simulated_dRNA number 100000 perfect False kmer_bias None basecaller guppy read_type dRNA model_ir True dna_type transcriptome strandness None max_len inf min_len 50 uracil False polya None fastq True num_threads 1 2021-03-23 14:57:33: /home/ashumate/miniconda/envs/myenv/bin/simulator.py transcriptome -rt hg38c_protein_and_lncRNA_sorted.fa -rg GRCh38.fa -e hg38_stranded_spacer.tpm -r dRNA -n 100000 --fastq -o simulated_dRNA -b guppy -c training 2021-03-23 14:57:33: Read in reference 2021-03-23 14:57:41: Read in expression profile 2021-03-23 14:57:41: Read in reference genome and create .fai index file 2021-03-23 14:57:41: Read in IR markov model 2021-03-23 14:57:41: Read in GFF3 annotation file 2021-03-23 14:57:41: Read error profile 2021-03-23 14:57:41: Read KDF of unaligned reads 2021-03-23 14:57:42: Read KDF of aligned reads 2021-03-23 14:57:43: Start simulation of aligned reads

molleraj commented 3 years ago

Hi @molleraj Thanks for using NanoSim.

So when you say you used the full genome as a template, do you mean you used it for training? Basically, how NanoSim works is that it learns the read length distribution in the training phase and uses Kernel Density Distribution (KDE) to sample from that distribution in the simulation phase. Can you elaborate a bit more on what you changed in your second try that made it work.

I am also pinging @cheny19 to take a look at this.

Hey @SaberHQ, I then concatenated all amplicons into one sequence and then the simulation was successful. It seems the simulator can't handle FASTA files with multiple sequences.

agshumate commented 3 years ago

my issue was this line if transcript_id.startswith("ENS") and tpm > 0:

my transcripts in my expression file do not start with ENS so the dictionary of expression values was empty causing the program to get stuck in an infinite loop. is there a reason why tran ids must start with ENS?

SaberHQ commented 3 years ago

@agshumate Good catch. The reason I implemented it that way is to follow the official format of annotation files. The reference transcriptomes and their annotation files contains trx ids starting with ENS. Would you mind to tell me how did you calculated your expression values? I assume you have to align bunch of reads against your reference transcriptome to quantify their expression. At that step, transcripts in your reference should usually contain ENS if I am not wrong. I guess you didn't use a reference transcriptome in your case. I can totally understand that and it is possible when someone works with draft transcriptomes.

Now that you brought this issue up, I will take a look at it and see if we can get rid of that "if" statement. I may also add more information to the README file to explain expected expression file format in details. Thanks a lot.

SaberHQ commented 3 years ago

@molleraj I should confirm that NanoSim CAN handle FASTA files with multiple sequences. I don't think that was the reason in your case. Is it possible that it is related to the issue @agshumate mentioned here? Please double check your expression file format and make sure everything looks fine.

agshumate commented 3 years ago

@SaberHQ we're using the RefSeq annotation which does not use ENS as prefixes to transcript ids. As far as i know that is specific to ensembl annotations. i deleted that if statement and it seems to be working great for me now. Not sure if it will cause any downstream problems but so far so good.

SaberHQ commented 3 years ago

@agshumate I believe there will be no downstream problems. Removing it would be also beneficial for cases in which people use draft assembly transcriptomes of their own. I will leave it to @cheny19 to take a final look and make sure removing that "if" statement is fine.

Thanks again for reporting this.

molleraj commented 3 years ago

@molleraj I should confirm that NanoSim CAN handle FASTA files with multiple sequences. I don't think that was the reason in your case. Is it possible that it is related to the issue @agshumate mentioned here? Please double check your expression file format and make sure everything looks fine.

Hey @SaberHQ, I am using NanoSim in genome rather than transcriptome mode with the reference genome having one sequence rather than multiple. I think it's specifically that NanoSim doesn't like a reference genome in multiple pieces, but I would be more than happy to be wrong.

SaberHQ commented 3 years ago

@molleraj I should confirm that NanoSim CAN handle FASTA files with multiple sequences. I don't think that was the reason in your case. Is it possible that it is related to the issue @agshumate mentioned here? Please double check your expression file format and make sure everything looks fine.

Hey @SaberHQ, I am using NanoSim in genome rather than transcriptome mode with the reference genome having one sequence rather than multiple. I think it's specifically that NanoSim doesn't like a reference genome in multiple pieces, but I would be more than happy to be wrong.

I see. Would you please clarify what "reference genome in multiple pieces" exactly mean? Does that mean several FASTA files or a single FASTA file with several lines in it? I believe NanoSim can handle a FASTA file with multiple lines. I will double check that to make sure.

molleraj commented 3 years ago

Hey @SaberHQ, I am using NanoSim in genome rather than transcriptome mode with the reference genome having one sequence rather than multiple. I think it's specifically that NanoSim doesn't like a reference genome in multiple pieces, but I would be more than happy to be wrong.

I see. Would you please clarify what "reference genome in multiple pieces" exactly mean? Does that mean several FASTA files or a single FASTA file with several lines in it? I believe NanoSim can handle a FASTA file with multiple lines. I will double check that to make sure.

@SaberHQ I mean a FASTA file with multiple entries (>1 ... >2 ... >3 ... and so on). There are most certainly multiple lines, but simulation only works when these are all under one id (>single_sequence ...). I hope this clarifies my statement.

cheny19 commented 3 years ago

Hi @molleraj,

Sorry for the late reply.

NanoSim can take FASTA file with multiple entries. If you concatenate all sequences into one, you may have simulated reads that span multiple amplicons.

Theoretically, if you intend to simulate a dataset with similar read length distributions as your training dataset, and the references are the same, it shouldn't hang there. But now it seems that the program ran into an infinite loop because it was trying to find a read length that can fit in the reference sequence. This could happen when the read length is close to the reference. So I'd suggest you try using the transcriptome mode, and supply a uniform expression profile to mimic the amplicons. Let me know if you have any questions when trying that out.

Thanks, Chen

dacheampong commented 2 years ago

I would like to find out if there's a limit on the number of reads one can specify for simulating metagenomic reads. I am trying to generate metagenomic reads for 20 bacteria species. Trying smaller number of reads, say 1000, I am able to generate the metagenomic reads but after increasing it to 100000, the simulation hangs up and never completes...,have it running for 2 days. Here is my command

simulator.py metagenome -gl metagenome_list.tsv -a abun.tsv -t 35 -dl dna.tsv -c training

running the code with following parameters:

genome_list metagenome_list.tsv abun abun.tsv dna_type_list dna.tsv model_prefix training out simulated perfect False kmer_bias None basecaller None strandness None sd_len None median_len None max_len inf min_len 50 abun_var None fastq False chimeric False num_threads 35 2021-10-21 01:31:28: /home/simulator.py metagenome -gl metagenome_list.tsv -a abun.tsv -t 35 -dl dna.tsv -c training 2021-10-21 01:31:28: Read in reference 2021-10-21 01:31:29: Read in abundance profile 2021-10-21 01:31:29: Read error profile 2021-10-21 01:31:29: Read KDF of unaligned reads 2021-10-21 01:31:29: Read KDF of aligned reads 2021-10-21 01:31:32: Read chimeric simulation information 2021-10-21 01:31:32: Simulating sample sample0 2021-10-21 01:31:32: Start simulation of aligned reads .

Not sure what is causing the simulation to hang, thanks

cheny19 commented 2 years ago

Hi @dacheampong , could you try with less threads? Normally we do not suggest simulating a small number of reads with so many reads, because it will affect the accuracy in abundance simulation. It is also possible to hang because it spends so much time to find a good sequence length to fit the expected abundances. It should finish within a few minutes with 1-4 threads to simulate 10000 reads according to our test.

SaberHQ commented 2 years ago

Dear @agshumate hope you are doing well. Please note that pull request #158 solves the issue with "ENS" ids in expression profiles/reference transcriptome. Sorry for this late pull request and update. This issue totally slipped my mind somehow.

As for comments from @dacheampong and @molleraj , I hope you found @cheny19's suggestions useful. Please let us know if you were able to successfully run NanoSim and use it for your analysis.

EmersonH2 commented 2 years ago

Hi @SaberHQ & @cheny19 ! I am also running into the same issue where the simulation stops and hangs. My expression_simulation.tsv only has one transcript target with est_counts = 100 & tpm = 100, and is formatted correctly. A non-empty simulation output fastq and error profile is produced, but the simulation fails to complete. Please let me know if you have any thoughts/suggestions!

running the code with following parameters:

ref_g /references/hg38.fa
ref_t /references/gencode.v39.transcripts.fa
exp expression_simulation.tsv
model_prefix /characterization_output/sample_test
out /simulated_output/sample_test
number [20000]
perfect False
kmer_bias 6
basecaller guppy
read_type cDNA_1D
model_ir False
dna_type transcriptome
strandness None
max_len inf
min_len 50
uracil False
polya None
fastq True
num_threads 1
2022-04-29 12:03:28: src/simulator.py transcriptome -rg references/hg38.fa -rt /references/gencode.v39.transcripts.fa -e expression_simulation.tsv -c /characterization_output/sample_test -o /simulated_output/sample_test -k 6 -b guppy -r cDNA_1D --no_model_ir --fastq -t 1
2022-04-29 12:03:29: Read in reference 
2022-04-29 12:03:31: Read in expression profile
2022-04-29 12:03:31: Read error profile
2022-04-29 12:03:31: Read KDF of unaligned reads
2022-04-29 12:03:31: Read KDF of aligned reads
2022-04-29 12:03:32: Read chimeric simulation information
2022-04-29 12:03:32: Start simulation of aligned reads
2022-04-29 12:03:32: Number of reads simulated >> 1^M