Open molleraj opened 3 years ago
I just tried the same with a full genome (2.8 Mb) as a template and it works great. That genome is also normalized. I wonder if I need to adjust the maximum read length to less than the smallest contig.
Hi @molleraj Thanks for using NanoSim.
So when you say you used the full genome as a template, do you mean you used it for training? Basically, how NanoSim works is that it learns the read length distribution in the training phase and uses Kernel Density Distribution (KDE) to sample from that distribution in the simulation phase. Can you elaborate a bit more on what you changed in your second try that made it work.
I am also pinging @cheny19 to take a look at this.
I also experience the issue of simulatory.py hanging and not actually simulating any reads. Here is my command and parameters
ref_g GRCh38.fa ref_t hg38c_protein_and_lncRNA_sorted.fa exp hg38_stranded_spacer.tpm model_prefix training out simulated_dRNA number 100000 perfect False kmer_bias None basecaller guppy read_type dRNA model_ir True dna_type transcriptome strandness None max_len inf min_len 50 uracil False polya None fastq True num_threads 1 2021-03-23 14:57:33: /home/ashumate/miniconda/envs/myenv/bin/simulator.py transcriptome -rt hg38c_protein_and_lncRNA_sorted.fa -rg GRCh38.fa -e hg38_stranded_spacer.tpm -r dRNA -n 100000 --fastq -o simulated_dRNA -b guppy -c training 2021-03-23 14:57:33: Read in reference 2021-03-23 14:57:41: Read in expression profile 2021-03-23 14:57:41: Read in reference genome and create .fai index file 2021-03-23 14:57:41: Read in IR markov model 2021-03-23 14:57:41: Read in GFF3 annotation file 2021-03-23 14:57:41: Read error profile 2021-03-23 14:57:41: Read KDF of unaligned reads 2021-03-23 14:57:42: Read KDF of aligned reads 2021-03-23 14:57:43: Start simulation of aligned reads
Hi @molleraj Thanks for using NanoSim.
So when you say you used the full genome as a template, do you mean you used it for training? Basically, how NanoSim works is that it learns the read length distribution in the training phase and uses Kernel Density Distribution (KDE) to sample from that distribution in the simulation phase. Can you elaborate a bit more on what you changed in your second try that made it work.
I am also pinging @cheny19 to take a look at this.
Hey @SaberHQ, I then concatenated all amplicons into one sequence and then the simulation was successful. It seems the simulator can't handle FASTA files with multiple sequences.
my issue was this line
if transcript_id.startswith("ENS") and tpm > 0:
my transcripts in my expression file do not start with ENS so the dictionary of expression values was empty causing the program to get stuck in an infinite loop. is there a reason why tran ids must start with ENS?
@agshumate Good catch. The reason I implemented it that way is to follow the official format of annotation files. The reference transcriptomes and their annotation files contains trx ids starting with ENS. Would you mind to tell me how did you calculated your expression values? I assume you have to align bunch of reads against your reference transcriptome to quantify their expression. At that step, transcripts in your reference should usually contain ENS if I am not wrong. I guess you didn't use a reference transcriptome in your case. I can totally understand that and it is possible when someone works with draft transcriptomes.
Now that you brought this issue up, I will take a look at it and see if we can get rid of that "if" statement. I may also add more information to the README file to explain expected expression file format in details. Thanks a lot.
@molleraj I should confirm that NanoSim CAN handle FASTA files with multiple sequences. I don't think that was the reason in your case. Is it possible that it is related to the issue @agshumate mentioned here? Please double check your expression file format and make sure everything looks fine.
@SaberHQ we're using the RefSeq annotation which does not use ENS as prefixes to transcript ids. As far as i know that is specific to ensembl annotations. i deleted that if statement and it seems to be working great for me now. Not sure if it will cause any downstream problems but so far so good.
@agshumate I believe there will be no downstream problems. Removing it would be also beneficial for cases in which people use draft assembly transcriptomes of their own. I will leave it to @cheny19 to take a final look and make sure removing that "if" statement is fine.
Thanks again for reporting this.
@molleraj I should confirm that NanoSim CAN handle FASTA files with multiple sequences. I don't think that was the reason in your case. Is it possible that it is related to the issue @agshumate mentioned here? Please double check your expression file format and make sure everything looks fine.
Hey @SaberHQ, I am using NanoSim in genome rather than transcriptome mode with the reference genome having one sequence rather than multiple. I think it's specifically that NanoSim doesn't like a reference genome in multiple pieces, but I would be more than happy to be wrong.
@molleraj I should confirm that NanoSim CAN handle FASTA files with multiple sequences. I don't think that was the reason in your case. Is it possible that it is related to the issue @agshumate mentioned here? Please double check your expression file format and make sure everything looks fine.
Hey @SaberHQ, I am using NanoSim in genome rather than transcriptome mode with the reference genome having one sequence rather than multiple. I think it's specifically that NanoSim doesn't like a reference genome in multiple pieces, but I would be more than happy to be wrong.
I see. Would you please clarify what "reference genome in multiple pieces" exactly mean? Does that mean several FASTA files or a single FASTA file with several lines in it? I believe NanoSim can handle a FASTA file with multiple lines. I will double check that to make sure.
Hey @SaberHQ, I am using NanoSim in genome rather than transcriptome mode with the reference genome having one sequence rather than multiple. I think it's specifically that NanoSim doesn't like a reference genome in multiple pieces, but I would be more than happy to be wrong.
I see. Would you please clarify what "reference genome in multiple pieces" exactly mean? Does that mean several FASTA files or a single FASTA file with several lines in it? I believe NanoSim can handle a FASTA file with multiple lines. I will double check that to make sure.
@SaberHQ I mean a FASTA file with multiple entries (>1 ... >2 ... >3 ... and so on). There are most certainly multiple lines, but simulation only works when these are all under one id (>single_sequence ...). I hope this clarifies my statement.
Hi @molleraj,
Sorry for the late reply.
NanoSim can take FASTA file with multiple entries. If you concatenate all sequences into one, you may have simulated reads that span multiple amplicons.
Theoretically, if you intend to simulate a dataset with similar read length distributions as your training dataset, and the references are the same, it shouldn't hang there. But now it seems that the program ran into an infinite loop because it was trying to find a read length that can fit in the reference sequence. This could happen when the read length is close to the reference. So I'd suggest you try using the transcriptome mode, and supply a uniform expression profile to mimic the amplicons. Let me know if you have any questions when trying that out.
Thanks, Chen
I would like to find out if there's a limit on the number of reads one can specify for simulating metagenomic reads. I am trying to generate metagenomic reads for 20 bacteria species. Trying smaller number of reads, say 1000, I am able to generate the metagenomic reads but after increasing it to 100000, the simulation hangs up and never completes...,have it running for 2 days. Here is my command
simulator.py metagenome -gl metagenome_list.tsv -a abun.tsv -t 35 -dl dna.tsv -c training
running the code with following parameters:
genome_list metagenome_list.tsv abun abun.tsv dna_type_list dna.tsv model_prefix training out simulated perfect False kmer_bias None basecaller None strandness None sd_len None median_len None max_len inf min_len 50 abun_var None fastq False chimeric False num_threads 35 2021-10-21 01:31:28: /home/simulator.py metagenome -gl metagenome_list.tsv -a abun.tsv -t 35 -dl dna.tsv -c training 2021-10-21 01:31:28: Read in reference 2021-10-21 01:31:29: Read in abundance profile 2021-10-21 01:31:29: Read error profile 2021-10-21 01:31:29: Read KDF of unaligned reads 2021-10-21 01:31:29: Read KDF of aligned reads 2021-10-21 01:31:32: Read chimeric simulation information 2021-10-21 01:31:32: Simulating sample sample0 2021-10-21 01:31:32: Start simulation of aligned reads .
Not sure what is causing the simulation to hang, thanks
Hi @dacheampong , could you try with less threads? Normally we do not suggest simulating a small number of reads with so many reads, because it will affect the accuracy in abundance simulation. It is also possible to hang because it spends so much time to find a good sequence length to fit the expected abundances. It should finish within a few minutes with 1-4 threads to simulate 10000 reads according to our test.
Dear @agshumate hope you are doing well. Please note that pull request #158 solves the issue with "ENS" ids in expression profiles/reference transcriptome. Sorry for this late pull request and update. This issue totally slipped my mind somehow.
As for comments from @dacheampong and @molleraj , I hope you found @cheny19's suggestions useful. Please let us know if you were able to successfully run NanoSim and use it for your analysis.
Hi @SaberHQ & @cheny19 ! I am also running into the same issue where the simulation stops and hangs. My expression_simulation.tsv only has one transcript target with est_counts = 100 & tpm = 100, and is formatted correctly. A non-empty simulation output fastq and error profile is produced, but the simulation fails to complete. Please let me know if you have any thoughts/suggestions!
running the code with following parameters:
ref_g /references/hg38.fa
ref_t /references/gencode.v39.transcripts.fa
exp expression_simulation.tsv
model_prefix /characterization_output/sample_test
out /simulated_output/sample_test
number [20000]
perfect False
kmer_bias 6
basecaller guppy
read_type cDNA_1D
model_ir False
dna_type transcriptome
strandness None
max_len inf
min_len 50
uracil False
polya None
fastq True
num_threads 1
2022-04-29 12:03:28: src/simulator.py transcriptome -rg references/hg38.fa -rt /references/gencode.v39.transcripts.fa -e expression_simulation.tsv -c /characterization_output/sample_test -o /simulated_output/sample_test -k 6 -b guppy -r cDNA_1D --no_model_ir --fastq -t 1
2022-04-29 12:03:29: Read in reference
2022-04-29 12:03:31: Read in expression profile
2022-04-29 12:03:31: Read error profile
2022-04-29 12:03:31: Read KDF of unaligned reads
2022-04-29 12:03:31: Read KDF of aligned reads
2022-04-29 12:03:32: Read chimeric simulation information
2022-04-29 12:03:32: Start simulation of aligned reads
2022-04-29 12:03:32: Number of reads simulated >> 1^M
Hi there,
I am trying to simulate nanopore amplicon reads (~6.5 kbp longest amplicon, ~37 kb total sequence) but the simulation part seems to hang and never proceed. I was able to train an error profile:
But when I run the simulation, I get preliminary files made and then the simulation just hangs for hours. How long should I expect a simulation of ~300,000 reads to take?
This is the command I am using for simulation. I get no errors when running, it just stops doing anything after the log given.
Any help would be greatly appreciated.
Jon