aastha-batta opened this issue 1 year ago
Can you please report the exact command? And are you using the code from the master branch?
This is the command we are using:

```
simulator.py metagenome -gl metagenome_list.tsv -a abundance.tsv -dl dna_type.tsv -c /home/devuser/omega_id/nanosim/NanoSim/pre-trained_models/metagenome_ERR3152364_Even/training -o bacteria --fastq -b guppy -t 10 --abun_var -0.5 0.5 --chimeric --max_len 1700
```

And yes, we used it from the master branch.
As a test, you could try taking out these options: `--abun_var -0.5 0.5` and `--max_len 1700`.
I ran the code after removing `--max_len 1700`, and the process runs fine. Of the nine species DNA sequences provided as input, eight are no longer than 1700 bp. The remaining one is a whole-genome sequence in a single FASTA, ~4,500,000 bp long. When I removed this single one (again without `--max_len 1700`), the process does not complete, as mentioned in the original issue. What does this mean? How can I use NanoSim to simulate small reads (<1700 bp)?
Hi @aastha-batta, do you get any sort of error messages in those cases, or is the script stuck with no output? I suspect this might also be related to issue #185.
We will look into it. Thanks for reporting this.
Not only this: if we make any small change, the process gets stuck and does not run successfully until we restart the terminal.
Would you also kindly provide more information on this, @aastha-batta? What do you mean by making "any small change"?
Hello @SaberHQ. Basically, I tried changing the number of entries in the genome list and the subsequent lists to see whether simulator.py generates results successfully every time I run it. It did not. I have yet to try this with a self-trained model. I was not able to pinpoint the exact reason it gets stuck, especially when we use the `--max_reads` flag.
I went through issue #185 and went through the code again after reading the comment posted there. If the length variable mentioned in `extract_reads`, as described in that comment, is indeed about `--max_len` and the sequence lengths passed in the inputs (genome_list), then we are using `--max_len 1700`, which is larger than the sequence lengths present in the genome_list I passed in my input, and I think `extract_read` is handling that condition well. I also ran the command mentioned by @aastha-batta in my previous comments:

```
simulator.py metagenome -gl metagenome_list.tsv -a abundance.tsv -dl dna_type.tsv -c /home/devuser/omega_id/nanosim/NanoSim/pre-trained_models/metagenome_ERR3152364_Even/training -o bacteria --fastq -b guppy -t 10 --abun_var -0.5 0.5 --chimeric --max_len 1700
```

If I run this in a loop (5 times), it gives a success rate of 40%, i.e. it completes successfully twice and the rest of the times I had to interrupt it. This is unusual behavior, as the process should not stop so abruptly if it completes once or twice. Also, there is a whole-genome sequence in the list that is ~4,500,000 bp long, and if we remove it the simulation does not complete at all, even if I also remove the `--max_len` flag. I previously theorized that this anomaly comes up because we are using an old pre-trained model; might there be some compatibility issue with it? I timed some of the functions in the script, and there is no defined step where it gets stuck, so I really don't know what is happening. The problem reported in issue #185 might be a factor, but I don't think it is related to my issue. Kindly let me know if my hunch is correct (for my problem), or if there is something else I need to check, or if there is any function in simulator.py I should run separately to record its behavior. Thank you.
Also, @SaberHQ, I was using the bioconda nanosim package from the conda channel, and since it has not been updated with the most recent commit to simulator.py, it is using the older version of simulator.py and generating error profiles incorrectly.
I have been trying to understand when the simulator.py script gets stuck in an infinite loop in metagenome mode.
Let's say that I have genomes with a max length of 1000 and I specify `--max_len 1500`. In the function `extract_read`, if the length given to the function is >1000, the code:

```python
if not s or length > max(seq_len[s].values()):
    s = random.choice(list(seq_len.keys()))
```

will change the species (the variable `s`), but it will never find a species with the desired length.
The other case is the one I reported in #185, when `length == max(seq_len[s].values())`.
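The resampling logic above can be reproduced in a small standalone sketch (the species names, lengths, and the `pick_species` wrapper are invented for illustration; the real loop in `extract_read` has no retry limit, which is why it hangs):

```python
import random

# Assumed data: seq_len maps species -> {sequence name: length},
# mirroring the shape used in simulator.py.
seq_len = {
    "species_a": {"chr1": 800},
    "species_b": {"chr1": 1000},
}

def pick_species(length, max_tries=10_000):
    """Re-pick a species until one can fit a read of `length`.
    Unlike the original loop, bail out after max_tries instead of hanging."""
    s = None
    for _ in range(max_tries):
        if not s or length > max(seq_len[s].values()):
            s = random.choice(list(seq_len.keys()))
        else:
            return s
    return None  # no species is long enough: the original code spins forever here

print(pick_species(500))   # some species: both genomes can fit 500 bp
print(pick_species(1500))  # None: 1500 > 1000, the longest genome
```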
Now let's say that I have genomes with a max length of 1000 and I specify `--max_len 999`. In theory this shouldn't get stuck, but when mutations are added in the following code, the list of lengths (the variable `seg_length_list` in this case) gets modified, and the lengths can now be >1000, causing the hang.
```python
for each_ref in ref_length_list:
    middle, middle_ref, error_dict, error_count = \
        error_list(each_ref, match_markov_model, match_ht_list, error_par, trans_error_pr, fastq)
    total += middle
    seg_length_list.append(middle_ref)
    seg_error_dict_list.append(error_dict)
    seg_error_count_list.append(error_count)
```
In my tests, if I specify a `--max_len` 100 less than the minimum genome length, it never gets stuck in an infinite loop.
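The overflow can be sketched with invented numbers (the real bookkeeping inside `error_list` is more involved; `mock_error_list` and the deletion counts below are hypothetical stand-ins, not NanoSim's actual model):

```python
GENOME_MAX = 1000  # longest genome in this example

def mock_error_list(requested_len, deletion_bases):
    """Illustrative stand-in for error_list: deletion errors consume extra
    reference bases, so the reference span a read needs (middle_ref) can
    exceed the requested length."""
    middle = requested_len                       # final read length (simplified)
    middle_ref = requested_len + deletion_bases  # reference span actually needed
    return middle, middle_ref

# Request just under the genome maximum, as with --max_len 999:
seg_length_list = []
for each_ref, dels in [(999, 0), (999, 12), (999, 35)]:
    middle, middle_ref = mock_error_list(each_ref, dels)
    seg_length_list.append(middle_ref)

print(seg_length_list)                               # [999, 1011, 1034]
print(any(l > GENOME_MAX for l in seg_length_list))  # True: extract_read can hang
```

Once any adjusted length exceeds every genome's length, the species-resampling loop described earlier can never succeed.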
I am experiencing similar issues. I use the metagenome mode to simulate 24 mixed-composition samples (norovirus). I input amplicons of 570 bp in a log-normal distribution of abundance, with two to ten genotypes per sample. I have tried running this with and without maximum and minimum lengths, with and without `--abun_var`, and with both the even and log metagenome training files. I have attached sample input files in the zip folder. As all my input sequences are 570/579 bp, is this creating the hang issue?
This is what I run on an HPC cluster:

```
cd simulated_sequencing
module load nanosim/3.1.0
source activate nanosim

simulator.py metagenome \
    -gl genome_list.tsv \
    -a abun.tsv \
    -dl dna_type.tsv \
    --chimeric \
    -b guppy \
    --fastq \
    --seed 10032 \
    -t 10 \
    -c NanoSim/pre-trained_models/metagenome_ERR3152364_Even/training

conda deactivate
module unload nanosim/3.1.0
conda deactivate
```

The simulation begins but hangs and never completes. Sometimes nothing is written to any of the files, depending on the combination of conditions (lengths, training files). In the example provided, a few error profiles and fastq outputs are generated:
Thanks for your help with this matter. I would really like to apply NanoSim to this case, so any advice is appreciated.
Thank you @ahfitzpa and everyone else for reporting this. I am sorry to hear that you had problems running the tool.
My colleagues and I will look at it and fix it. I hope you can expect to hear back from us in 3 weeks.
@kmnip @cheny19 @theottlo
> if I specify --max_len of 100 less than the minimum genome length it never gets stuck in an infinite loop.

This worked because the ~100 nt buffer accounted for the insertion error bases (and possibly head/tail regions).
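As a hedged illustration of that workaround, one could pre-check the requested `--max_len` against the shortest input genome (the `check_max_len` helper and the fixed 100 nt buffer are assumptions for this sketch, not part of NanoSim):

```python
# Hypothetical pre-flight check (not part of NanoSim): warn when --max_len
# leaves less than ~100 nt of headroom below the shortest input genome,
# since that headroom absorbs error-adjustment bases and head/tail regions.
def check_max_len(max_len, genome_lengths, buffer=100):
    shortest = min(genome_lengths)
    if max_len > shortest - buffer:
        return f"risky: --max_len {max_len} leaves <{buffer} nt below shortest genome ({shortest} bp)"
    return "ok"

print(check_max_len(999, [1000, 4_500_000]))  # risky: may hang
print(check_max_len(900, [1000, 4_500_000]))  # ok
```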
I think @danpal96 is right about the buggy parts of the code.
Hi @danpal96, @ahfitzpa , @aastha-batta , and @rameezmj
I have created pull request #189. Can you please check whether my changes would resolve the hanging issues on your end?
@kmnip I ran some small tests and it didn't get stuck; I will do more tests when I have more time. Sorry for the late response, for some reason GitHub is not notifying me of the mentions.
@danpal96 No worries, we appreciate your help!
Hi @kmnip Yes, we will look into the changes. I am sorry I was unable to reply to this. Will let you know asap.
Thanks, @kmnip. This pull request has also been tested and merged into the master branch.
OK, so we are using a pre-trained model for metagenome simulations. We have a standard read count, and two samples are generated; however, when we change the number of species that should be included in a sample, the process hangs in the middle and does not finish. The read counts are 100 and 1000, so they are not very large for NanoSim to simulate, and I do not understand why this is happening. Not only this: if we make any small change, the process gets stuck and does not run successfully until we restart the terminal.