bcgsc / NanoSim

Nanopore sequence read simulator
Other
247 stars 57 forks source link

Nanosim hangs in the middle #184

Open aastha-batta opened 1 year ago

aastha-batta commented 1 year ago

OK, So, We are using a pretrained model for metagenome simulations, We have a standard read count and 2 samples are getting generated, however when we are changing the no. of species that should be included in the sample the process hangs in the middle and does not finish. the read count is 100 and 1000 so it's not very large for NanoSim to run simulations, I am not understanding why is this happening. Not only this if we are making any small change the process gets stuck and until we restart the terminal it does not run successfully.

kmnip commented 1 year ago

Can you please report the exact command? And are you using the code from the master branch?

aastha-batta commented 1 year ago

simulator.py metagenome -gl metagenome_list.tsv -a abundance.tsv -dl dna_type.tsv -c /home/devuser/omega_id/nanosim/NanoSim/pre-trained_models/metagenome_ERR3152364_Even/training -o bacteria --fastq -b guppy -t 10 --abun_var -0.5 0.5 --chimeric --max_len 1700 this is the command we are using.

aastha-batta commented 1 year ago

and yes we used it from master branch

kmnip commented 1 year ago

As a test, you could try taking out these options: --abun_var -0.5 0.5 and --max_len 1700

rameezmj commented 1 year ago

I ran the code removing --max_len 1700. The process is running fine. Out of the nine species DNA sequence provided as input, eight are not longer than 1700bp. The remaining one is whole genome sequence in single fasta which is long ~4500000bp. When I removed this single one (again without --max_len 1700), the process does not complete as mentioned in the original issue. What does it mean? How can I use nanosim to simulate small reads <1700bp?

SaberHQ commented 1 year ago

Hi @aastha-batta Do you get any sort of error messages in those cases or is the script stuck with no output? I suspect this might be also related to issue #185

We will look into it. Thanks for reporting this.

SaberHQ commented 1 year ago

Not only this if we are making any small change the process gets stuck and until we restart the terminal it does not run successfully.

Would you also kindly provide more information on this @aastha-batta ? What do you mean by making "any small change"?

aastha-batta commented 1 year ago

Hello @SaberHQ OK, so basically i tried changing the number of entries in the genome list and the subsequent lists to see if the script simulator.py is able to generate the results every time i run it successfully. It was not happening. I am yet to try this with a self trained model. I was not able to point out the exact reason for it to get stuck, especially when we are using the --max_reads flag.
I went through the issue #185 and i went through the code again after reading the comment posted. If the length variable mentioned in extract_reads and as mentioned in the comment is indeed about --max_len and sequence length passed in the inputs (genome_list) then we are using --max_len 1700 which is larger than the sequence lengths present in the genome_list i passed in my input and i think extract_read is handling that condition well. And i ran the command mentioned by @aastha-batta me in my previous comments simulator.py metagenome -gl metagenome_list.tsv -a abundance.tsv -dl dna_type.tsv -c /home/devuser/omega_id/nanosim/NanoSim/pre-trained_models/metagenome_ERR3152364_Even/training -o bacteria --fastq -b guppy -t 10 --abun_var -0.5 0.5 --chimeric --max_len 1700 , if i run this in a loop (5 times) it's giving a success rate of 40% i.e it completes successfully twice and the rest of the times i had to interrupt. This is an unusual behavior as the process should not stop so abruptly if it is getting completed once or twice. Also, there is a whole genome sequence available in the list which is ~4500000bp long and if we remove this the simulation is not getting completed at all even if i remove --max_len flag. I previously theorized that since we are using an old pre-trained model this anomaly is coming up, there might be some compatibility issues with this model?.. i timed some of the functions in the script there is no defined step where it was getting stuck so i really don't know what is happening. The problem reported in issue #185 might be something but i don't think it is related to my issue. Kindly let me know if my hunch is correct (for my problem) or if there is something else i need to check or if there is any function in simulator.py i should check and run separately to record it's behavior. Thank You.

aastha-batta commented 1 year ago

Also, @SaberHQ I was using the conda channel - bioconda nanosim package and since it has not been updated for the most recent commit in simulator.py it is using the older version of simulator.py and is generating error profiles incorrectly.

danpal96 commented 1 year ago

I have been trying to undestand the simulator.py script in metagenome mode and when it gets stuck in an infinite loop.

Let's say that I have genomes with max length of 1000 and I specify --max_len 1500 . In the function extract_read , if the length given to the function is >1000, the code:

if not s or length > max(seq_len[s].values()):
    s = random.choice(list(seq_len.keys()))

will change the species (variable s) but it will never find a species with the desired length.

Other case is the one I reported in #185, when length == max(seq_len[s].values()).

Now let's say that I have genomes with max length of 1000 and I specify --max_len 999 . In theory this shouldn't get stuck but when adding mutations in the next code the list of lengths (variable seg_length_list in this case) gets modified and the lengths could now be >1000 and get stuck.

for each_ref in ref_length_list:
    middle, middle_ref, error_dict, error_count = \
        error_list(each_ref, match_markov_model, match_ht_list, error_par, trans_error_pr, fastq)
    total += middle
    seg_length_list.append(middle_ref)
    seg_error_dict_list.append(error_dict)
    seg_error_count_list.append(error_count)

In my tests if I specify --max_len of 100 less than the minimum genome length it never gets stuck in an infinite loop.

ahfitzpa commented 1 year ago

I am experiencing similar issues. I use the metagenome mode to simulate 24 mixed composition samples (norovirus). I input amplicons of 570 bp in a log normal distribution of abundance, with two to ten genotypes per sample. I have uploaded an example of my input files. I have tried running this with and without max and minimum lengths, with and without --abun_var, with the even and log metagenome training files. I have attached the sample input files in the zip folder. As all my input sequences are 570/579 bp, is this creating the hang issue?

This is what I run on a HPC cluster: `cd simulated_sequencing

module load nanosim/3.1.0 source activate nanosim

simulator.py metagenome \ -gl genome_list.tsv \ -a abun.tsv \ -dl dna_type.tsv \ --chimeric \ -b guppy \ --fastq \ --seed 10032 \ -t 10 \ -c NanoSim/pre-trained_models/metagenome_ERR3152364_Even/training

conda deactivate module unload nanosim/3.1.0

conda deactivate ` The simulation begins but hangs and never completes. Sometimes nothing is written to any of the files depending on the combination of conditions (lengths, training files). In the example provided I have a few error profiles and fastq outputs generated: image

example_simulation.zip

Thanks for your help with this matter. I would really like to apply NanoSim to this case. so any advice is appreciated.

SaberHQ commented 1 year ago

Thank you @ahfitzpa and everyone else for reporting this. I am sorry to hear that you had problems running the tool.

My colleagues and I will look at it and fix it. I hope you can expect to hear back from us in 3 weeks.

@kmnip @cheny19 @theottlo

kmnip commented 1 year ago

if I specify --max_len of 100 less than the minimum genome length it never gets stuck in an infinite loop.

This worked because the ~100 nt buffer accounted for the insertion error bases (and possibly head/tail regions)

I think @danpal96 is right about the buggy parts of the code.

kmnip commented 1 year ago

Hi @danpal96, @ahfitzpa , @aastha-batta , and @rameezmj

I have created pull request #189. Can you please check whether my changes would resolve the hanging issues on your end?

danpal96 commented 1 year ago

@kmnip I made some small tests and it didnt get stuck, I will do more tests when I have more time. Sorry for the late response for some reason github is not notifying of the mentions

kmnip commented 1 year ago

@danpal96 No worries, we appreciate your help!

aastha-batta commented 1 year ago

Hi @kmnip Yes, we will look into the changes. I am sorry I was unable to reply to this. Will let you know asap.

SaberHQ commented 1 year ago

Hi @danpal96, @ahfitzpa , @aastha-batta , and @rameezmj

I have created pull request #189. Can you please check whether my changes would resolve the hanging issues on your end?

Thanks, @kmnip. This pull request has also been tested and merged into the master branch.