bcgsc / NanoSim

Nanopore sequence read simulator
Other
217 stars 51 forks source link

Reads simulation stuck in an infinite loop or something... #156

Closed XavierGrand closed 2 years ago

XavierGrand commented 2 years ago

Hi NanoSim team !

Trying to simulate HBV nanopore reads, I performed the read analysis step (almost) without troubles with following parameters:

read_analysis.py transcriptome -i /home/xavier/Bureau/Simulation_Test/Ref/filt390bp_porechoped_barcode02.fastq \ -rg /home/xavier/Bureau/Simulation_Test/Ref/genome.fa \ -rt /home/xavier/Bureau/Simulation_Test/Ref/transcriptome.fasta \ -o /home/xavier/Bureau/Simulation_Test/Profile/Training \ --no_intron_retention \ --no_model_fit \ -t 4

running the code with following parameters:

infile /home/xavier/Bureau/Simulation_Test/Ref/filt390bp_porechoped_barcode02.fastq ref_g /home/xavier/Bureau/Simulation_Test/Ref/genome.fa ref_t /home/xavier/Bureau/Simulation_Test/Ref/transcriptome.fasta annot aligner minimap2 g_alnm t_alnm prefix /home/xavier/Bureau/Simulation_Test/Profile/Training num_threads 4 model_fit False intron_retention False [...] 2022-03-23 12:48:46: Finished!

Then, I tried the simulation step, with the following parameters, and I am stuck few hours later still "simulating"... That is not consistent with your publication.

simulator.py transcriptome -rt /home/xavier/Bureau/Simulation_Test/Ref/transcriptome.fasta \ -rg /home/xavier/Bureau/Simulation_Test/Ref/genome.fa \ -e /home/xavier/Bureau/Simulation_Test/Ref/expression_profile.tsv \ -c /home/xavier/Bureau/Simulation_Test/Profile/Training \ -o /home/xavier/Bureau/Simulation_Test/Training/ \ -n 100 \ -min 350 \ --fastq \ -b guppy \ -r dRNA \ --no_model_ir \ -k 0 \ -t 4

running the code with following parameters:

ref_g /home/xavier/Bureau/Simulation_Test/Ref/genome.fa ref_t /home/xavier/Bureau/Simulation_Test/Ref/transcriptome.fasta exp /home/xavier/Bureau/Simulation_Test/Ref/expression_profile.tsv model_prefix /home/xavier/Bureau/Simulation_Test/Profile/Training out /home/xavier/Bureau/Simulation_Test/Training/ number [100] perfect False kmer_bias 0 basecaller guppy read_type dRNA model_ir False dna_type transcriptome strandness None max_len inf min_len 350 uracil False polya None fastq True num_threads 4 [...] 2022-03-23 12:58:16: Start simulation of aligned reads

And it's still running.

Do you see something I did wrong ?

About the expression_profile.tsv file, does the est_counts values need to be real ? in this case, how is it possible to tune the desired TPMs ?

Can you help me please ?

Thanks, Xav.

SaberHQ commented 2 years ago

Hello @XavierGrand Thanks for your interest in using NanoSim.

As you guessed correctly, it doesn't look normal as it should only take seconds for NanoSim to simulate 100 reads. I do see that you trained your profiles with "--no_model_fit" option. Although that is useful in some cases where we are troubleshooting the pipeline or we are only interested in some specific features to be trained, it might be the reason why the simulation is not working. That is because with that option enabled, NanoSim skips the model training stage.

What I suggest you first is to train NanoSim without that option and then use those profiles for simulation.

Since I do not have access for reads and references you are using, I won't be able to replicate the same issue. What I suggest is that you use one of the pre trained models provided with NanoSim release and see if you can make it run with those trained models.

As for expression profile file, here is more information from the README file:

The expression profile is a tsv file containing expression levels of each isoform to be simulated. Users can use the output of quantify mode as template for modify or use the following format for constructing a new expression profile.

The first row is header row specifying the format of the file. target_id is the id of each transcript. The tpm column is used for simulating, while the est_count is just a placeholder to be compatible with the output of the quantify mode and other quantification tools such as Salmon.

The following rows are entries for each transcript isoform and the id of which needs to exist in the provided reference transcriptome. The id should start with ENS.

Example:

target_id est_counts tpm
ENST00000222247.9 2307.2992 3145.3749
ENST00000274065.8 2641.9534 3601.5848
ENST00000400259.5 623.6130 850.1268
ENST00000344548.7 1828.3466 2492.4533
ENST00000484610.5 766.3528 1044.7137

Hope this helps. Please try my suggestions and let me know how it works. Feel free to ask more questions.

XavierGrand commented 2 years ago

Thanks @SaberHQ,

I took into account your comments and succeed in using pre-trained mus musculus profile and simulate reads for it. Then, when I tried to simulate reads on my own species, I obtain thiese messages :

Read Analysis:

running the code with following parameters:

infile fastq/filt390bp_porechoped_barcode01.fastq ref_g genome.fasta ref_t transcriptome.fasta annot aligner minimap2 g_alnm t_alnm prefix Nanosim_profile num_threads 4 model_fit True intron_retention False 2022-03-28 15:23:03: Read pre-process and unaligned reads analysis 2022-03-28 15:23:08: Alignment with minimap2 to reference transcriptome [M::mm_idx_gen::0.0022.17] collected minimizers [M::mm_idx_gen::0.0032.29] sorted minimizers [M::main::0.0032.28] loaded/built the index for 30 target sequence(s) [M::mm_mapopt_update::0.0032.25] mid_occ = 53 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 30 [M::mm_idx_stat::0.0032.22] distinct minimizers: 665 (6.47% are singletons); average occurrences: 17.065; average spacing: 5.389 [M::worker_pipeline::159.2923.86] mapped 307337 sequences [M::worker_pipeline::269.9983.83] mapped 245187 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 --cs -ax map-ont -t 4 [...] Nanosim_profile_processed.fasta [M::main] Real time: 269.998 sec; CPU: 1034.817 sec; Peak RSS: 2.099 GB 2022-03-28 15:27:38: Alignment with minimap2 to reference genome [M::mm_idx_gen::0.0012.15] collected minimizers [M::mm_idx_gen::0.0012.37] sorted minimizers [M::main::0.0022.35] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.0022.28] mid_occ = 3 [M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.0022.20] distinct minimizers: 1089 (92.56% are singletons); average occurrences: 1.074; average spacing: 2.962 [M::worker_pipeline::93.2673.90] mapped 307337 sequences [M::worker_pipeline::163.6183.84] mapped 245187 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 --cs -ax splice -t 4 [...] /Nanosim_profile_processed.fasta [M::main] Real time: 163.619 sec; CPU: 627.523 sec; Peak RSS: 1.195 GB 2022-03-28 15:30:22: Processing transcriptome alignment file: sam 2022-03-28 15:30:55: Processing genome alignment file: sam 2022-03-28 15:31:22: Aligned reads analysis 2022-03-28 15:31:27: Computing KDEds processed >> 256001 2022-03-28 15:31:27: Computing 2D KDE for transcriptome ref length 2022-03-28 15:31:27: Computing KDE for aligned region 2022-03-28 15:31:27: Computing KDE for aligned reads 2022-03-28 15:31:27: Computing KDE for unaligned length 2022-03-28 15:31:27: Computing KDE for ht ratio 2022-03-28 15:31:27: Unaligned reads analysis 2022-03-28 15:31:27: match and error models 2022-03-28 15:32:51: Model fitting 2022-03-28 15:32:53: Mismatch fitting start Mismatch parameters: 0.8990327193185614 0.9160827439065249 0.09226311383504293 Residual 4.688513626083246e-05 2022-03-28 15:34:27: Mismatch fitting done 2022-03-28 15:34:27: Insertion fitting start WARNING! Insertion parameters may not be optimal! 0.8319785252459806 0.9713933934740555 0.0742370502091307 0.9928579164236206 Residual 0.0006729881125697723 2022-03-28 15:42:35: Insertion fitting done 2022-03-28 15:42:35: Deletion fitting start WARNING! Deletion parameters may not be optimal! 0.9075676339458383 1.049772636478705 0.38178101449828694 0.8497803391445375 Residual 0.0004642599842717976 2022-03-28 15:46:28: Deletion fitting done 2022-03-28 15:46:28: Finished!

Read Simulation :

running the code with following parameters:

ref_g genome.fasta ref_t transcriptome.fasta exp expression_profile.tsv model_prefix Nanosim_profile out Nanosim/ number [20000] perfect False kmer_bias 0 basecaller guppy read_type dRNA model_ir False dna_type transcriptome strandness None max_len inf min_len 50 uracil False polya None fastq True num_threads 4 [...] 2022-03-28 15:52:23: Read in reference 2022-03-28 15:52:23: Read in expression profile 2022-03-28 15:52:23: Read error profile 2022-03-28 15:52:23: Read KDF of unaligned reads Traceback (most recent call last): File "/home/xavier/NanoSim/src/simulator.py", line 2339, in main() File "/home/xavier/NanoSim/src/simulator.py", line 2201, in main model_ir=model_ir, polya=polya, exp=exp) File "/home/xavier/NanoSim/src/simulator.py", line 511, in read_profile kde_unaligned = joblib.load(model_prefix + "_unaligned_length.pkl") File "/home/xavier/anaconda3/envs/nanosim/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 605, in load obj = _unpickle(fobj, filename, mmap_mode) File "/home/xavier/anaconda3/envs/nanosim/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 529, in _unpickle obj = unpickler.load() File "/home/xavier/anaconda3/envs/nanosim/lib/python3.7/pickle.py", line 1088, in load dispatchkey[0] File "/home/xavier/anaconda3/envs/nanosim/lib/python3.7/pickle.py", line 1376, in load_global klass = self.find_class(module, name) File "/home/xavier/anaconda3/envs/nanosim/lib/python3.7/pickle.py", line 1426, in find_class import(module, level=0) ModuleNotFoundError: No module named 'sklearn.neighbors._kde'

Then, after many tests and dependencies check, I'm stuck with this problem... Do you have any idea to solve it ?

Thanks in advance !

Xavier.

XavierGrand commented 2 years ago

I observed something else, the Nanosim_profile_genome_alnm.bam and Nanosim_profile_transcriptome_alnm.bam files are empty, wheras the correponding sam file are not... Is it possible that is a problem ? With samtools for example ?

SaberHQ commented 2 years ago

Hi @XavierGrand ,

As for your first error, it is a known issue reported here: #131

I also had the same problem a couple weeks ago and I what I would suggest is to either upgrade your sklearn version as @kmnip mentioned at #131 or downgrade scikit learn version to 0.22.1. You may get more help on different ways to solve this issue with scikit learn here:

https://stackoverflow.com/questions/60145652/no-module-named-sklearn-neighbors-base

As for bam/sam alignment issue, may I ask which version of NanoSim you are using? Would you please clone the latest commit and rerun your analysis. From the log file you sent above:

2022-03-28 15:30:22: Processing transcriptome alignment file: sam 2022-03-28 15:30:55: Processing genome alignment file: sam 2022-03-28 15:31:22: Aligned reads analysis

I do see that NanoSim processed sam alignment files but not bam files. Please note that with the recent pull request #150 , NanoSim creates BAM instead of SAM in transcriptome mode (undo changes in https://github.com/bcgsc/NanoSim/commit/3d046e34740965abb8bb6bba14ec31a8d9f239cc)

So as for conclusion, I suggest you to clone the latest commit and use that for your analysis. We are going to release a new version incorporating all these recent updates soon. Feel free to ask more questions.

Cheers.

XavierGrand commented 2 years ago

Hi @SaberHQ,

Following your last comments, I checked dependencies versions (conda_list.txt), and bam files are processed in read_analysis step (line 41 in read_analysis_ann.log). Something is still not clear at lines 59 and 60 (read_analysis_ann.log). read_analysis_ann.log conda_list.txt

59 /home/xavier/miniconda3/envs/nanosim/bin/mixed_model.py:24: RuntimeWarning: overflow encountered in power 60 wei_cdf = 1 - np.exp(-1 * np.power(x / l, k))

What does it mean ?

Then the simulation step is still stuck... What information do you need to help me ?

If the problem, in your opinion, comes from dependencies, is it possible for you to provide a docker or singularity container ?

Thanks for your patience and your help.

Xav.

kmnip commented 2 years ago

@XavierGrand Try this:

docker run quay.io/biocontainers/nanosim
kmnip commented 2 years ago

Don't worry about the RuntimeWarning from numpy. It wasn't an error, FYI: https://www.statology.org/runtimewarning-overflow-encountered-in-exp/

From your log file, it looks like the read_analysis.py command completed successfully.

XavierGrand commented 2 years ago

Ok guys,

I figured out the installation and dependencies issues. I tried a genomic simulation, that worked ! I simulated reads.

But, the transcriptome reads simulation is still hanged, "Start simulation of aligned reads", never ending... With SAME input files... Any idea ?

Thanks.

Xav.

kmnip commented 2 years ago

Which version are you using? Are you using the code from the master branch? And, can you please post your log file? Thanks!

XavierGrand commented 2 years ago

Hi @kmnip,

My NanoSim version : nanosim 3.0.2 hdfd78af_0 bioconda

I attached my log files, genome and transcriptome simulation, and conda list.

I hope you'll be able to help me !

Xav.

read_analysis_genome.log simulator_genome_100.log

read_analysis_annon.log simulator.log

conda_list.txt

kmnip commented 2 years ago

Hi @XavierGrand

There are two things that you can try. Please do repeat training and simulation for each.

  1. Use the most up-to-date code in the master branch.
    cd /your/new/directory/for/nanosim
    git clone https://github.com/bcgsc/NanoSim.git

    You can activate your existing environment for NanoSim, but specify the path to the master branch's read_analysis.py and simulator.py when you run them.

    
    conda activate nanosim
    cd /your/working/directory

/your/new/directory/for/nanosim/NanoSim/read_analysis.py transcriptome -i barcode02.fastq -rg genome.fasta -rt transcriptome.fasta -o Profile --no_intron_retention -t 4

/your/new/directory/for/nanosim/NanoSim/simulator.py transcriptome -rt transcriptome.fasta -rg genome.fasta -e expression_profile.tsv -c Profile -o Training/ -n 100 -min 350 -max 3450 --fastq -b guppy -r dRNA --no_model_ir -k 0 -t 4


2. Update scikit-learn to latest version.

conda activate nanosim conda update scikit-learn


Repeat training and simulation again.
kmnip commented 2 years ago

I think there may be a potential bug here where only ENSEMBL transcripts (i.e. those with names beginning with ENS) is accepted: https://github.com/bcgsc/NanoSim/blob/b522a0b4b9f49e40b41128edca429ef1540459b1/src/simulator.py#L392

kmnip commented 2 years ago

@XavierGrand, Can you please report whether any transcript names in your expression_profile.tsv start with "ENS"?

wc -l expression_profile.tsv
grep '^ENS' expression_profile.tsv | wc -l

If the grep command returns 0, then this confirms that there is indeed a bug as I mentioned in the previous comment.

XavierGrand commented 2 years ago

Problem solved. It was the "ENS" starting names. I don't understand why only ENSEMBL transcripts are accepted.

Thank you all very much for your help !

Xav.