Closed XavierGrand closed 2 years ago
Hello @XavierGrand Thanks for your interest in using NanoSim.
As you guessed correctly, it doesn't look normal as it should only take seconds for NanoSim to simulate 100 reads. I do see that you trained your profiles with "--no_model_fit" option. Although that is useful in some cases where we are troubleshooting the pipeline or we are only interested in some specific features to be trained, it might be the reason why the simulation is not working. That is because with that option enabled, NanoSim skips the model training stage.
What I suggest you first is to train NanoSim without that option and then use those profiles for simulation.
Since I do not have access for reads and references you are using, I won't be able to replicate the same issue. What I suggest is that you use one of the pre trained models provided with NanoSim release and see if you can make it run with those trained models.
As for expression profile file, here is more information from the README file:
The expression profile is a tsv file containing expression levels of each isoform to be simulated. Users can use the output of
quantify
mode as template for modify or use the following format for constructing a new expression profile.The first row is header row specifying the format of the file.
target_id
is the id of each transcript. Thetpm
column is used for simulating, while theest_count
is just a placeholder to be compatible with the output of thequantify
mode and other quantification tools such as Salmon.The following rows are entries for each transcript isoform and the id of which needs to exist in the provided reference transcriptome. The id should start with
ENS
.Example:
target_id est_counts tpm ENST00000222247.9 2307.2992 3145.3749 ENST00000274065.8 2641.9534 3601.5848 ENST00000400259.5 623.6130 850.1268 ENST00000344548.7 1828.3466 2492.4533 ENST00000484610.5 766.3528 1044.7137
Hope this helps. Please try my suggestions and let me know how it works. Feel free to ask more questions.
Thanks @SaberHQ,
I took into account your comments and succeed in using pre-trained mus musculus profile and simulate reads for it. Then, when I tried to simulate reads on my own species, I obtain thiese messages :
running the code with following parameters:
infile fastq/filt390bp_porechoped_barcode01.fastq ref_g genome.fasta ref_t transcriptome.fasta annot aligner minimap2 g_alnm t_alnm prefix Nanosim_profile num_threads 4 model_fit True intron_retention False 2022-03-28 15:23:03: Read pre-process and unaligned reads analysis 2022-03-28 15:23:08: Alignment with minimap2 to reference transcriptome [M::mm_idx_gen::0.0022.17] collected minimizers [M::mm_idx_gen::0.0032.29] sorted minimizers [M::main::0.0032.28] loaded/built the index for 30 target sequence(s) [M::mm_mapopt_update::0.0032.25] mid_occ = 53 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 30 [M::mm_idx_stat::0.0032.22] distinct minimizers: 665 (6.47% are singletons); average occurrences: 17.065; average spacing: 5.389 [M::worker_pipeline::159.2923.86] mapped 307337 sequences [M::worker_pipeline::269.9983.83] mapped 245187 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 --cs -ax map-ont -t 4 [...] Nanosim_profile_processed.fasta [M::main] Real time: 269.998 sec; CPU: 1034.817 sec; Peak RSS: 2.099 GB 2022-03-28 15:27:38: Alignment with minimap2 to reference genome [M::mm_idx_gen::0.0012.15] collected minimizers [M::mm_idx_gen::0.0012.37] sorted minimizers [M::main::0.0022.35] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.0022.28] mid_occ = 3 [M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.0022.20] distinct minimizers: 1089 (92.56% are singletons); average occurrences: 1.074; average spacing: 2.962 [M::worker_pipeline::93.2673.90] mapped 307337 sequences [M::worker_pipeline::163.6183.84] mapped 245187 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 --cs -ax splice -t 4 [...] /Nanosim_profile_processed.fasta [M::main] Real time: 163.619 sec; CPU: 627.523 sec; Peak RSS: 1.195 GB 2022-03-28 15:30:22: Processing transcriptome alignment file: sam 2022-03-28 15:30:55: Processing genome alignment file: sam 2022-03-28 15:31:22: Aligned reads analysis 2022-03-28 15:31:27: Computing KDEds processed >> 256001 2022-03-28 15:31:27: Computing 2D KDE for transcriptome ref length 2022-03-28 15:31:27: Computing KDE for aligned region 2022-03-28 15:31:27: Computing KDE for aligned reads 2022-03-28 15:31:27: Computing KDE for unaligned length 2022-03-28 15:31:27: Computing KDE for ht ratio 2022-03-28 15:31:27: Unaligned reads analysis 2022-03-28 15:31:27: match and error models 2022-03-28 15:32:51: Model fitting 2022-03-28 15:32:53: Mismatch fitting start Mismatch parameters: 0.8990327193185614 0.9160827439065249 0.09226311383504293 Residual 4.688513626083246e-05 2022-03-28 15:34:27: Mismatch fitting done 2022-03-28 15:34:27: Insertion fitting start WARNING! Insertion parameters may not be optimal! 0.8319785252459806 0.9713933934740555 0.0742370502091307 0.9928579164236206 Residual 0.0006729881125697723 2022-03-28 15:42:35: Insertion fitting done 2022-03-28 15:42:35: Deletion fitting start WARNING! Deletion parameters may not be optimal! 0.9075676339458383 1.049772636478705 0.38178101449828694 0.8497803391445375 Residual 0.0004642599842717976 2022-03-28 15:46:28: Deletion fitting done 2022-03-28 15:46:28: Finished!
running the code with following parameters:
ref_g genome.fasta
ref_t transcriptome.fasta
exp expression_profile.tsv
model_prefix Nanosim_profile
out Nanosim/
number [20000]
perfect False
kmer_bias 0
basecaller guppy
read_type dRNA
model_ir False
dna_type transcriptome
strandness None
max_len inf
min_len 50
uracil False
polya None
fastq True
num_threads 4
[...]
2022-03-28 15:52:23: Read in reference
2022-03-28 15:52:23: Read in expression profile
2022-03-28 15:52:23: Read error profile
2022-03-28 15:52:23: Read KDF of unaligned reads
Traceback (most recent call last):
File "/home/xavier/NanoSim/src/simulator.py", line 2339, in
Then, after many tests and dependencies check, I'm stuck with this problem... Do you have any idea to solve it ?
Thanks in advance !
Xavier.
I observed something else, the Nanosim_profile_genome_alnm.bam and Nanosim_profile_transcriptome_alnm.bam files are empty, wheras the correponding sam file are not... Is it possible that is a problem ? With samtools for example ?
Hi @XavierGrand ,
As for your first error, it is a known issue reported here: #131
I also had the same problem a couple weeks ago and I what I would suggest is to either upgrade your sklearn version as @kmnip mentioned at #131 or downgrade scikit learn version to 0.22.1. You may get more help on different ways to solve this issue with scikit learn here:
https://stackoverflow.com/questions/60145652/no-module-named-sklearn-neighbors-base
As for bam/sam alignment issue, may I ask which version of NanoSim you are using? Would you please clone the latest commit and rerun your analysis. From the log file you sent above:
2022-03-28 15:30:22: Processing transcriptome alignment file: sam 2022-03-28 15:30:55: Processing genome alignment file: sam 2022-03-28 15:31:22: Aligned reads analysis
I do see that NanoSim processed sam alignment files but not bam files. Please note that with the recent pull request #150 , NanoSim creates BAM instead of SAM in transcriptome mode (undo changes in https://github.com/bcgsc/NanoSim/commit/3d046e34740965abb8bb6bba14ec31a8d9f239cc)
So as for conclusion, I suggest you to clone the latest commit and use that for your analysis. We are going to release a new version incorporating all these recent updates soon. Feel free to ask more questions.
Cheers.
Hi @SaberHQ,
Following your last comments, I checked dependencies versions (conda_list.txt), and bam files are processed in read_analysis step (line 41 in read_analysis_ann.log). Something is still not clear at lines 59 and 60 (read_analysis_ann.log). read_analysis_ann.log conda_list.txt
59 /home/xavier/miniconda3/envs/nanosim/bin/mixed_model.py:24: RuntimeWarning: overflow encountered in power 60 wei_cdf = 1 - np.exp(-1 * np.power(x / l, k))
What does it mean ?
Then the simulation step is still stuck... What information do you need to help me ?
If the problem, in your opinion, comes from dependencies, is it possible for you to provide a docker or singularity container ?
Thanks for your patience and your help.
Xav.
@XavierGrand Try this:
docker run quay.io/biocontainers/nanosim
Don't worry about the RuntimeWarning
from numpy. It wasn't an error, FYI: https://www.statology.org/runtimewarning-overflow-encountered-in-exp/
From your log file, it looks like the read_analysis.py
command completed successfully.
Ok guys,
I figured out the installation and dependencies issues. I tried a genomic simulation, that worked ! I simulated reads.
But, the transcriptome reads simulation is still hanged, "Start simulation of aligned reads", never ending... With SAME input files... Any idea ?
Thanks.
Xav.
Which version are you using? Are you using the code from the master branch? And, can you please post your log file? Thanks!
Hi @kmnip,
My NanoSim version : nanosim 3.0.2 hdfd78af_0 bioconda
I attached my log files, genome and transcriptome simulation, and conda list.
I hope you'll be able to help me !
Xav.
read_analysis_genome.log simulator_genome_100.log
Hi @XavierGrand
There are two things that you can try. Please do repeat training and simulation for each.
cd /your/new/directory/for/nanosim
git clone https://github.com/bcgsc/NanoSim.git
You can activate your existing environment for NanoSim, but specify the path to the master branch's read_analysis.py
and simulator.py
when you run them.
conda activate nanosim
cd /your/working/directory
/your/new/directory/for/nanosim/NanoSim/read_analysis.py transcriptome -i barcode02.fastq -rg genome.fasta -rt transcriptome.fasta -o Profile --no_intron_retention -t 4
/your/new/directory/for/nanosim/NanoSim/simulator.py transcriptome -rt transcriptome.fasta -rg genome.fasta -e expression_profile.tsv -c Profile -o Training/ -n 100 -min 350 -max 3450 --fastq -b guppy -r dRNA --no_model_ir -k 0 -t 4
2. Update scikit-learn to latest version.
conda activate nanosim conda update scikit-learn
Repeat training and simulation again.
I think there may be a potential bug here where only ENSEMBL transcripts (i.e. those with names beginning with ENS
) is accepted:
https://github.com/bcgsc/NanoSim/blob/b522a0b4b9f49e40b41128edca429ef1540459b1/src/simulator.py#L392
@XavierGrand, Can you please report whether any transcript names in your expression_profile.tsv
start with "ENS"?
wc -l expression_profile.tsv
grep '^ENS' expression_profile.tsv | wc -l
If the grep
command returns 0
, then this confirms that there is indeed a bug as I mentioned in the previous comment.
Problem solved. It was the "ENS" starting names. I don't understand why only ENSEMBL transcripts are accepted.
Thank you all very much for your help !
Xav.
Hi NanoSim team !
Trying to simulate HBV nanopore reads, I performed the read analysis step (almost) without troubles with following parameters:
read_analysis.py transcriptome -i /home/xavier/Bureau/Simulation_Test/Ref/filt390bp_porechoped_barcode02.fastq \ -rg /home/xavier/Bureau/Simulation_Test/Ref/genome.fa \ -rt /home/xavier/Bureau/Simulation_Test/Ref/transcriptome.fasta \ -o /home/xavier/Bureau/Simulation_Test/Profile/Training \ --no_intron_retention \ --no_model_fit \ -t 4
running the code with following parameters:
infile /home/xavier/Bureau/Simulation_Test/Ref/filt390bp_porechoped_barcode02.fastq ref_g /home/xavier/Bureau/Simulation_Test/Ref/genome.fa ref_t /home/xavier/Bureau/Simulation_Test/Ref/transcriptome.fasta annot aligner minimap2 g_alnm t_alnm prefix /home/xavier/Bureau/Simulation_Test/Profile/Training num_threads 4 model_fit False intron_retention False [...] 2022-03-23 12:48:46: Finished!
Then, I tried the simulation step, with the following parameters, and I am stuck few hours later still "simulating"... That is not consistent with your publication.
simulator.py transcriptome -rt /home/xavier/Bureau/Simulation_Test/Ref/transcriptome.fasta \ -rg /home/xavier/Bureau/Simulation_Test/Ref/genome.fa \ -e /home/xavier/Bureau/Simulation_Test/Ref/expression_profile.tsv \ -c /home/xavier/Bureau/Simulation_Test/Profile/Training \ -o /home/xavier/Bureau/Simulation_Test/Training/ \ -n 100 \ -min 350 \ --fastq \ -b guppy \ -r dRNA \ --no_model_ir \ -k 0 \ -t 4
running the code with following parameters:
ref_g /home/xavier/Bureau/Simulation_Test/Ref/genome.fa ref_t /home/xavier/Bureau/Simulation_Test/Ref/transcriptome.fasta exp /home/xavier/Bureau/Simulation_Test/Ref/expression_profile.tsv model_prefix /home/xavier/Bureau/Simulation_Test/Profile/Training out /home/xavier/Bureau/Simulation_Test/Training/ number [100] perfect False kmer_bias 0 basecaller guppy read_type dRNA model_ir False dna_type transcriptome strandness None max_len inf min_len 350 uracil False polya None fastq True num_threads 4 [...] 2022-03-23 12:58:16: Start simulation of aligned reads
And it's still running.
Do you see something I did wrong ?
About the expression_profile.tsv file, does the est_counts values need to be real ? in this case, how is it possible to tune the desired TPMs ?
Can you help me please ?
Thanks, Xav.