bcgsc / NanoSim

Nanopore sequence read simulator
Other
217 stars 51 forks source link

Could not retrieve index file for primary.bam #130

Closed eboileau closed 2 years ago

eboileau commented 2 years ago

Description

I wish to use the latest NanoSim v3.0.1 (supports reading .gz sequence files, and bam files), but read_analysis.py does not complete. The primary.bam file is not indexed, an that might be an issue for pysam, but there seem to be more, this might be related to #129.

Error

2021-07-14 12:59:14: Processing alignment file: bam
[W::hts_idx_load3] The index file is older than the data file: analysis/Nanopore.bam.bai
021-07-14 12:59:16: Aligned reads analysis
[E::idx_find_and_load] Could not retrieve index file for 'analysis/nanosim_model/sim_primary.bam'

and further down

2021-07-14 12:59:17: match and error models
[E::idx_find_and_load] Could not retrieve index file for 'analysis/nanosim_model/sim_primary.bam'
Traceback (most recent call last):
  File "/beegfs/homes/eboileau/.miniconda3/envs/scNapBar-dev/bin/besthit_to_histogram.py", line 318, in hist
    cs_string = alnm.get_tag('cs')
  File "pysam/libcalignedsegment.pyx", line 2399, in pysam.libcalignedsegment.AlignedSegment.get_tag
  File "pysam/libcalignedsegment.pyx", line 2438, in pysam.libcalignedsegment.AlignedSegment.get_tag
KeyError: "tag 'cs' not present"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eboileau/.miniconda3/envs/scNapBar-dev/bin/read_analysis.py", line 720, in <module>
    main()
  File "/home/eboileau/.miniconda3/envs/scNapBar-dev/bin/read_analysis.py", line 710, in main
    error_model.hist(prefix, alnm_ext)
  File "/beegfs/homes/eboileau/.miniconda3/envs/scNapBar-dev/bin/besthit_to_histogram.py", line 320, in hist
    cs_string = get_cs(alnm.original_sam_line.split()[5], alnm.get_tag('MD'))
AttributeError: 'pysam.libcalignedsegment.AlignedSegment' object has no attribute 'original_sam_line'

Expected behavior

read_analysis.py completes successfully, and generates the model to be used as input for simulator.py.

To reproduce

read_analysis.py genome -i Nanopore.fq.gz -ga Nanopore.bam -o nanosim_model/sim

and this also occurs with all uncompressed input files (I guess this is expected, since NanoSim now outputs compressed files anyway)

read_analysis.py genome -i Nanopore.fq -ga Nanopore.sam -o nanosim_model/sim

However, using NanoSim 2.5.0, the latest command is successful.

Environment

Python 3.7.6 conda 4.9.2 NanoSim 3.0.1 ( but Note that the version has not been updated in some scripts, e.g. read_analysis.py --version return NanoSim 3.0.0, although I am using 3.0.1 ) pysam 0.16.0.1 (samtools 1.10, using htslib 1.10.2)

cheny19 commented 2 years ago

Hi @eboileau , this is a different error which I cannot reproduce with your command. My gut feeling is this error happens with this line The index file is older than the data file: analysis/Nanopore.bam.bai. Could you try running read_analysis.py without your own alignment file? If you have to, make sure it contains either 1) CS tag or 2) both MD tag and cigar string.

kmnip commented 2 years ago

The primary.bam file is not indexed, an that might be an issue for pysam

Pysam can read BAM files line by line without an index file just fine. The index file is only required for random access on the BAM file. So, the message from Pysam, [E::idx_find_and_load] Could not retrieve index file, can be safely ignored. You would see this message in a successful run of NanoSim.

eboileau commented 2 years ago

Thanks for your prompt reply. I can reproduce this error with different input files. The error actually arises from the primary.bam file that is generated by NanoSim.

I can reproduce this error with v3.0.1. I also downloaded v3.0.0, as well as v3.0.0-beta, and I get the same error (but using uncompressed files as input). With v2.6.0 and v2.5.0, I have no error, and read_analysis.py finishes successfully, so this seems to be due to changes in v3.0.0-beta.

eboileau commented 2 years ago

@kmnip right, without an index, random access via fetch() and pileup() is disabled.

eboileau commented 2 years ago

I attach a small test set of about 200 Nanopore reads, mapped with

minimap2 -v1 -t 6 -ax splice --MD -ub /biodb/genomes/homo_sapiens/GRCh38_96/GRCh38_96.fa sample.fastq.gz | samtools sort - -@6 -T /scratch/global_tmp/ -o sample.bam

The NanoSim command to reproduce the error (v3.0.0-beta, v3.0.0, and v3.0.1) is

read_analysis.py genome -i sample.fastq.gz -ga sample.bam -o sim

test.tar.gz

eboileau commented 2 years ago

Actually, I was using NanoSim version release v3.0.0-beta, v3.0.0, and v3.0.1 (tags). But if I clone master, and run read_analysis.py, I have no error, only the warning [E::idx_find_and_load] Could not retrieve index file for 'sim_primary.bam. @cheny19 this might explain why you could not reproduce the error, if you were using the master branch.

It would be nice to be able to use the latest version release without error, and eventually be able to install it via conda. Thanks.

cheny19 commented 2 years ago

I see. It seems there's no much difference with the code in master branch and v3.0.1, so I guess the error has something to do with conda installation.

eboileau commented 2 years ago

v3.0.1 is not in Anaconda. I tested all releases (v3.0.0-beta, v3.0.0, v3.0.1, and v2.0.6) by just downloading the tag, and running the python script read_analysis.py. I did the same for master, I cloned, and ran the script. The rest of the environment remained the same for all tests. Only v2.0.5 was installed via conda, but v.2.05, v2.0.6, and master all were fine. I doubt this has to do with conda... And since dependencies remained the same, this must be something in NanoSim...

In any case, if master has no error, do you think it would be possible to create a new release soon? I don't know what are your development plans, but that would be great. Since I use NanoSim in my workflow, I would rather pin a specific release for reproducibility.

eboileau commented 2 years ago

By the way, were you able to reproduce and trace the error with the small dataset I uploaded, using releases v3.0.0-beta, v3.0.0, v3.0.1 ?

cheny19 commented 2 years ago

Oh, I finally figured out what went wrong! It's the pysam error when retrieving cs tag and MD tag. It was mentioned in another issue too, and fixed by the most recent commit, which is why the master branch works now. I can quickly make a release so you can tag the current master branch for your pipeline.

eboileau commented 2 years ago

Awesome, thanks! I'm waiting to see the next release.

eboileau commented 2 years ago

Thanks for updating bioconda with v3.0.2.

puwanat-s commented 7 months ago

Dear @cheny19 and @SaberHQ,

I am trying to use Nanosim to simulate ONT reads, but I encountered a similar problem described in this discussion. I ran read_analysis.py using Nanosim v3.1.0 and Nanosim v3.0.2. The input was a MinION result, my reference was ~16,000 human ORF sequence (example for heading: >176_C1_IOH63106_GPR30_2852_BC082766.1_AAH82766.1), and my alignment file is from minimap2 using the -ax sr mode.

The command I used for read_analysis.py was the following (for v3.1.0):

python ./NanoSim-3.1.0/src/read_analysis.py genome -i <my ONT reads .fastq> \ -r <my reference .fa> \ -t 48 \ -ga <my minimap2 alignment .sam> -o

I got the following errors for both versions of Nanosim:

2023-11-22 23:34:14: Read pre-process
2023-11-23 00:08:12: Processing alignment file: sam
2023-11-23 02:07:18: Aligned reads analysis
[E::idx_find_and_load] Could not retrieve index file for '/training_primary$bam'
2023-11-23 02:25:16: Computing KDEds processed >> 15085001
2023-11-23 02:25:31: Computing KDE for aligned region
2023-11-23 02:31:53: Computing KDE for aligned reads
2023-11-23 02:33:33: Computing KDE for unaligned length
2023-11-23 03:29:55: Computing KDE for ht ratio
2023-11-23 03:33:06: Unaligned reads analysis
2023-11-23 03:33:17: match and error models
[E::idx_find_and_load] Could not retrieve index file for '/training_primary.bam'
Traceback (most recent call last):
File "/NanoSim-3.1.0/src/besthit_to_histogram.py", line 322, in hist
cs_string = alnm.get_tag('cs')
File "pysam/libcalignedsegment.pyx", line 2399, in pysam.libcalignedsegment.AlignedSegment.get_tag
File "pysam/libcalignedsegment.pyx", line 2438, in pysam.libcalignedsegment.AlignedSegment.get_tag
KeyError: "tag 'cs' not present"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./NanoSim-3.1.0/src/read_analysis.py", line 751, in
main()
File "./NanoSim-3.1.0/src/read_analysis.py", line 741, in main
error_model.hist(prefix, alnm_ext)
File "/NanoSim-3.1.0/src/besthit_to_histogram.py", line 324, in hist
cs_string = get_cs(alnm.cigarstring, alnm.get_tag('MD'))
File "pysam/libcalignedsegment.pyx", line 2399, in pysam.libcalignedsegment.AlignedSegment.get_tag
File "pysam/libcalignedsegment.pyx", line 2438, in pysam.libcalignedsegment.AlignedSegment.get_tag
KeyError: "tag 'MD' not present"

Dependencies: pysam 0.17.0 numpy 0.16.0.1 sklearn 0.22.1 six 0.16.0 scipy 1.4.1 HTSeq 0.11.2 joblib 0.14.1 samtools 1.12

Could you please help me identify what went wrong there ? Could it be the sam file or is it about the dependencies ? Thank you so much !

SaberHQ commented 7 months ago

Hi @puwanat-s Thanks for your interest in using NanoSim.

I am pretty sure the problem is not related to the dependencies. It is originating from the alignment file.

Would you please allow the NanoSim to run the reference alignment by minimap instead of providing the alignment file manually and see if it solves the issue. You may also run the minimap with the following arguments and then provide the alignment file (this is the same code which NanoSim runs internally if no alignment file is provided)

# Run minimap2 
minimap2 --cs -ax map-ont -t <num_threads> <ref_g> <input_reads_fasta> 
# And then create the bam format using samtools
samtools view -b > genome_alnm

Looks like the error is due to the missing --cs and --MD tags in the alignment file. The MD tag is a "String encoding mismatched and deleted reference bases, used in conjunction with the CIGAR and SEQ fields to reconstruct the bases of the reference sequence interval to which the alignment has been mapped."

Let us know if this solves your problem and keep us updated.

@kmnip Please kindly share your thoughts. Thanks.

Thanks, Saber.

puwanat-s commented 7 months ago

Thank you so much for your response @SaberHQ !

You are correct ! I checked the SAM file, and I could not see MD and cs tags. I re-ran minimap2 again with --cs and --MD in the command line, and the SAM got MD tag. I also re-ran nanosim, and it worked as the following:

2023-11-23 12:53:30: Read pre-process 2023-11-23 13:27:29: Processing alignment file: sam 2023-11-23 15:26:59: Aligned reads analysis [E::idx_find_and_load] Could not retrieve index file for '/training_primary.bam' 2023-11-23 15:45:03: Computing KDEds processed >> 15036001 2023-11-23 15:45:18: Computing KDE for aligned region 2023-11-23 15:52:38: Computing KDE for aligned reads 2023-11-23 15:54:17: Computing KDE for unaligned length 2023-11-23 16:56:13: Computing KDE for ht ratio 2023-11-23 16:59:26: Unaligned reads analysis 2023-11-23 16:59:37: match and error models [E::idx_find_and_load] Could not retrieve index file for '/training_primary.bam' 2023-11-23 17:45:58: Model fitting 2023-11-23 17:46:43: Mismatch fitting start Mismatch parameters: 0.17725125744851017 0.7598485311351446 0.8700526597240825 Residual 0.00011517506998737215
2023-11-23 17:46:56: Mismatch fitting done 2023-11-23 17:46:56: Insertion fitting start WARNING! Insertion parameters may not be optimal! 0.7479155658019974 0.8684862673872078 0.020162614741612066 0.9997655002543104 Residual 0.0003179841450340337 2023-11-23 17:47:48: Insertion fitting done 2023-11-23 17:47:48: Deletion fitting start WARNING! Deletion parameters may not be optimal! 0.8063015323928671 0.8572365930127233 0.14353489970389277 0.9886319864280954 Residual 0.00047630354180561163 2023-11-23 17:48:28: Deletion fitting done 2023-11-23 17:48:28: Finished!

I am excited with this result and will try the simulator now. Thanks so much for your help !

Best, Puwanat

kmnip commented 7 months ago

@puwanat-s : Thanks for reporting this. @SaberHQ : I have updated the README regarding the requirement of the cs tag in BAM files.

puwanat-s commented 7 months ago

Dear @SaberHQ and @kmnip,

By the way, I ran the simulator.py, and I noticed that the step "Start simulation of random reads" took very long time. At first I ran the following command:

./NanoSim-3.1.0/src/simulator.py genome -t 48 \ -rg <my fasta file with 16,000 ORF> \ -c <location of training*> \ -o \ -n 300000 \ -b guppy \ -dna_type linear \ --fastq

I got like ~100-200k reads simulated quickly, but then I got 10k random reads simulated and it kept running for very long time without reaching 300k reads. So, I quit the process and tried -n 20000 (default). I got the first 10k reads very quick, but then it started the simulation of random reads which took very long time and never finished.

My question is what exactly the step "start simulation of random reads" does and what could be an explanation why it takes very long time. Thank you so much again !

kmnip commented 7 months ago

Hi @puwanat-s ,

I am suspecting that this is due to your references sequences (ORFs) being very short. It is very likely that simulated reads (especially those with more/longer insertion errors than deletion errors) would be longer than the longest ORFs. So, it is stuck in an infinite loop at this part of the code: https://github.com/bcgsc/NanoSim/blob/adfd7c6be40ea735c8fd208260fc7a2f97ccdac0/src/simulator.py#L1324-L1325

Can you please try the following branch and let us know whether that fixes the issue on your end? https://github.com/bcgsc/NanoSim/tree/fix_inf_loop

puwanat-s commented 7 months ago

Hi @kmnip,

Thank you so much for your suggestion ! Indeed, the ORF lengths are varied from ~60-7000nt. I will definitely try the simulator.py that fixes the infinite loop and let you know how it goes.

In the meantime, I would like to elaborate that I try to use Nanosim to benchmark several short and long aligners (e.g. Bowtie2, BWA-MEM, Minimap2, Winnowmap2, NGMLR) to retrieve DNA barcoded ORF pairs. I think the goal here is to use real ONT reads to simulate the DNA barcoded ORF pairs with errors close to real ONT sequencing and establish ground truth to compare the read aligners (i.e. known DNA barcoded ORF pairs from the simulator compared to those retrieved from read aligners). So, I think I will try insert the ORFs from the fasta file into the plasmid backbone sequence with unique DNA barcodes and then run the simulator. Is this strategy better than using the ORF sequences alone, which would force the simulated sequences to have random nucleotides flanking the ORF sequences ?

UPDATE:

The edited simulator.py worked ! The run finished and I got 20k simulated reads.

Thank you so much again for your quick response !

kmnip commented 7 months ago

Hi @puwanat-s , Thanks for reporting your results. We are very glad to hear that the quick-fix had solved the issue on your end!