Indexing Bam files for large genome mapping

sebel76 commented 1 year ago

Hi Micheal (again x2),

I have one observation to report: Indexing Bam files don't work on large genome as Samtools need the '-c' option to index .csi files. I turn around editing the following lines:

Original code: 750 idx_args = (['samtools', 'index', '-@', str(args.threads), 773 idx_args = ['samtools', 'index', '-@', str(args.threads), mb] 1076 subprocess.run(f'samtools index {bam}', shell=True)

Edited code: 750 idx_args = (['samtools', 'index', '-c', '-@', str(args.threads), 773 idx_args = ['samtools', 'index', '-c', '-@', str(args.threads), mb] 1076 subprocess.run(f'samtools index -c {bam}', shell=True)

After these changes, it worked well for all datasets mapping to large genomes. Except for the bug reported in a previous issue about StructVis empty result files.

Best, Sébastien

MikeAxtell commented 1 year ago

Hi Sébastien,

Thanks for pointing this out. I confess I have never worked with cases where a single chromosome was bigger than 2^29 (~512Mb) in length, so I wasn't even aware of the alternative .csi format for .bam indices. I will fix this in the next release. I think it may be more lines that the one you point out (for various reasons).

Can you suggest a good, publicly available, genome assembly that has chromosome that large (over 2^29) so I can test?

sebel76 commented 1 year ago

Hi Micheal,

When I use a new tool, I go to the code and modify it with following:

samtools index -c
bowtie --large-index

Then, I can use it without worry about the genome size; it processes small and large genomes. As an example, I can process Brachypodium beside bread wheat. I guess that there is some computational cost to analyse sRNA to the small genomes, but it simplify my code and task management.

If you want to make your test with a large genome, I would suggest the rye (Secale cereale). This is an huge diploid genome having seven chromosome for a total of ~8Gb. I think there are 3-5 chromosomes with more than 1Gb.

Recently, a good quality genome assembly and annotation had been release and published. You can find the information and download there:

Best, Sébastien

MikeAxtell commented 1 year ago

Great, thanks. I'll use this as a test-case to implement support for mega-large genomes / chromosomes.

crisalves80 commented 1 year ago

Hi Mike, I'm having a similar issue with large genomes:

Aligning sRNA.fastq First pass alignment with bowtie using 19 threads Second pass - placing multimappers using 19 threads to process 136 chunks [bam_sort_core] merging from 19 files and 19 in-memory blocks... [E::hts_idx_check_range] Region 536895685..536895703 cannot be stored in a bai index. Try using a csi index [E::sam_index] Read 'SRR13424690.10524618' with ref_name='Chr01', ref_length=551969684, flags=16, pos=536895686 cannot be indexed samtools index: failed to create index for "Wmir.sRNA/sRNA.bam": Numerical result out of range

Converting to sorted bam format FATAL: samtools index encountered an error!

MikeAxtell commented 1 year ago

Thanks @crisalves80 , yes I hope to make time to work on the 'big genome' issue soon. @sebel76 .. for Secale cereale I cannot find any public sRNA-seq datasets. Maybe I am just missing them. Can you specify some SRA accession numbers for sRNA-seq data from S. cereale? Or @crisalves80 can you suggest an alternative large genome, publicly available, and some corresponding sRNA-seq datasets (SRA accession numbers)?

crisalves80 commented 1 year ago

Hi @MikeAxtell , I changed the code according to sebel76, adding the "-c" for the samtools indexing and it worked. The strucVis folder is empty though. I'm working with some public datasets for this genome of ~7Gb. Those are the sRNA SRA: SRR13424687 SRR13424688 SRR13424689 SRR13424690 SRR13424691 SRR13424692 and the genome for W.mirabilis is at https://datadryad.org/stash/dataset/doi:10.5061/dryad.ht76hdrdr

MikeAxtell commented 1 year ago

Perfect, thanks for the data pointer.

Yes, just changing that one line is not enough for all of the other parts of the script that would need to detect and use large bam indices. Will fix in next release.

sebel76 commented 1 year ago

Hi Micheal,

Sorry for my slow reply; I could no reply earlier.

There is bread wheat with ~15.5 Gb genome and plenty sRNA-sew data available. You can used the data that I added to NCBI for my work in 2020. You can find the SRA accession number on https://doi.org/10.1104/pp.20.00816

I would recommend to work with rye because there is ultra long chromosomes over 1 Gb while not in wheat.

I have sRNA data in rye but not publish yet, thus not public.

I will be happy to share it with you in private.

Tell me if you want it and I will sent you an email to get the details where you want me to upload it to a server.

Best, Sébastien

MikeAxtell commented 1 year ago

@sebel76 thanks! I think I will be able to use the Welwitschia mirabilis genome and data that @crisalves80 pointed me to.

Hopefully I will be able to find some time to get to this soon!

MikeAxtell commented 1 year ago

ShortStack's csi index issues have been resolved as of 8adb7b4. This will be included in the next release (which should be soon). I will leave this issue open until the release is done and running on bioconda.

There were related issues in the helper scripts strucVis and ShortTracks which have been fixed by new releases of both of those scripts. Those updates should also hit bioconda soon.

MikeAxtell commented 1 year ago

Should note that now ShortStack when creating alignments will always use the .csi format of BAM indices; it will no longer in case create a .bai file when aligning. It will however accept user-provided bamfiles that are .bai indexed.

MikeAxtell commented 1 year ago

Fixed in release 4.0.1

MikeAxtell / ShortStack

Indexing Bam files for large genome mapping #118