Closed EngineerReversed closed 4 years ago
Both .bai
and .bam.bai
are frequently used within the community. Most (if not all) tools I have encountered support both of these extensions and some of the most commonly used toolkits produce the .bai
(as opposed to .bam.bai
) extension, eg. GATK and Picard.
This is actually the reason why we use .bai
: the BAM file given back to the user is produced by GATK or Picard, at least originally that was the case.
htslib, the most commonly used library for SAM/BAM parsing (most other parsers are mere bindings to htslib) supports both. htsjdk, the java equivalent, supports both as well.
Are there any specific tools you are having issues with?
I found the htseq-count
to be slower than expected so I changed extension from .bai to .bam.bai.
PS: I also added few more functionalities like qualimap
, regtools
(for bamtointron) etc. in the pipeline. I found regtools
specifically demanded .bam.bai extension.
htseq-count uses pysam, which supports both extensions. HTSeq actually prefers name-sorted BAM files, for which you can't have an index, in order to deal with multi mapping more easily.
I'm not familiar with qualimap or regtools, but had a quick look.
Qualimap (v.2.2.2-dev, installed with conda, qualimap bamqc --bam rna3.bam
) showed me this:
WARNING: BAM index file /home/dcats/git/RNA-seq/expression-quantification/tests/data/rna3.bai is older than BAM /home/dcats/git/RNA-seq/expression-quantification/tests/data/rna3.bam
So it seems to have found the index.
Regtools (version 0.5.2, installed with conda, regtools junctions extract rna3.bam -s 0
) also ran without issue. I couldn't find a bamtointron
command, so I tested using junctions extract
instead.
Thanks for trying out with Qualimap
and regtools
. Your results show that you were able to get index file detected by above tools. May be there is some other problem I am facing. I will try out with public data from NCBI to confirm my findings.
As a good practice, which naming convention should I follow or which one is going to be more prevalent and supported by community in future?
Either should be fine really. I don't believe there is really a trend specifically towards one or the other.
As for the issue you are facing. Did you add these steps to the WDL and, if so, did you also add the index as an input of type File
to the tasks? Otherwise, cromwell won't localize them and the tools won't be able to find them.
As for the issue you are facing. Did you add these steps to the WDL and, if so, did you also add the index as an input of type >File to the tasks? Otherwise, Cromwell won't localize them and the tools won't be able to find them.
Yeah, I did that.
I tested the pipeline with NCBI data and it seems to detect the index files. I am not able to reproduce the error for now. Thanks for the discussion.
I have been running the pipeline with the default code and test data as provided in the repository. However, I found that the bam index file produced has a naming convention different from what the tools expect(Tools don't explicitly require it but they expect the bam index file in the same folder as of the bam file.
Biowdl naming convention
bam file name:
rna3-paired-end.markdup.bam
bam index file name :rna3-paired-end.markdup.bai
Tools naming convention(expected)
bam file name:
rna3-paired-end.markdup.bam
bam index file name :rna3-paired-end.markdup.bam.bai
The second naming convention allows tools to use index files. This makes tools directly jump to required read or chromosome rather than processing file sequentially thereby increasing speed and reducing computation time.
PS: I noticed that there is no documentation that mentions the naming convention but it is just a community agreed convention.