Downstream tools for DGEA are not taking advantage of index files due to wrong naming convention

biowdl / RNA-seq

A BioWDL pipeline for processing RNA-seq data, starting with FASTQ files to produce expression measures and VCFs. Category:Multi-Sample

https://biowdl.github.io/RNA-seq/

MIT License

31 stars 17 forks source link

Downstream tools for DGEA are not taking advantage of index files due to wrong naming convention #75

Closed EngineerReversed closed 4 years ago

EngineerReversed commented 4 years ago

I have been running the pipeline with the default code and test data as provided in the repository. However, I found that the bam index file produced has a naming convention different from what the tools expect(Tools don't explicitly require it but they expect the bam index file in the same folder as of the bam file.

Biowdl naming convention

bam file name: rna3-paired-end.markdup.bam bam index file name : rna3-paired-end.markdup.bai

Tools naming convention(expected)

bam file name: rna3-paired-end.markdup.bam bam index file name : rna3-paired-end.markdup.bam.bai

The second naming convention allows tools to use index files. This makes tools directly jump to required read or chromosome rather than processing file sequentially thereby increasing speed and reducing computation time.

PS: I noticed that there is no documentation that mentions the naming convention but it is just a community agreed convention.

DavyCats commented 4 years ago

Both .bai and .bam.bai are frequently used within the community. Most (if not all) tools I have encountered support both of these extensions and some of the most commonly used toolkits produce the .bai (as opposed to .bam.bai) extension, eg. GATK and Picard. This is actually the reason why we use .bai: the BAM file given back to the user is produced by GATK or Picard, at least originally that was the case.

htslib, the most commonly used library for SAM/BAM parsing (most other parsers are mere bindings to htslib) supports both. htsjdk, the java equivalent, supports both as well.

Are there any specific tools you are having issues with?

EngineerReversed commented 4 years ago

I found the htseq-count to be slower than expected so I changed extension from .bai to .bam.bai.

PS: I also added few more functionalities like qualimap, regtools(for bamtointron) etc. in the pipeline. I found regtools specifically demanded .bam.bai extension.

DavyCats commented 4 years ago

htseq-count uses pysam, which supports both extensions. HTSeq actually prefers name-sorted BAM files, for which you can't have an index, in order to deal with multi mapping more easily.

I'm not familiar with qualimap or regtools, but had a quick look. Qualimap (v.2.2.2-dev, installed with conda, qualimap bamqc --bam rna3.bam) showed me this:

WARNING: BAM index file /home/dcats/git/RNA-seq/expression-quantification/tests/data/rna3.bai is older than BAM /home/dcats/git/RNA-seq/expression-quantification/tests/data/rna3.bam

So it seems to have found the index.

Regtools (version 0.5.2, installed with conda, regtools junctions extract rna3.bam -s 0) also ran without issue. I couldn't find a bamtointron command, so I tested using junctions extract instead.

EngineerReversed commented 4 years ago

Thanks for trying out with Qualimap and regtools. Your results show that you were able to get index file detected by above tools. May be there is some other problem I am facing. I will try out with public data from NCBI to confirm my findings.

As a good practice, which naming convention should I follow or which one is going to be more prevalent and supported by community in future?

DavyCats commented 4 years ago

Either should be fine really. I don't believe there is really a trend specifically towards one or the other.

As for the issue you are facing. Did you add these steps to the WDL and, if so, did you also add the index as an input of type File to the tasks? Otherwise, cromwell won't localize them and the tools won't be able to find them.

EngineerReversed commented 4 years ago

As for the issue you are facing. Did you add these steps to the WDL and, if so, did you also add the index as an input of type >File to the tasks? Otherwise, Cromwell won't localize them and the tools won't be able to find them.

Yeah, I did that.

I tested the pipeline with NCBI data and it seems to detect the index files. I am not able to reproduce the error for now. Thanks for the discussion.