38 / d4-format

The D4 Quantitative Data Format
MIT License
156 stars 20 forks source link

create fails on some cram/bams #56

Closed petersudmant closed 2 years ago

petersudmant commented 2 years ago

I've noticed that some crams/bams seem to result in empty d4 files. I'm not quite sure why. An example are the HGDP cram files (e.g. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/data/Yi/HGDP01180/alignment/HGDP01180.alt_bwamem_GRCh38DH.20181023.Yi.cram).

I can even subset these to smaller bams (say including just a small portion of chr1) and creation still fails. Let me know if you'd like me to post one of these small bams.

38 commented 2 years ago

Thanks for reporting the issue. If you can provide a smaller bam that would be great! Thanks!

petersudmant commented 2 years ago

Here's an example bam: https://drive.google.com/drive/folders/1qKVTi4xnFrB9a3GXqGuWeXKaficFGMSR?usp=sharing

This was created by taking the aforementioned cram (see previous post) and running

samtools view -b -T GRCh38_full_analysis_set_plus_decoy_hla.fa HGDP01180.altbwamem_GRCh38DH.20181023.Yi.cram chr1:1-5000000 >example.bam
samtools index example.bam
d4tools create example.bam example.d4

the latest release d4-format-d4tools-v0.3.7 was used

38 commented 2 years ago

Hi, just checked the BAM file, it seems the alignments mapping qualities are blow the default minimal mapping quality threshold.

You can add -q 0 to the command arguments you run d4.

$ d4tools stat example.d4 | grep -v '0$'
$ d4tools create example.bam -q0
$ d4tools stat example.d4 | grep -v '0$'
chr1    0       248956422       0.8444072031208739

Note if not specified the default min mapping quality is set to 60.

petersudmant commented 2 years ago

Ahhhhh -> the max quality in all these crams is 20 because it was capped. Thanks much. That was driving me nuts.

petersudmant commented 2 years ago

Would be helpful if the --help noted that default was set

snystrom commented 2 years ago

This same thing has bit me a couple times. My default assumption about how a tool like this should behave is that it would do no filtering unless asked. q60 is a lot more stringent than I would expect for a default besides. Compare to other tools like Deeptools bamCoverage which do as few manipulations to the input data by default, allowing the user to specify additional stringency.

If the default isn't adjusted (which might be too breaking at this point, I'm not sure), please do make this very clear in the help docs since it is not obvious in the current state:

-q, --mapping-qual <mapping-qual>
            The minimal mapping quality (Only valid with CRAM/BAM inputs)
petersudmant commented 2 years ago

+1 @snystrom

arq5x commented 2 years ago

I agree. I know it would be breaking, but I think it makes more sense to default to zero (filter nothing) and make it clear in the release. @38 - what do you think?

petersudmant commented 2 years ago

Still seems to be a bug here, and something of a tricky one. Here's a tiny bam example:

https://drive.google.com/drive/folders/1qKVTi4xnFrB9a3GXqGuWeXKaficFGMSR?usp=sharing

This bam includes reads of various mapqs. Now, make a d4 file from this bam using -q0 or -q60. The results show that only the reads prior to the first q60 read are filtered out when using -q60 (I think). The d4 files are essentially identical. Now instead filter the bam w/ samtools for only -q60 reads and make a d4 file. The results are wildly different. See below. The bug is tricky to figure out in the code...

#start w/ test.0.bam test.0.bai

d4tools={path_to_d4_tools}
samtools view -b -q60 test.all.bam > test.q60.bam
samtools index test.q60.bam

$d4tools create -z -q 60 test.all.bam test.all.q60.d4
$d4tools create -z -q 0 test.all.bam test.all.q0.d4
$d4tools create -z -q 0 test.q60.bam test.q60.q0.d4

$d4tools  view test.all.q60.d4 chr1 >view.all.q60
$d4tools  view test.all.q0.d4 chr1 >view.all.q0
$d4tools  view test.q60.q0.d4 chr1 >view.60.q0

diff view.all.q60 view.all.q0
diff view.all.q0 view.60.q0
38 commented 2 years ago

Hi @petersudmant, thanks for reporting this bug. Just pushed a fix to the master branch, it would be really helpful if you can confirm the fix actually works.

petersudmant commented 2 years ago

will do a bit later today! I noticed that && =). But, it actually seems to work fine w/ & interestingly. I'll let u know!

arq5x commented 2 years ago

Given the critical nature of this fix, I think we need a new 0.4.0 release for the CLI and pyd4.

petersudmant commented 2 years ago

My tests suggest this fix worked (though, I have only run on a couple of examples). Thanks so much! I'll run some more testing over the next 36 hours. To be sure, I concur w/ @arq5x - this is a massive change in the outputs of the d4 files. Thanks so much for being so responsive w/ these fixes!