cancerit / PCAP-core

NGS reference implementations and helper code for mapping (originally part of ICGC-TCGA-PanCancer)
GNU General Public License v2.0

does the output depend on the input file format? #52

Closed mgcam closed 4 years ago

mgcam commented 4 years ago

We tested bam_stats v4.4.1 on a CRAM and a BAM file of the same data and found tiny differences in the #_divergent_bases values (for example, 264907761 (BAM) vs 264911747 (CRAM)). Is this a known feature? The tool was compiled against libhts v1.10.2. (On behalf of NPG.)

keiranmraine commented 4 years ago

Had the BAM file been processed to correct any ambiguity bases? If not, there are differences when CRAM regenerates the MD tags, as it will use the real reference base rather than the "random" [ACGT] that BWA fills N and ambiguity codes with.

I'm assuming that was the only affected field.

We have not updated this to support any relevant API changes in the C layer so there may be some impact there.
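The MD-tag explanation above can be illustrated with a small, self-contained sketch. It is not bam_stats or htslib code, and it assumes (the thread implies this but does not state it) that the divergent-base count follows the per-read mismatch information. The read, the two reference strings, and the helper `count_mismatches` are all hypothetical, chosen only to show how one ambiguity base can change a mismatch count.

```python
def count_mismatches(read: str, reference: str) -> int:
    """Count positions where an ungapped read differs from the reference."""
    if len(read) != len(reference):
        raise ValueError("read and reference must have the same length")
    return sum(
        1 for read_base, ref_base in zip(read, reference) if read_base != ref_base
    )


# Hypothetical 8-base read, aligned without gaps.
read = "ACGTACGT"

# Real reference: the fifth base is an ambiguity code (N).
real_reference = "ACGTNCGT"

# Assumed view of the same region as the aligner saw it, with the N
# replaced by an arbitrary base ('A' here, picked for illustration).
aligner_reference = "ACGTACGT"

# Mismatches as recorded at alignment time (what a BAM would carry).
bam_style = count_mismatches(read, aligner_reference)

# Mismatches as recomputed against the real reference (CRAM decode).
cram_style = count_mismatches(read, real_reference)

print(f"aligner-time mismatches: {bam_style}")
print(f"regenerated mismatches:  {cram_style}")
```

In this toy case the regenerated count is one higher, because the read base is compared with N instead of the substituted base. Summed over every read that overlaps an ambiguity position, an effect of this kind could plausibly account for a small BAM/CRAM discrepancy.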

mgcam commented 4 years ago

What input do you use in your pipeline? We'd like to have the same results as you have.

keiranmraine commented 4 years ago

You haven't answered the questions I asked to try and identify the cause:

> Had the BAM file been processed to correct any ambiguity bases? If not, there are differences when CRAM regenerates the MD tags, as it will use the real reference base rather than the "random" [ACGT] that BWA fills N and ambiguity codes with.
>
> I'm assuming that was the only affected field.
>
> We have not updated this to support any relevant API changes in the C layer so there may be some impact there.

We are using htslib 1.9 as indicated in the Dockerfile.

> What input do you use in your pipeline? We'd like to have the same results as you have.

I'm not sure how the input for our pipeline is relevant to you not getting the same result from bam_stats for BAM/CRAM of the same data.