38 / d4-format

The D4 Quantitative Data Format

Part size limit causing odd results every 10M bases? #67

Open onordesjo opened 1 year ago

onordesjo commented 1 year ago

Hi, thanks for a great tool!

I'm seeing some issues in a d4 file (created with d4tools create -A basecalls.bam basecalls.d4).

There appear to be large flat (zero-depth) regions in the d4 file, each starting at or just before a multiple of ~10M bases (possibly coinciding with the part size limit?).

They show up, for example, when listing intervals longer than 100,000 bases:

d4tools show basecalls.d4 chr1_MATERNAL | csvtk add-header -n "ref,st,en,depth" | csvtk mutate2 -n diff -e '$en-$st' | csvtk filter2 -f '$diff > 100000'

ref            st         en         depth  diff
chr1_MATERNAL  0          560877     0      560877.00
chr1_MATERNAL  10000000   10657852   0      657852.00
chr1_MATERNAL  19999212   20671130   0      671918.00
chr1_MATERNAL  40000000   40199077   0      199077.00
chr1_MATERNAL  50000000   50549749   0      549749.00
chr1_MATERNAL  60000000   60798861   0      798861.00
chr1_MATERNAL  110000000  110292516  0      292516.00
chr1_MATERNAL  120000000  120302258  0      302258.00
chr1_MATERNAL  140000000  140279366  0      279366.00
chr1_MATERNAL  150000000  151804768  0      1804768.00
chr1_MATERNAL  170000000  170202385  0      202385.00
chr1_MATERNAL  179994048  180318995  0      324947.00
chr1_MATERNAL  190000000  190450433  0      450433.00
chr1_MATERNAL  220000000  220166024  0      166024.00
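
For anyone who wants to reproduce the check without csvtk, here is a rough awk sketch of the same filter. It also prints how far each run's start is from the nearest multiple of 10M. The function name `flag_zero_runs` and its two optional arguments are made up for illustration. It assumes `d4tools show` emits whitespace-separated `chrom start end depth` lines, as in the listing above.

```shell
# Flag zero-depth runs longer than min_len and report how far each run's
# start lies from the nearest multiple of part_size.
# Hypothetical helper; arguments: [min_len=100000] [part_size=10000000].
# Input (stdin): whitespace-separated lines "chrom start end depth".
flag_zero_runs() {
  awk -v OFS='\t' -v min_len="${1:-100000}" -v part="${2:-10000000}" '
    $4 == 0 && ($3 - $2) > min_len + 0 {
      offset = $2 % part                      # distance past the previous boundary
      if (offset > part / 2) offset -= part   # negative: run starts before a boundary
      print $1, $2, $3, $3 - $2, offset       # chrom, start, end, length, offset
    }'
}

# Usage (assumes the same file and contig as above):
# d4tools show basecalls.d4 chr1_MATERNAL | flag_zero_runs
```

On the intervals listed above, the last column would be 0 for every row except the two that start slightly before a boundary (19999212 gives -788 and 179994048 gives -5952).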

I have checked with samtools depth that coverage in these regions looks as expected (not zero), so the BAM appears to be well-formed.
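
A minimal sketch of that kind of check is below. `mean_depth` is a hypothetical helper that averages the depth column of `samtools depth` output. The usage line assumes the d4 intervals are 0-based and half-open (BED-like), so the start is shifted by one for the 1-based samtools region.

```shell
# Mean depth from `samtools depth` output on stdin (columns: chrom, pos, depth).
# Prints "NA" when there is no input.
mean_depth() {
  awk '{ sum += $3 } END { if (NR > 0) printf "%.2f\n", sum / NR; else print "NA" }'
}

# Usage for the second interval in the listing (10000000-10657852 in the d4
# output is assumed to map to 10000001-10657852 for samtools; -a also
# reports zero-depth positions so they count towards the mean):
# samtools depth -a -r chr1_MATERNAL:10000001-10657852 basecalls.bam | mean_depth
```

A mean clearly above zero here, against a depth of 0 for the same interval in the d4 file, would point at the d4 file rather than the BAM.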

Is there anything that can be done to get correct depth in these regions, either by increasing the part size or by otherwise stitching the parts together?