38 / d4-format

The D4 Quantitative Data Format

Occasionally large differences in coverage for same gene region in same sample #78

Closed. Jakob37 closed this issue 1 month ago

Jakob37 commented 2 months ago

Hello dear d4tools devs. Thank you for a very useful tool!

We are looking into using the d4 format as a basis for coverage calculations for patient samples, through the tool Chanjo2 (https://github.com/Clinical-Genomics/chanjo2) currently maintained by @northwestwitch.

Testing this out, I stumbled upon some strange numbers when investigating real samples. Some seemed to have very high overall coverage numbers, ~1000x, where 30-50x is expected.

I investigated this for one sample where I received 800x coverage. Running the genes directly through d4tools revealed some genes with huge coverage values:

$ d4tools stat --region intervals.bed <sample>_coverage.d4 | sort -k4,4 -n -r
19      49635292        49640143        30268.71325499897
10      119651380       119677819       12509.91527667461
11      19182030        19210571        6439.562664237413
3       14125015        14143680        3130.0193410125903
1       77888513        77943895        2294.566086454083
15      38252836        38357249        1283.3227088580925
15      66386837        66491656        1249.3048111506503
12      112418351       112509918       665.9010123734533
18      36297713        36780220        405.1971349638451
...

Looking manually in IGV, the coverage looked more like 30-40x.

Running only that specific region, I get more reasonable coverage. But this is really strange, isn't it? That we get hugely different coverage for the same region depending on whether it is run alone or together with other regions.

$ d4tools stat --region gene_19.bed <sample>_coverage.d4
19      49635292        49640143        39.8051948051948

I looked closer at this and found that when running a BED file containing this region together with others, it tended to get ~30000x coverage, but when running it by itself, it got ~40x.

The d4 file was generated using the default command:

d4tools create --threads 10 <bam> <d4>

Is this a bug or am I missing something here?

Thankful for any input. Let me know if I can help in any way by providing more information. This is the final blocking question before we start using d4tools in practice, so it is quite urgent for us.

I'll continue investigating this. Unfortunately I cannot share the datasets I have used so far, but I will check whether the issue is still present when running a GIAB sample.

Jakob37 commented 2 months ago

More context. I am running d4tools version 0.3.9 through a singularity container, but got identical results when running it through Chanjo2.

Generating the d4 file in sparse mode from the same BAM file, I get a similar pattern.

$ d4tools create --sparse --threads 5 <bam> d4/sparse.d4
$ d4tools stat --region bed/intervals.bed d4/sparse.d4 > out.tsv
$ cat out.tsv | sort -k4,4 -n -r | head
19      49635292        49640143        39125.30612244898
10      119651380       119677819       16313.91516320587
11      19182030        19210571        8398.013699590063
3       14125015        14143680        4069.0358424859364
1       77888513        77943895        3010.7529522227437
15      38252836        38357249        1686.8336701368603
15      66386837        66491656        1627.4914948625726
12      112418351       112509918       868.0358207651228
18      36297713        36780220        532.3012909657269
...

Running the single gene gives a lower number, which also differs from the non-sparse run:

$ d4tools stat --region bed/gene_19.bed d4/sparse.d4 | head
19      49635292        49640143        48.781488352916924

Jakob37 commented 2 months ago

OK, I think I understand a bit more now after poking around in the code. It looks like it occasionally accumulates all reads from an upstream region down to the end of the target region.

I.e. I am working with these regions:

$ cat data/intervals_19_only.bed 
19      45769709        45782552
19      4090321 4124122
19      49635292        49640143
19      55151767        55157773

The reported value for the 49635292-49640143 range is actually the number of nucleotides collected over 45769709-49640143 divided by the number of nucleotides between 49635292 and 49640143.

It seems like the self.sum value in the TaskPartition impl (in means.rs) isn't cleared between runs, and it also somehow accumulates over the gap between the regions.
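
As a back-of-the-envelope check of that explanation, the numbers line up if one assumes a roughly ~38x baseline depth over the whole accumulated span (my rough estimate from the other intervals, not a measured value). A minimal sketch:

fn main() {
    // Coordinates from the explanation above.
    let acc_begin = 45_769_709.0_f64;    // where the accumulation apparently starts
    let target_begin = 49_635_292.0_f64; // start of the target interval
    let target_end = 49_640_143.0_f64;   // end of the target interval
    // Assumed average depth over the whole span (an estimate, not measured).
    let baseline_depth = 38.0;

    // Depth summed over the entire accumulated span...
    let sum = (target_end - acc_begin) * baseline_depth;
    // ...but divided only by the target interval's own length:
    let inflated_mean = sum / (target_end - target_begin);
    println!("{:.0}", inflated_mean); // ~30319, in the ballpark of the reported 30268.7
}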

Jakob37 commented 2 months ago

Looks like it might have to do with how ranges are assigned inside track.rs, here:

https://github.com/38/d4-format/blob/945b36b46e0d22bf5e1d1ebbf7f7a59b9470fdd4/d4/src/d4file/track.rs#L72-L75

It seems the this_end variable can be assigned the end of the previous interval, leading to accumulation starting from there rather than from the beginning of the most recent interval.

After updating it as follows, starting the accumulation from the beginning of the range instead, I get sane values:

if this_end != last_end {
    // Start accumulating from the beginning of the current (top) range,
    // rather than from a position carried over from a previous interval.
    let this_begin = top.get_range().0;
    func(this_begin, this_end, &mut active_heap);
    last_end = this_end;
}
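
To make the effect concrete, here is a small toy model, not the actual d4 sweep code (which works over active_heap and chunked data), of what happens when the accumulation for a region starts at the previous region's end instead of at the region's own begin:

// Toy model: mean depth per query region over a per-base depth array.
// `buggy = true` starts accumulating at the previous region's end (counting
// the gap in between); `buggy = false` starts at the region's own begin.
fn region_means(depth: &[u32], regions: &[(usize, usize)], buggy: bool) -> Vec<f64> {
    let mut means = Vec::new();
    let mut last_end = 0usize;
    for &(begin, end) in regions {
        let acc_start = if buggy { last_end } else { begin };
        let sum: u64 = depth[acc_start..end].iter().map(|&d| u64::from(d)).sum();
        // The mean is always divided by the region's own length, so any extra
        // bases picked up before `begin` inflate the result.
        means.push(sum as f64 / (end - begin) as f64);
        last_end = end;
    }
    means
}

fn main() {
    // Uniform 40x depth over a 1000 bp toy contig, two regions with a 700 bp gap.
    let depth = vec![40u32; 1000];
    let regions = [(0, 100), (800, 900)];
    println!("{:?}", region_means(&depth, &regions, false)); // [40.0, 40.0]
    println!("{:?}", region_means(&depth, &regions, true));  // [40.0, 320.0]
}

In the toy, the 700 bp gap of 40x depth gets counted into the 100 bp second region, inflating its mean eight-fold; with megabase-scale gaps, that turns ~40x into tens of thousands, as seen above.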

Comparing runs with and without this change, I get the same coverage for some intervals but different coverage for others. I don't yet understand what is going on there 🤔 Guessing this might be like this for a reason, though. Would like some input on this from the devs if possible 🙏 @38 @arq5x

Jakob37 commented 2 months ago

A final note for this week. I could reproduce the pattern using a GIAB sample, which makes me think it has more to do with the pattern of regions in the BED file than with the content of the d4 file.

My bed-file:

19      45769709        45782552
19      4090321 4124122
19      49635292        49640143
19      55151767        55157773

Running all regions together. Note the third coverage value of 30000+.

$ d4 stat --region bed/intervals_only_19.bed hg002_coverage.d4
19      4090321 4124122 47.764267329368955
19      45769709        45782552        45.254146227516934
19      49635292        49640143        34655.42465471037
19      55151767        55157773        37.25041625041625

Running them one by one. Note that three of the values are identical to those from the previous run.

$ time for interval in bed/intervals_only_19/*.bed; do d4 stat --region ${interval} hg002_coverage.d4; done
19      4090321 4124122 47.764267329368955
19      45769709        45782552        45.254146227516934
19      49635292        49640143        46.97567511853226
19      55151767        55157773        37.25041625041625

brentp commented 2 months ago

@Jakob37 , thanks for the clear report. I'll have a look at this next week, but based on what you've found, could you do the same change after the while loop on lines 84-87 (now 85-88 in your code)?

Jakob37 commented 2 months ago

> @Jakob37 , thanks for the clear report. I'll have a look at this next week, but based on what you've found, could you do the same change after the while loop on lines 84-87 (now 85-88 in your code)?

Sounds great, thanks @brentp! I am away from my computer now; I will continue to look at this and your suggestion when back on Tuesday.

Jakob37 commented 2 months ago

> @Jakob37 , thanks for the clear report. I'll have a look at this next week, but based on what you've found, could you do the same change after the while loop on lines 84-87 (now 85-88 in your code)?

I tried adding the change to the second func call after the while loop as well. But when running my test examples, that part of the code is never triggered, i.e. active_heap is always empty when reaching it, so it does not make any difference in my case.

Let me know if I can provide some other info to help you out here.

Jakob37 commented 2 months ago

Looking into why this issue only happens sometimes.

Looks like the data is partitioned and the calculations are performed on each partition separately (probably for threading).

When ranges are processed in different partitions, there is no issue. But when multiple ranges are processed in the same partition, the issue described above is triggered.

Here, the first range ends just below 40,000,000. The coverage result of the second range looks sane.

19      30000000        39999999        49.76921967692197
19      49624990        49625000        39.2
19      49625010        49625020        74.5

Shifting it by just one base to an even 40,000,000, it looks insane:

19      30000000        40000000        49.7692204
19      49624990        49625000        47230219.2
19      49625010        49625020        74.5
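
One purely hypothetical way a single-base shift could matter: if regions were bucketed into fixed-size genomic chunks by their half-open end coordinate, an end of 40,000,000 would land in the next chunk, together with the 49.6 Mb regions, while 39,999,999 would not. The 40 Mb chunk size and the bucketing rule below are made up for illustration and are not necessarily what d4 actually does:

// Hypothetical illustration only; the chunk size and the rule are assumptions.
const CHUNK_SIZE: u64 = 40_000_000;

fn chunk_of(end: u64) -> u64 {
    end / CHUNK_SIZE
}

fn main() {
    println!("{}", chunk_of(39_999_999)); // 0: different chunk than the 49.6 Mb regions
    println!("{}", chunk_of(40_000_000)); // 1: same chunk...
    println!("{}", chunk_of(49_625_000)); // 1: ...as the later regions
}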

Jakob37 commented 2 months ago

@brentp have you been able to take a look at this?

I am considering putting together a PR with the minimal changes I made, but it would be great to have someone with more contextual understanding of this code base look at it.

Jakob37 commented 1 month ago

This should be fixed by #79 and included in version 0.3.10.