38 / d4-format

The D4 Quantitative Data Format
MIT License
150 stars 20 forks source link

New stat function: % of positions in intervals with coverage >X #81

Open Jakob37 opened 1 month ago

Jakob37 commented 1 month ago

This is something we are looking at over at the https://github.com/Clinical-Genomics/chanjo2 repo.

The current approach is to read the output from the d4 file using d4tools show. It would be much quicker to do this directly in d4tools.

The execution could be something like:

$ d4tools stat --stat perc_cov "10,20,30" --region "intervals.bed"

The desired output would be something like:

chr     start           end             10x  20x   30x
19      31030023        32034023        98.3 96.3  94.1
19      49624990        49625000        94.2 90.3  85.3
19      49625010        49625020        88.5 80.3  75.3

Feedbacks on this are welcome. If I have the time, and no one else jumps on it, I'll give it a go and see if I can implement it.

cademirch commented 1 month ago

Funnily enough I've been working on something related to this using the d4 crate, though I hadn't considered implementing it within d4tools show. Unfortunately I haven't had time lately to finish it up.

Your idea should be relatively straightforward to implement using d4's Tasks. I'd be happy to take a stab when I have some more free time in the coming weeks. Also can chip in if you've already started!

Jakob37 commented 1 month ago

Funnily enough I've been working on something related to this using the d4 crate, though I hadn't considered implementing it within d4tools show. Unfortunately I haven't had time lately to finish it up.

Your idea should be relatively straightforward to implement using d4's Tasks. I'd be happy to take a stab when I have some more free time in the coming weeks. Also can chip in if you've already started!

This is great to hear @cademirch I haven't gotten started with this yet. This would be extremely useful for us, as it would resolve our (hopefully) final performance bottleneck before we could use this in production.

I would be happy to help out with testing or in any other ways that might be helpful, just let me know!

cademirch commented 1 month ago

@Jakob37 Ended up having some free time today and took a crack at this here: https://github.com/cademirch/d4-format/tree/perc_cov

I think it works as expected... I tested a little on some random d4 files:

d4tools show -R regions.bed input.d4:

JANIGQ010000002.1       0       5       0
JANIGQ010000002.1       5       29      1
JANIGQ010000002.1       29      80      2
JANIGQ010000002.1       80      100     3

Then, d4tools stat -H -s perc_cov=1,2,5 -r regions.bed input.d4:

#Chr    Start   End     1x      2x      5x
JANIGQ010000002.1       0       100     0.940   0.700   0.000

Currently this wont work with multi-track d4 files, so just fyi. Also, I'm super new to Rust so this is probably not merge worthy. Lmk what you think!

Jakob37 commented 1 month ago

Nice @cademirch , I'll take a look during the day!

Jakob37 commented 1 month ago

I am trying this out. It looks great!! Our previous way of processing 74 intervals with three thresholds took ~15 seconds. With your function it takes only 0.5s.

The results looks identical to the ones we produces as well. So it looks like it is both quick and doing what it should. Nice job 💪

Jakob37 commented 1 month ago

It would be great to get this one into d4tools in some way actually. I looked at the diff, and the code looks well organized, but I am unfortunately not experienced in Rust either.

Still, could you put this up as a PR? And then we could see if anyone else could take a look at it.