bartongroup / yanocomp

Yet another nanopore modification comparison tool
MIT License

Empty outputs with lower coverage #15

Open bhargava-morampalli opened 6 months ago

bhargava-morampalli commented 6 months ago

I have been testing yanocomp at multiple coverage levels for the same data, and for lower coverages (less than 70x) the output is empty. Is there anything about how the tool works that causes this? In general, as coverage goes down (from 1000x), the output gets truncated at some positions (some positions do have low coverage relative to the overall coverage because of the way filtering was done), and it is completely empty at less than 70x coverage.

It would be helpful if anyone could explain why this is happening. Thank you.

mparker2 commented 6 months ago

Hi @bhargava-morampalli,

Yanocomp is not really being actively developed any more, but I looked at the code to remind myself how it works...

The minimum coverage is set dynamically depending on the window size used for modelling:

https://github.com/bartongroup/yanocomp/blob/afda4b50f53de3d039253c6cba54b19a66b2f436/yanocomp/gmmtest.py#L245-L251

So, when the default window of 3 adjacent kmers is used for modelling, the min depth (per replicate) is set to 6. This is to ensure that the number of samples used to fit the model is always greater than the number of features.
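Roughly, the logic is something like this (a minimal sketch rather than the code linked above; the assumption that each k-mer contributes two signal features is mine):

```python
# Minimal sketch (not the actual yanocomp code linked above) of how a
# dynamic minimum depth per replicate could follow from the window size.
# Assumption: each k-mer in the window contributes two signal features
# (e.g. mean current and dwell time), so the feature count is 2 * window.

def min_depth_for_window(window_size=3):
    """Minimum reads per replicate so that the number of samples fitted
    by the GMM is never smaller than the number of features."""
    n_features = 2 * window_size  # hypothetical: two features per k-mer
    return n_features

print(min_depth_for_window(3))  # 6, matching the behaviour described above
```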

Coverage in an RNA sequencing dataset can vary across several orders of magnitude depending on the gene, so it is unclear to me how you can be sure that all positions have approximately 70x coverage. Are you just testing one gene?

bhargava-morampalli commented 6 months ago

That's exactly right, I am testing it on one gene and the coverage is close to 70x after filtering based on total bases, but there are definitely areas where the coverage dips a lot (maybe due to which reads were included in the filtered dataset). Weirdly, there is not even an output below 70x; I'm not sure why that's happening.
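The filtering was something like the sketch below (a simplified illustration of read-level subsampling, not my exact script; the read lengths, gene length and target coverage are placeholders):

```python
# Simplified sketch of subsampling reads for a single gene until the summed
# bases give roughly the target coverage. Inputs are placeholder values.
import random

def subsample_to_coverage(read_lengths, gene_length, target_coverage=70, seed=0):
    """Shuffle reads and keep them until total bases / gene length
    reaches the requested coverage; returns indices of kept reads."""
    rng = random.Random(seed)
    order = list(range(len(read_lengths)))
    rng.shuffle(order)
    kept, total_bases = [], 0
    for i in order:
        if total_bases / gene_length >= target_coverage:
            break
        kept.append(i)
        total_bases += read_lengths[i]
    return kept

# e.g. a 1.5 kb gene covered by many ~1 kb reads
kept = subsample_to_coverage([1000] * 500, gene_length=1500, target_coverage=70)
```

Because whole reads are kept, per-position coverage can still dip well below the overall target, which is probably why some positions drop out of the output before the whole output disappears.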

mparker2 commented 6 months ago

That is weird; I'm not sure what is occurring in that case. Does it happen consistently across many different subsamples?