BradnerLab / pipeline

bradner lab computation pipeline scripts
Other
52 stars 47 forks source link

Correct binsize calculation #70

Open Drmirdeep opened 4 years ago

Drmirdeep commented 4 years ago

Binsize calculation is based on the gff input file and thus needs to add 1 for the region size. Since this was missing an offset error accumulated with the number of bins used and the end of a region was never been used for calculation. E.g. A region with 60 bins would lack the last 60bp of a region due to the offset error accumulation.

jdimatteo commented 4 years ago

Sorry, this looks like a serious error.

@Drmirdeep Do you have time to add a unit test for this change ? Unit tests are here: https://github.com/BradnerLab/pipeline/blob/master/bamliquidator_internal/bamliquidatorbatch/test.py

@charlesylin does someone from your team have time to review these changes?

jdimatteo commented 4 years ago

I'm sorry for my long delay and that I can't devote more time to this right now. I'll follow up with tests and rigorous review in a week or two if nobody else has time.

jdimatteo commented 4 years ago

@Drmirdeep I discussed briefly with @charlesylin . I think you might be using a different coordinate system than the one assumed by bamliquidator. Can you please review the below article and advise me? As I understand it, most bamliquidator users expect the current behavior which uses the UCSC coordinate system, so maybe it would make sense to add an option to override the default coordinate system. I'd appreciate your recommendation here.

http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/

Drmirdeep commented 4 years ago

No worries, everybody is quite busy. I may have some time over chrismas to make a unit test for it. Anyhow the tool demands a gff as input (1-based and fully closed) but then uses bed file logic internally (0-based and half open). Nonetheless, the interval sizes are somehow wrongly calculated in any case. With my little corrections it worked as expected for this part of the code. This I checked quite extensively already by running the internal bins which report back the coverage and the offset there is pretty obvious. I didn't take a closer look at the rest of the source which would need to much time now.

jdimatteo commented 4 years ago

Can someone please assign themselves to review this? I'm sorry I can't for the foreseeable future. I can't just merge this because there are concerns that changing the bin size will break other people's workflow suddenly resulting in an unexpected change in behavior.

@Drmirdeep if you'd like to add an argument so that by default the bin size behavior remains unchanged, I'd feel more comfortable approving and merging this in.

vsoch commented 4 years ago

@Drmirdeep you can also rebase with master to get fixes for testing, along with your change.