Open Drmirdeep opened 5 years ago
Sorry, this looks like a serious error.
@Drmirdeep Do you have time to add a unit test for this change ? Unit tests are here: https://github.com/BradnerLab/pipeline/blob/master/bamliquidator_internal/bamliquidatorbatch/test.py
@charlesylin does someone from your team have time to review these changes?
I'm sorry for my long delay and that I can't devote more time to this right now. I'll follow up with tests and rigorous review in a week or two if nobody else has time.
@Drmirdeep I discussed briefly with @charlesylin . I think you might be using a different coordinate system than the one assumed by bamliquidator. Can you please review the below article and advise me? As I understand it, most bamliquidator users expect the current behavior which uses the UCSC coordinate system, so maybe it would make sense to add an option to override the default coordinate system. I'd appreciate your recommendation here.
http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/
No worries, everybody is quite busy. I may have some time over chrismas to make a unit test for it. Anyhow the tool demands a gff as input (1-based and fully closed) but then uses bed file logic internally (0-based and half open). Nonetheless, the interval sizes are somehow wrongly calculated in any case. With my little corrections it worked as expected for this part of the code. This I checked quite extensively already by running the internal bins which report back the coverage and the offset there is pretty obvious. I didn't take a closer look at the rest of the source which would need to much time now.
Can someone please assign themselves to review this? I'm sorry I can't for the foreseeable future. I can't just merge this because there are concerns that changing the bin size will break other people's workflow suddenly resulting in an unexpected change in behavior.
@Drmirdeep if you'd like to add an argument so that by default the bin size behavior remains unchanged, I'd feel more comfortable approving and merging this in.
@Drmirdeep you can also rebase with master to get fixes for testing, along with your change.
Binsize calculation is based on the gff input file and thus needs to add 1 for the region size. Since this was missing an offset error accumulated with the number of bins used and the end of a region was never been used for calculation. E.g. A region with 60 bins would lack the last 60bp of a region due to the offset error accumulation.