Closed dievsky closed 5 years ago
@dievsky I have a couple of questions
1) Why Integer
, e.g. not Double
?
2) Could we use bed4+1 format instead of bed5 with changed score
semantics? E.g. MACS2 output is BED4+6
why do we want to force it to be BED5+5
or bed BED6+4
? May be better is to fixe auto format feature in bioinf-commons?
The argument in 2 is mostly theoretical, though. SICER produces a bed4
, and while MACS2 reports the strand, it's always .
and can be ignored.
- Because then we could lose the strand,
According to https://github.com/JetBrains-Research/bioinf-commons/issues/2
SICER support is the main point for such fix. Although SICER is in bed4+ and could work ok with current impl without this fix. MACS2 is also in BED4+6 format and we don't have any issues with strand / rgb fields.
Are you aware of real examples where such fix is required? E.g. examples with strand
, rgb
columns and score > 1000
?
As previously mentioned (and stated in the README), MACS2 is formally bed6+4
(or bed6+3
for broad peaks).
@dievsky Ok, seems I have a use case when the fix is desired. If first lines have score < 1000 and some score > 1000 is somewhere in the middle (e.g. macs2, sicer or any other bed-like file) we cannot determine that the file is bed4+1 and parse it as bed5. So it is bioinf-commons
issue but such fix simplifies bed-like files parsing
Please fix feedback and merge pull request into the master. We could release it as a minor update.
Since I've processed the feedback and the tests have passed, I'm merging the PR into master
now.
Currently it's Short, owing to the fact that UCSC defines BED score as an integer in range [0,1000]. The problem is that most BED providers don't respect these limits (MACS2 and SICER. to name a few).
I posit that our goal should be usability, not reliance to a standard no one implements, so changing score type to Int seems reasonable.
See issue https://github.com/JetBrains-Research/bioinf-commons/issues/2 as an inspiration for this PR.