JetBrains-Research / big

BigWIG, BigBED and TDF for the JVM
MIT License
13 stars 2 forks source link

Replace BED score type with Int #43

Closed dievsky closed 5 years ago

dievsky commented 5 years ago

Currently it's Short, owing to the fact that UCSC defines BED score as an integer in range [0,1000]. The problem is that most BED providers don't respect these limits (MACS2 and SICER. to name a few).

I posit that our goal should be usability, not reliance to a standard no one implements, so changing score type to Int seems reasonable.

See issue https://github.com/JetBrains-Research/bioinf-commons/issues/2 as an inspiration for this PR.

iromeo commented 5 years ago

@dievsky I have a couple of questions 1) Why Integer, e.g. not Double ? 2) Could we use bed4+1 format instead of bed5 with changed score semantics? E.g. MACS2 output is BED4+6 why do we want to force it to be BED5+5 or bed BED6+4 ? May be better is to fixe auto format feature in bioinf-commons?

dievsky commented 5 years ago
  1. Because MACS2, SICER and others report an integer score. To be more specific, I've seen tons of BED files with score >= 32768, but not a single one with a non-integer score.
  2. In that case we could lose the strand, region RGB and other useful fields.
dievsky commented 5 years ago

The argument in 2 is mostly theoretical, though. SICER produces a bed4, and while MACS2 reports the strand, it's always . and can be ignored.

iromeo commented 5 years ago
  1. Because then we could lose the strand,

According to https://github.com/JetBrains-Research/bioinf-commons/issues/2 SICER support is the main point for such fix. Although SICER is in bed4+ and could work ok with current impl without this fix. MACS2 is also in BED4+6 format and we don't have any issues with strand / rgb fields.

Are you aware of real examples where such fix is required? E.g. examples with strand, rgb columns and score > 1000 ?

dievsky commented 5 years ago

As previously mentioned (and stated in the README), MACS2 is formally bed6+4 (or bed6+3 for broad peaks).

iromeo commented 5 years ago

@dievsky Ok, seems I have a use case when the fix is desired. If first lines have score < 1000 and some score > 1000 is somewhere in the middle (e.g. macs2, sicer or any other bed-like file) we cannot determine that the file is bed4+1 and parse it as bed5. So it is bioinf-commons issue but such fix simplifies bed-like files parsing

iromeo commented 5 years ago

Please fix feedback and merge pull request into the master. We could release it as a minor update.

dievsky commented 5 years ago

Since I've processed the feedback and the tests have passed, I'm merging the PR into master now.