Closed nsheff closed 5 months ago
This is related to databio/bedbase#54
Some notes:
Currently, we already specify the input_type
to the bedboss pipeline (it is required). The pipeline also has an optional boolean flag to indicate if the file is narrowPeak
Supported input types for bedmaker to convert to BED format:
This implies that the above are known prior to running the pipeline and would be easy to add as a column to the database (I assume it already is).
Unknowns to determine during pipeline: -is the BED file a BED3, a BED 7, etc. -is it a broad peak? -is it in BED detail format?
Questions:
.bed
vs .bed7
?Does the pipeline need to support GFF formats?
No. But if you think it makes sense and is easy to do, I would add it if it can be done with little additional effort.
Can the pipeline handle file extensions that indicate the extra columns, e.g. .bed vs .bed7?
are these officially approved extensions? using extensions as hints seems fine to me, but we should not rely on them.
What happens if an input file says it is one thing but we realize its something else?
Flag it for manual followup, somehow.
I don't believe a .bed3
is a real format. I believe it is simply .bed
file but could be referred to as a "BED3":
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
Some initial work creating bedclassifier in branch dev_assign_bed_format
: https://github.com/databio/bedboss/commit/ab9974f4ba3ea7379f6b6315b27ac645305ac0ae
Will begin testing this early next week.
this is a great example Narrow and broadpeak: https://pephub.databio.org/bedbase/gse249578?tag=samples where each sample is presented in 2 different types
This functionality has been merged to Dev. It will be worth investigating bed files that are classified as "unknown_bedtype" to determine what improvements can be made. I manually tested on a small subset of BED files from PEPHub: https://pephub.databio.org/bedbase
Also, during work on this issue, I found this related reading on BED specifications: https://samtools.github.io/hts-specs/BEDv1.pdf
This is in addition to the UCSC FAQ on Formats
What about gapped peaks with 15 columns? Some info can be found here: https://www.biostars.org/p/242501/
Looking at some Encode files, there are some that are on Encode as a narrowpeak but do NOT have narrowpeak
in the file name.
Some more manual testing on random Encode Bed Files:
Current Encode | Current Bed Classifier | Bed Type | |
---|---|---|---|
ENCFF003QKM | narrowPeak | Bed6+4 | bed |
ENCFF850BBV | narrowPeak | Bed6+4 | bed |
ENCFF760EGX | bed3+ | Bed4+1 | bed |
ENCFF271CVJ | broadPeak | Bed6+3 | bed |
ENCFF736FKW | broadPeak | Bed6+3 | bed |
ENCFF263ESY | bed6+ | Bed6+2 | bed |
ENCFF275JUS | bed6+ | Bed6+2 | bed |
ENCFF121IMX | broadPeak | Bed6+3 | bed |
ENCFF140VSE | narrowPeak | Bed7+3 | bed |
It appears that we could continue to improve the classification algo with regard to bed type
.
Ok, I have this working better now:
Current Encode | Current Bed Classifier | Bed Type | |
---|---|---|---|
ENCFF003QKM | narrowPeak | Bed6+4 | narrowpeak |
ENCFF850BBV | narrowPeak | Bed6+4 | narrowpeak |
ENCFF760EGX | bed3+ | Bed4+1 | bed |
ENCFF271CVJ | broadPeak | Bed6+3 | broadpeak |
ENCFF736FKW | broadPeak | Bed6+3 | broadpeak |
ENCFF263ESY | bed6+ | Bed6+2 | bed |
ENCFF275JUS | bed6+ | Bed6+2 | bed |
ENCFF121IMX | broadPeak | Bed6+3 | broadpeak |
ENCFF140VSE | narrowPeak | Bed7+3 | bed |
Note: this file is bed4+6
, but we think it should be flaged as bed6+4
GSM7021587_RS-03602733_peaks.narrowPeak.gz
if df[col].dtype == "int" and df[col].between(0, 1000).all():
bedtype += 1
Looking at this bed file in particular, the 5th column (where this code is evaluating) has values greater than 1000:
chr1 10008 10466 RS-03602733_peak_1 1487 . 15.13945 153.53610 148.73769 115
This 5th column represents score
and ideally should be between 100-1000:
https://genome.ucsc.edu/FAQ/FAQformat.html#format12
Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were "'0"' when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.
However, some searching shows that this might be related to an old MACS2 issue where these values could be reported (erroneously) over 1000: https://github.com/macs3-project/MACS/issues/123
Related reading shows a workaround that actually involves reporting the narrowpeak to bedToBigBed as a bed4+6
(which is what our classifier output) not as a bed6+4
due to these column values being greater than 1000: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/wZsmrO9m0bg
Any thoughts on this? Should we remove the upper limit of 1000 and have this file be classifed as a bed6+4, narrowpeak instead?
Hey, I found another one https://bedbase.org/bed/607821f77ab0af9bc09bb25163b4e861
This is the same issue as above.
@nsheff could you please comment issue above?
The 0-1000 range definition is from the way UCSC displays the tracks (color intensity).
In some cases, people might be using this format in essence, but not obeying that 0-1000 range explicitly because the intent was never to visualize with the UCSC genome browser.
Does it makes sense to have another classification for these, like bed6+4+extended, or use bed6+4 but add a 'nonstandard' flag?
I think of this as a bed6+4, but a not-quite-compliant or relaxed version of it, which for many purposes would be OK, but we might just want to note that somewhere...
This leads to a new idea: what if instead of just classifying these with a single variable that can be a type, we used a more complex classification, that involved a series of variables?
@donaldcampbelljr could you please finish this task ASAP?
Next steps were to assess classified bedfiles (during database rebuild) and determine where gaps exist and then do a revision/reclassification after analysis. Therefore, this will continue to be a WIP for some time.
I think we just need to look at this in phases. @donaldcampbelljr are you finished with "phase I", which we can deploy in the next round? We can revisit/revise this of course, but it's important to have stopping points or a "freeze" where we can say it's not finished, but it's useful at this point.
Yes. I am finished with phase 1. The current work can be deployed in the next round.
fixed in 0.2.0
BEDbase is holding lots of files that are BED and some that are BED-like (narrowpeak, etc).
We should do some analysis of the file to determine what format it is, and make this available as part of the information about the BED file.
To do this, we would need:
report
the assigned format.