Specify different flavors of BED - Githubissues

databio / bedboss

Python pipeline for processing BED files for BEDbase

https://docs.bedbase.org

BSD 2-Clause "Simplified" License

1 stars 0 forks source link

Specify different flavors of BED #34

Closed nsheff closed 5 months ago

nsheff commented 9 months ago

BEDbase is holding lots of files that are BED and some that are BED-like (narrowpeak, etc).

We should do some analysis of the file to determine what format it is, and make this available as part of the information about the BED file.

To do this, we would need:

[x] determine a set of bed-like formats we should classify. everything that doesn't fit into these becomes 'unknown bed' or something
[x] write a tool that can assign the format to a particular file
[x] deploy this tool in bedqc or bedstat, so it will report the assigned format.
[x] this should make it automatically become available in the metadata endpoint (Release bedhost v0.3.0)

nsheff commented 9 months ago

This is related to databio/bedbase#54

donaldcampbelljr commented 7 months ago

Some notes:

Currently, we already specify the input_type to the bedboss pipeline (it is required). The pipeline also has an optional boolean flag to indicate if the file is narrowPeak

Supported input types for bedmaker to convert to BED format:

bedGraph
bigBed
bigWig
wig

This implies that the above are known prior to running the pipeline and would be easy to add as a column to the database (I assume it already is).

Unknowns to determine during pipeline: -is the BED file a BED3, a BED 7, etc. -is it a broad peak? -is it in BED detail format?

Questions:

Does the pipeline need to support GFF formats?
Can the pipeline handle file extensions that indicate the extra columns, e.g. .bed vs .bed7?
What happens if an input file says it is one thing but we realize its something else?

nsheff commented 7 months ago

Does the pipeline need to support GFF formats?

No. But if you think it makes sense and is easy to do, I would add it if it can be done with little additional effort.

Can the pipeline handle file extensions that indicate the extra columns, e.g. .bed vs .bed7?

are these officially approved extensions? using extensions as hints seems fine to me, but we should not rely on them.

What happens if an input file says it is one thing but we realize its something else?

Flag it for manual followup, somehow.

donaldcampbelljr commented 7 months ago

I don't believe a .bed3 is a real format. I believe it is simply .bed file but could be referred to as a "BED3": https://genome.ucsc.edu/FAQ/FAQformat.html#format1

donaldcampbelljr commented 7 months ago

Some initial work creating bedclassifier in branch dev_assign_bed_format: https://github.com/databio/bedboss/commit/ab9974f4ba3ea7379f6b6315b27ac645305ac0ae

Will begin testing this early next week.

khoroshevskyi commented 7 months ago

this is a great example Narrow and broadpeak: https://pephub.databio.org/bedbase/gse249578?tag=samples where each sample is presented in 2 different types

donaldcampbelljr commented 6 months ago

This functionality has been merged to Dev. It will be worth investigating bed files that are classified as "unknown_bedtype" to determine what improvements can be made. I manually tested on a small subset of BED files from PEPHub: https://pephub.databio.org/bedbase

Also, during work on this issue, I found this related reading on BED specifications: https://samtools.github.io/hts-specs/BEDv1.pdf

This is in addition to the UCSC FAQ on Formats

khoroshevskyi commented 6 months ago

What about gapped peaks with 15 columns? Some info can be found here: https://www.biostars.org/p/242501/

donaldcampbelljr commented 6 months ago

Looking at some Encode files, there are some that are on Encode as a narrowpeak but do NOT have narrowpeak in the file name.

donaldcampbelljr commented 6 months ago

Some more manual testing on random Encode Bed Files:

	Current Encode	Current Bed Classifier	Bed Type
ENCFF003QKM	narrowPeak	Bed6+4	bed
ENCFF850BBV	narrowPeak	Bed6+4	bed
ENCFF760EGX	bed3+	Bed4+1	bed
ENCFF271CVJ	broadPeak	Bed6+3	bed
ENCFF736FKW	broadPeak	Bed6+3	bed
ENCFF263ESY	bed6+	Bed6+2	bed
ENCFF275JUS	bed6+	Bed6+2	bed
ENCFF121IMX	broadPeak	Bed6+3	bed
ENCFF140VSE	narrowPeak	Bed7+3	bed

It appears that we could continue to improve the classification algo with regard to bed type.

donaldcampbelljr commented 6 months ago

Ok, I have this working better now:

	Current Encode	Current Bed Classifier	Bed Type
ENCFF003QKM	narrowPeak	Bed6+4	narrowpeak
ENCFF850BBV	narrowPeak	Bed6+4	narrowpeak
ENCFF760EGX	bed3+	Bed4+1	bed
ENCFF271CVJ	broadPeak	Bed6+3	broadpeak
ENCFF736FKW	broadPeak	Bed6+3	broadpeak
ENCFF263ESY	bed6+	Bed6+2	bed
ENCFF275JUS	bed6+	Bed6+2	bed
ENCFF121IMX	broadPeak	Bed6+3	broadpeak
ENCFF140VSE	narrowPeak	Bed7+3	bed

khoroshevskyi commented 6 months ago

Note: this file is bed4+6 , but we think it should be flaged as bed6+4 GSM7021587_RS-03602733_peaks.narrowPeak.gz

donaldcampbelljr commented 6 months ago

  if df[col].dtype == "int" and df[col].between(0, 1000).all():
      bedtype += 1

Looking at this bed file in particular, the 5th column (where this code is evaluating) has values greater than 1000: chr1 10008 10466 RS-03602733_peak_1 1487 . 15.13945 153.53610 148.73769 115

This 5th column represents score and ideally should be between 100-1000: https://genome.ucsc.edu/FAQ/FAQformat.html#format12

Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were "'0"' when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.

However, some searching shows that this might be related to an old MACS2 issue where these values could be reported (erroneously) over 1000: https://github.com/macs3-project/MACS/issues/123

Related reading shows a workaround that actually involves reporting the narrowpeak to bedToBigBed as a bed4+6 (which is what our classifier output) not as a bed6+4 due to these column values being greater than 1000: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/wZsmrO9m0bg

Any thoughts on this? Should we remove the upper limit of 1000 and have this file be classifed as a bed6+4, narrowpeak instead?

khoroshevskyi commented 6 months ago

Hey, I found another one https://bedbase.org/bed/607821f77ab0af9bc09bb25163b4e861

This is the same issue as above.

@nsheff could you please comment issue above?

nsheff commented 6 months ago

The 0-1000 range definition is from the way UCSC displays the tracks (color intensity).

In some cases, people might be using this format in essence, but not obeying that 0-1000 range explicitly because the intent was never to visualize with the UCSC genome browser.

Does it makes sense to have another classification for these, like bed6+4+extended, or use bed6+4 but add a 'nonstandard' flag?

I think of this as a bed6+4, but a not-quite-compliant or relaxed version of it, which for many purposes would be OK, but we might just want to note that somewhere...

This leads to a new idea: what if instead of just classifying these with a single variable that can be a type, we used a more complex classification, that involved a series of variables?

khoroshevskyi commented 5 months ago

@donaldcampbelljr could you please finish this task ASAP?

donaldcampbelljr commented 5 months ago

Next steps were to assess classified bedfiles (during database rebuild) and determine where gaps exist and then do a revision/reclassification after analysis. Therefore, this will continue to be a WIP for some time.

nsheff commented 5 months ago

I think we just need to look at this in phases. @donaldcampbelljr are you finished with "phase I", which we can deploy in the next round? We can revisit/revise this of course, but it's important to have stopping points or a "freeze" where we can say it's not finished, but it's useful at this point.

donaldcampbelljr commented 5 months ago

Yes. I am finished with phase 1. The current work can be deployed in the next round.

khoroshevskyi commented 5 months ago

fixed in 0.2.0