databio / bedboss

Python pipeline for processing BED files for BEDbase
https://docs.bedbase.org
BSD 2-Clause "Simplified" License

Improve BED Classifier #60

Open donaldcampbelljr opened 5 months ago

donaldcampbelljr commented 5 months ago

This issue tracks Phase 2 of the BED Classifier system. Phase 1: https://github.com/databio/bedboss/issues/34

As we populate the database, we will find some BED files are not correctly classified. A current example from the most recent upload: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6754599

We can start with narrowPeak classification: use GEOFetch to pull narrowPeak files, run BedClassifier on them to determine where the false negatives occur, adjust the classification algorithm, and then re-insert the new (and hopefully more accurate) classifications.
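A rough sketch of that evaluation step, assuming the narrowPeak files have already been downloaded to disk. The `classify` argument is a hypothetical stand-in for however BedClassifier is actually invoked; its name, signature, and string return value are assumptions, not the bedboss API.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def find_false_negatives(
    paths: Iterable[Path],
    classify: Callable[[Path], str],  # hypothetical wrapper around the classifier
    expected: str = "narrowpeak",
) -> List[Tuple[Path, str]]:
    """Return (path, predicted_label) for every file not labeled `expected`.

    Every file in `paths` is assumed to be a genuine narrowPeak file, so any
    other predicted label is counted as a false negative.
    """
    misses = []
    for path in paths:
        predicted = classify(path)
        # Compare case-insensitively so "narrowPeak" and "narrowpeak" match.
        if predicted.lower() != expected.lower():
            misses.append((path, predicted))
    return misses
```

The returned list is the set of files worth inspecting by hand before adjusting the classification rules.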

nsheff commented 1 month ago

@donaldcampbelljr can you give an update on the status of this?

donaldcampbelljr commented 1 month ago

This phase has not begun yet. The initial strategy posted above is still a good place to start, I believe. We did discuss potentially adding a column to the database that notes the discrepancy between the bedboss classification and the user/file extension classification (giving us a list of files to check). I don't believe this functionality was added to bedboss, however.
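For illustration, the discrepancy flag could be computed along these lines. This is only a sketch: the suffix-to-label mapping, the label strings, and the function names are all assumptions made for the example, and none of it exists in bedboss.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file suffix to the format the uploader implied.
EXTENSION_LABELS = {
    ".narrowpeak": "narrowpeak",
    ".broadpeak": "broadpeak",
    ".bed": "bed",
}

# Trailing compression suffixes to ignore when reading the extension.
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def label_from_extension(filename: str) -> Optional[str]:
    """Infer a format label from the file name, or None if it is unrecognized."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return None
    return EXTENSION_LABELS.get(suffixes[-1])


def has_discrepancy(filename: str, classifier_label: str) -> bool:
    """True when the extension implies a format that differs from the classifier's."""
    ext_label = label_from_extension(filename)
    return ext_label is not None and ext_label != classifier_label.lower()
```

Storing the result of `has_discrepancy` as a boolean column would give the list of files to check with a single filter query.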