databio / bedbase

Aggregate, analyze, and serve genomic regions.
http://bedbase.org/

Roadmap for asynchronous BEDbase/BEDboss data processing #84

Open nsheff opened 1 week ago

We need these components for asynchronous processing of BED files to populate bedbase.

  1. [x] Automatic registering of new BED files posted to GEO.
    • We already create a PEP of new BED files automatically using a GitHub Action.
    • The bbuploader process reads from PEPhub (run by bedboss geo upload-all).
    • [ ] Implement a --light CLI argument for bedboss geo upload --light ... (compute ID, upload BED file, collect and input metadata).
    • [ ] Add a new GitHub Action that runs this process weekly.
    • [ ] Make sure current endpoints properly ignore "stub" records (those that haven't yet gone through the full/heavy process), if needed.
  2. [x] Allowing users to register new BEDsets
    • [x] endpoint for registering a BEDset
    • POST a PEPhub registry path. BEDbase retrieves this PEP, validates it against the BEDbase::bedset schema, and then creates a new BEDset in the database. (This relies completely on PEPhub auth, so no further auth is required?) The endpoint should validate against the PEP schema and do other validations. Limit size for now to, say, 2k BEDs? Maybe limit throughput? Maybe, to start, hard-code a list of "allowed" PEPhub namespaces.
    • [x] Create a button on the BEDbase cart page to "Create BEDset PEP for this Cart". (This would require the user to authenticate with PEPhub...)
  3. [ ] Daemon that retrieves unprocessed BED files and processes them (plots and statistics)
    • [ ] endpoint for unprocessed files?
    • [ ] endpoint for unprocessed plots
    • [ ] endpoint for unprocessed statistics
    • [ ] script that hits these endpoints and then does the correct processing (in BEDboss)
    • [ ] wrapper daemon that runs the above script, sleeping between runs.
  4. [ ] Daemon that retrieves unprocessed BED sets and processes them.
    • [ ] endpoint for retrieving which BEDsets are registered, but not processed
    • [ ] script that hits the bedset endpoints and then does the correct processing (in BEDboss)
    • [ ] wrapper daemon that runs the above script, sleeping between runs.
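
The daemons in items 3 and 4 share one shape: ask an endpoint which records are unprocessed, run the BEDboss processing on each, then sleep and repeat. Below is a minimal Python sketch of that loop, not a proposed implementation. Everything named here is an assumption for illustration: `fetch_unprocessed` and `process` are hypothetical callables standing in for the not-yet-defined endpoints and the BEDboss processing step, and the 600-second interval is an arbitrary placeholder.

```python
import logging
import time
from typing import Callable, Iterable, Optional

log = logging.getLogger("bedbase-daemon-sketch")


def run_once(
    fetch_unprocessed: Callable[[], Iterable[str]],
    process: Callable[[str], None],
) -> dict:
    """One pass: fetch pending record IDs and process each of them.

    Both callables are hypothetical placeholders: `fetch_unprocessed` would
    query an "unprocessed" endpoint, `process` would run the BEDboss step.
    """
    done, failed = [], []
    for record_id in fetch_unprocessed():
        try:
            process(record_id)
            done.append(record_id)
        except Exception as err:
            # Keep going so that one bad record does not stall the queue.
            log.warning("processing %s failed: %s", record_id, err)
            failed.append((record_id, str(err)))
    return {"done": done, "failed": failed}


def run_daemon(
    fetch_unprocessed: Callable[[], Iterable[str]],
    process: Callable[[str], None],
    interval_s: float = 600.0,  # placeholder value, not from the roadmap
    max_cycles: Optional[int] = None,  # None means run forever
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Wrapper loop: run a pass, sleep, repeat. Returns the number of passes."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        summary = run_once(fetch_unprocessed, process)
        log.info("pass %d: %d done, %d failed",
                 cycles + 1, len(summary["done"]), len(summary["failed"]))
        cycles += 1
        # Sleep only if another pass will follow.
        if max_cycles is None or cycles < max_cycles:
            sleep(interval_s)
    return cycles


if __name__ == "__main__":
    # Stand-in queue so the sketch runs without any network access.
    pending = ["bed-1", "bed-2"]
    run_daemon(lambda: list(pending), pending.remove,
               interval_s=0, max_cycles=2)
    print(pending)
```

Passing the fetch and process steps in as callables means the same wrapper could serve both the BED-file daemon and the BEDset daemon, and the loop can be exercised with in-memory stand-ins before any endpoint exists.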