Feature Plans and Roadmap

kescobo commented 1 year ago

Overview

We need a home for programmatic interaction with VKC lab data, including data stored on Airtable, AWS, and physical hard drives attached to lab computers like hopper and ada. This issue describes a basic outline of desired functionality and a punch-list of possible / desired features that a package or packages with such functionality would include.

Scope / need

Right now, we have separate repositories for different projects like Resonance, LEAP, and Galapagos, but all of these projects have many overlapping needs, such as querying the Airtable database and locating raw sequences or data products from various analyses. We also have the biobakery-nextflow repo containing workflows for running metagenomic sequencing data on AWS and the engaging cluster, and we need something similar for running amplicon and eventually ONT sequencing data.

I have also recently been working on unifying the use of various tools (eg metaphlan, QIIME) with compute containers. Managing and documenting all of this is necessary. Maybe using the wiki functionality?

Bikeshedding the name

This is going to go beyond just auditing sequencing files, so maybe it needs a different name? Realistically, many of the functions may come from separate packages / repos, but we'll want a central location to start from.

~~DataAudit.jl~~
~~DataManagement~~
~~VKCData~~
:heavy_check_mark: VKCComputing
~~ComputingResources~~
etc...

Feature ideas / needs

:bangbang: Definitely needed
:orange_circle: Would be nice
:blue_heart: Blue sky
:books: Documentation

Sequencing files

[ ] :bangbang: Given a list of dropbox URLs from IMR (sequencing facility), download, unpack, document, archive
[ ] :bangbang: Reconcile Airtable database with physical files on hopper, ada
[ ] :bangbang: Locate physical copies of processed files (eg functional profiles)
[ ] :orange_circle: Orchestrate AWS processing of MGX samples
[ ] :orange_circle: If files are missing, check AWS / other systems to see if they're there.
[ ] :orange_circle: Document data locations / duplications
- [ ] :blue_heart: With user-interaction, delete duplicates or initiate backups
[ ] :books: Explain how to run mgx (and eventually QIIME) pipelines on engaging
[ ] :books: Explain how to run mgx (and eventually QIIME) pipelines on AWS
[ ] :books: document issues with AWS
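
A minimal sketch of the "reconcile Airtable with physical files" item, written in Python purely for illustration. It assumes the database can supply a flat list of expected file names and that the drives on hopper / ada are reachable as ordinary directories; the function name `reconcile` and the shape of the returned report are hypothetical, not an existing API.

```python
from pathlib import Path


def reconcile(expected_files, roots):
    """Compare file names recorded in the database against files found on disk.

    expected_files: iterable of file names the database says should exist (assumed input).
    roots: directories to search recursively (e.g. mount points of lab drives).
    """
    found = {}  # file name -> every path where a file with that name was seen
    for root in roots:
        for path in Path(root).rglob("*"):
            if path.is_file():
                found.setdefault(path.name, []).append(str(path))

    expected = set(expected_files)
    on_disk = set(found)
    return {
        # recorded in the database but not found under any root
        "missing": sorted(expected - on_disk),
        # present on disk but unknown to the database
        "untracked": sorted(on_disk - expected),
        # same file name found in more than one place
        "duplicated": {name: sorted(paths) for name, paths in found.items() if len(paths) > 1},
    }
```

Matching on file name alone is a simplification; comparing checksums would be a safer basis for anything that deletes duplicates.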

Metadata / airtable

[ ] :bangbang: query samples based on certain features (eg Project = "Resonance", age < 6 months), get list of samples / file locations
[ ] :bangbang: Audit data products (eg Metaphlan profile, kneaddata log) and update airtable
- [ ] :orange_circle: Generate logs / commands to make it easier to complete tasks
[ ] :bangbang: scripts to generate IMR upload from template
[ ] :bangbang: Generate metadata upload files for things like SRA, Echo DAC, etc. See #1
[ ] :orange_circle: keep track of other metadata files (eg tables from Rhode Island / South Africa)
[ ] :orange_circle: Automatically update airtable status columns (eg for different steps of the pipeline)
[ ] :books: Explain what columns on airtable mean, what the source of truth is for each column
[ ] :blue_heart: Genie.jl app for sequencing center templates
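
One way the "query samples based on certain features" item could work is to build an Airtable formula string and pass it as the `filterByFormula` parameter when listing records. The sketch below (Python, for illustration only) just builds the string; the helper name `airtable_formula` and the field names `Project` and `age_months` are hypothetical, and values are assumed not to contain double quotes.

```python
def airtable_formula(equals=None, less_than=None):
    """Build an Airtable formula from simple equality and upper-bound filters.

    equals: {field name: required string value}
    less_than: {field name: numeric upper bound (exclusive)}
    """
    clauses = []
    for field, value in (equals or {}).items():
        # {Field} = "value" is Airtable's syntax for comparing a field to a string
        clauses.append(f'{{{field}}} = "{value}"')
    for field, limit in (less_than or {}).items():
        clauses.append(f"{{{field}}} < {limit}")

    if not clauses:
        return ""
    if len(clauses) == 1:
        return clauses[0]
    return "AND(" + ", ".join(clauses) + ")"


# Hypothetical field names; the real column names live in the lab's Airtable base.
formula = airtable_formula(equals={"Project": "Resonance"}, less_than={"age_months": 6})
print(formula)
```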

Analysis

For use in analysis repos

[ ] :bangbang: Find and load taxonomic / functional / metabolic profiles for a set of samples (if provided as a list)
[ ] :bangbang: Hook in with :point_up: to be able to query + load profiles according to certain features (eg project, age)
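
A rough sketch of "find and load profiles for a list of samples", in Python for illustration only. It assumes profiles are tab-separated files named `<sample>_profile.tsv` somewhere under a root directory, with a feature name in the first column and an abundance in the second; the naming scheme, the two-column layout, and both function names are assumptions, not the lab's actual conventions.

```python
import csv
from pathlib import Path


def find_profiles(samples, root, suffix="_profile.tsv"):
    """Map each sample ID to the first matching profile under root, or None if absent."""
    root = Path(root)
    located = {}
    for sample in samples:
        matches = sorted(root.rglob(f"{sample}{suffix}"))
        located[sample] = matches[0] if matches else None
    return located


def load_profile(path):
    """Read a two-column TSV into {feature: abundance}, skipping '#' comment lines."""
    profile = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue
            profile[row[0]] = float(row[1])
    return profile
```

Returning `None` for missing samples, rather than raising, keeps the lookup usable as an audit step as well as a loader.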

Lab documentation

[ ] :books: Replace https://klepac-ceraj-lab.github.io/ with wiki / something else easier to use / update
[ ] :books: Set of master documents outlining where everything is and how to access it [e.g., AWS, login info and other useful info]

Web API

[ ] :blue_heart: web-accessible end-point with access via https

Others?!?

Final thoughts

Please respond to this issue with comments/suggestions. I will add them here, and eventually split these things off as issues / PRs

psarwar commented 1 year ago

I just want to confirm what the end product is supposed to look like: I think what you're trying to do is have a repository of general computational needs of the lab which includes both documentation and the code required to achieve the things you indicate with !! So instead of having a repo for ECHO, with all the sequencing stuff repeated between ECHO and LEAP, you're trying to do the opposite: a sequencing repo that users of both projects can use.

For the metadata subheading: If its not already in your plan - we should definitely have a folder of derived metadata in this shared space. For example, the table you generated to get breastfeeding info should be stored here, stratified by user where versions differ (eg BF Kevin version vs Prioty version), especially because mine used manual curation and we can think of it as a type of 'raw' data since it's not fully code generated.

For the analysis subheading: I'm unclear on what you're trying to achieve that's different from the purpose of the bio-bakery nextflow - if we already have a separate repo for pipelines doesn't it make sense to have just one big analysis repo? Or is this meant to be more documentation heavy?

kescobo commented 1 year ago

> I think what you're trying to do is have a repository of general computational needs of the lab which includes both documentation and the code required to achieve the things you indicate with !!

Yes. Probably more than one repo in the end, but one central place that aggregates it.

> If its not already in your plan - we should definitely have a folder of derived metadata in this shared space

Yeah - I would consider the exports we get from the various sites as "raw data", and then any manipulations / processing is a "data product". Perhaps we should have a way of tracking what manipulations are done, version control, etc as well, but I think the key thing is having some way of discovering where the data lives, more than having a particular place where it lives.

> I'm unclear on what you're trying to achieve that's different from the purpose of the bio-bakery nextflow

The nextflow workflow just runs the biobakery tools. I'd also like to have a bit more automation to do things like syncing across the various machines, tracking progress for samples in airtable, logging file paths, etc.

Hugemiler commented 1 year ago

I think that if you want some "quick products" of this, you should definitely try to implement easy ways to get sequence processing metadata and QC variables. Things like, given a collection of sample identifiers, knowing which of those went through each part of the pipeline, and figures related to that (for example, the read_depth, richness, number of human reads excluded).

vanjakle commented 1 year ago

Excellent start. And I like the list of the must-dos and the rest. I also would love to have a master document that points to where things are, how to find them, and how to access them. I also like moving to a wiki.

Klepac-Ceraj-Lab / VKCComputing.jl