Klepac-Ceraj-Lab / VKCComputing.jl


Feature Plans and Roadmap #2

Open kescobo opened 1 year ago

kescobo commented 1 year ago

Overview

We need a home for programmatic interaction with VKC lab data, including data stored on Airtable, AWS, and physical hard drives attached to lab computers like hopper and ada. This issue describes a basic outline of desired functionality and a punch-list of possible / desired features that a package or packages with such functionality would include.

Scope / need

Right now, we have separate repositories for different projects like Resonance, LEAP, and Galapagos, but all of these projects have many overlapping needs, such as querying the Airtable database and locating raw sequences or data products from various analyses. We also have the biobakery-nextflow repo containing workflows for processing metagenomic sequencing data on AWS and the engaging cluster, and we need something similar for amplicon and eventually ONT sequencing data.

I have also recently been working on unifying the use of various tools (e.g. metaphlan, QIIME) with compute containers. Managing and documenting all of this is necessary. Maybe using the wiki functionality?

Bikeshedding the name

This is going to go beyond just auditing sequencing files, so maybe it needs a different name? Realistically, many of the functions may come from separate packages / repos, but we'll want a central location to start from.

Feature ideas / needs

- Sequencing files
- Metadata / airtable
- Analysis
- For use in analysis repos
- Lab documentation
- Web API
- Others?!?

Final thoughts

Please respond to this issue with comments/suggestions. I will add them here, and eventually split these things off as issues / PRs.

psarwar commented 1 year ago

I just want to confirm what the end product is supposed to look like. I think what you're trying to do is have a repository of general computational needs of the lab which includes both documentation and the code required to achieve the things you indicate with !! So instead of having a repo for ECHO with all the sequencing stuff repeated between ECHO and LEAP, you're trying to do the opposite: a sequencing-stuff repo that users of both projects can use.

For the metadata subheading: if it's not already in your plan, we should definitely have a folder of derived metadata in this shared space. For example, the table you generated to get breastfeeding info should be stored here, stratified by user too in the case of BF (a Kevin version and a Prioty version), especially because mine used manual curation, so we can think of it as a type of 'raw' data since it's not fully code-generated.

For the analysis subheading: I'm unclear on what you're trying to achieve that's different from the purpose of biobakery-nextflow. If we already have a separate repo for pipelines, doesn't it make sense to have just one big analysis repo? Or is this meant to be more documentation-heavy?

kescobo commented 1 year ago

I think what you're trying to do is have a repository of general computational needs of the lab which includes both documentation and the code required to achieve the things you indicate with !!

Yes. Probably more than one repo in the end, but one central place that aggregates it.

If its not already in your plan - we should definitely have a folder of derived metadata in this shared space

Yeah - I would consider the exports we get from the various sites as "raw data", and then any manipulations / processing is a "data product". Perhaps we should have a way of tracking what manipulations are done, version control, etc. as well, but the key piece is that there's some way of discovering where it lives, more than having a particular place where it lives, I think.
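To make the "discoverable, not necessarily centralized" idea concrete, here's a minimal sketch of a data-product registry that records where each derived table lives and how it was produced. Everything here (the registry layout, the `find_product` helper, the example entry) is a hypothetical illustration, not part of any existing lab package:

```python
# Hypothetical registry of derived data products: each entry records location
# and provenance, so the data can live anywhere as long as it's registered.
REGISTRY = {
    "breastfeeding_curated": {
        "location": "hopper:/lab/data_products/breastfeeding_curated.csv",
        "generated_by": "manual curation + scripts",
        "version": "v2",
    },
}

def find_product(name, registry=REGISTRY):
    """Return the recorded location and provenance of a named data product."""
    if name not in registry:
        raise KeyError(f"no registered data product named {name!r}")
    return registry[name]
```

In practice the registry could live in Airtable or a versioned file in a shared repo; the point is that discovery goes through one lookup rather than knowing which machine a file sits on.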

I'm unclear on what you're trying to achieve that's different from the purpose of the bio-bakery nextflow

The nextflow workflow just runs the biobakery tools. I'd also like to have a bit more automation to do things like syncing across the various machines, tracking progress for samples in Airtable, logging file paths, etc.
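As one sketch of the "track progress in Airtable" piece, the function below builds the batch-update payload Airtable's REST API expects when marking a sample's processing status. The `{"records": [{"id": ..., "fields": {...}}]}` envelope matches Airtable's documented update format, but the field names ("Processing status", "Output path") are made-up placeholders for whatever the lab's base actually uses:

```python
def progress_update_payload(record_id, status, output_path=None):
    """Build an Airtable batch-update payload for a sample's pipeline status.

    Field names here are hypothetical; adjust to the actual Airtable base.
    """
    fields = {"Processing status": status}
    if output_path is not None:
        fields["Output path"] = output_path
    return {"records": [{"id": record_id, "fields": fields}]}
```

This payload would then be sent as an authenticated PATCH to `https://api.airtable.com/v0/<base>/<table>`, e.g. from a post-run hook in the workflow.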

Hugemiler commented 1 year ago

I think that if you want some "quick products" of this, you should definitely try to implement easy ways to get sequence-processing metadata and QC variables: things like, given a collection of sample identifiers, knowing which of those went through each part of the pipeline, plus figures related to that (for example, read_depth, richness, and the number of human reads excluded).
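One way to sketch this per-sample pipeline audit: given sample identifiers and a record of which samples completed each step, report a per-sample completion map. The step names and data layout below are illustrative assumptions, not an existing interface:

```python
def pipeline_status(samples, completed_by_step):
    """Return {sample: {step: bool}} showing which steps each sample finished."""
    return {
        s: {step: s in done for step, done in completed_by_step.items()}
        for s in samples
    }

# Hypothetical completion records (e.g. derived from output files or Airtable).
completed = {
    "kneaddata": {"S1", "S2"},
    "metaphlan": {"S1"},
}
status = pipeline_status(["S1", "S2", "S3"], completed)
# status["S2"] → {"kneaddata": True, "metaphlan": False}
```

The same table drives the figures mentioned above: samples with a step marked `False` are exactly the ones missing from that step's QC summary.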

vanjakle commented 1 year ago

Excellent start. And I like the list of the must-dos and the rest. I would also love to have a master document that points to where things are, how to find them, and how to access them. I also like moving to the wiki.