Open kescobo opened 1 year ago
I just want to confirm what the end product is supposed to look like: I think what you're trying to do is have a repository of general computational needs of the lab which includes both documentation and the code required to achieve the things you indicate with !! So instead of having a repo for ECHO and repeating all the sequencing stuff between ECHO and LEAP, you're trying to do the opposite, where there is a sequencing repo that users of both projects can use.
For the metadata subheading: If it's not already in your plan - we should definitely have a folder of derived metadata in this shared space. For example, the table you generated to get breastfeeding info should be stored here, and stratified by user too (as in a BF Kevin version and a Prioty version), especially because mine used manual curation and we can think of it as a type of 'raw' since it's not fully code generated.
For the analysis subheading: I'm unclear on what you're trying to achieve that's different from the purpose of the bio-bakery nextflow - if we already have a separate repo for pipelines doesn't it make sense to have just one big analysis repo? Or is this meant to be more documentation heavy?
> I think what you're trying to do is have a repository of general computational needs of the lab which includes both documentation and the code required to achieve the things you indicate with !!
Yes. Probably more than one repo in the end, but one central place that aggregates it.
> If it's not already in your plan - we should definitely have a folder of derived metadata in this shared space
Yeah - I would consider the exports we get from the various sites as "raw data", and then any manipulations / processing is a "data product". Perhaps we should have a way of tracking what manipulations are done, version control, etc. as well, but the key thing is that there's some way of discovering where it lives, more than having a particular place where it lives, I think.
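One way to make data products discoverable without fixing where they live is a small append-only manifest. The following is only a sketch, not something that exists in any lab repo: the function names, the JSON-lines format, and the record fields are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_product(manifest: Path, product: Path, raw_sources: list[str],
                     created_by: str, method: str) -> dict:
    """Append one record describing a data product to a JSON-lines manifest.

    All field names here are hypothetical; adjust them to whatever the lab
    actually wants to track.
    """
    record = {
        "path": str(product),
        "sha256": sha256_of(product),       # detects later edits to the file
        "raw_sources": raw_sources,         # raw exports this was derived from
        "created_by": created_by,
        "method": method,                   # e.g. "script" or "manual curation"
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    with manifest.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def find_products(manifest: Path, raw_source: str) -> list[dict]:
    """Return every registered product derived from the given raw source."""
    with manifest.open(encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    return [r for r in records if raw_source in r["raw_sources"]]
```

Because each record stores the path rather than dictating it, two versions of the same derived table (for example one code-generated and one manually curated) can sit anywhere and still be found by asking which products came from a given raw export.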
> I'm unclear on what you're trying to achieve that's different from the purpose of the bio-bakery nextflow
The nextflow workflow just runs the biobakery tools. I'd also like to have a bit more automation to do things like syncing across the various machines, tracking progress for samples in Airtable, logging file paths, etc.
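As a rough illustration of the path-logging and syncing part, something like the following could build a per-machine file inventory and report what one machine has that another lacks. It is a sketch under assumptions: the helper names are made up, and the rule that a file name starts with the sample ID followed by an underscore is hypothetical.

```python
import csv
from pathlib import Path

FIELDS = ["machine", "sample", "path", "bytes"]


def inventory_files(root: Path, machine: str) -> list[dict]:
    """Walk `root` and return one row per file found under it."""
    rows = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rows.append({
            "machine": machine,
            # ASSUMPTION: file names look like "<sample>_<anything>"
            "sample": path.name.split("_", 1)[0],
            "path": str(path.relative_to(root)),
            "bytes": path.stat().st_size,
        })
    return rows


def write_inventory(rows: list[dict], out_csv: Path) -> None:
    """Write inventory rows to a CSV file so they can be shared or diffed."""
    with out_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def missing_on(rows_a: list[dict], rows_b: list[dict]) -> set[str]:
    """Relative paths present in inventory A but absent from inventory B."""
    return {r["path"] for r in rows_a} - {r["path"] for r in rows_b}
```

Running `inventory_files` on each machine and comparing the results with `missing_on` gives a list of candidate files to sync; the actual transfer would still be done by whatever tool the lab prefers.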
I think that if you want some "quick products" of this, you should definitely try to implement easy ways to get sequence processing metadata and QC variables. Things like, given a collection of sample identifiers, knowing which of those went through each part of the pipeline, and figures related to that (for example, the read_depth, richness, number of human reads excluded).
Excellent start. And I like the list of the must-dos and the rest. I would also love to have a master document that points to where things are, how to find them, and how to access them. I also like moving to wiki.
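The "which samples went through which step" question could be answered by checking for each step's expected output file. A minimal sketch, assuming every stage writes one file per sample into a single output directory; the stage names and file suffixes below are placeholders, not the real biobakery output names.

```python
from pathlib import Path

# ASSUMPTION: hypothetical stage -> output-file suffix mapping.
STAGE_SUFFIXES = {
    "kneaddata": "_kneaddata.fastq.gz",
    "metaphlan": "_profile.tsv",
    "humann": "_genefamilies.tsv",
}


def stage_status(sample_ids: list[str], output_dir: Path) -> dict[str, dict[str, bool]]:
    """For each sample, record whether each stage's output file exists."""
    return {
        sample: {
            stage: (output_dir / f"{sample}{suffix}").is_file()
            for stage, suffix in STAGE_SUFFIXES.items()
        }
        for sample in sample_ids
    }


def incomplete(status: dict[str, dict[str, bool]]) -> dict[str, list[str]]:
    """Map each unfinished sample to the stages it is still missing."""
    return {
        sample: [stage for stage, done in stages.items() if not done]
        for sample, stages in status.items()
        if not all(stages.values())
    }
```

The resulting table of booleans is also a natural starting point for the QC figures mentioned above, since per-sample values such as read depth could be joined onto it by sample identifier.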
Overview
We need a home for programmatic interaction with VKC lab data, including data stored on Airtable, AWS, and physical hard drives attached to lab computers like hopper and ada. This issue describes a basic outline of desired functionality and a punch-list of possible / desired features that a package or packages with such functionality would include.
Scope / need
Right now, we have separate repositories for different projects like Resonance, LEAP, and Galapagos, but all of these projects have many overlapping needs, such as querying the Airtable database, locating raw sequences or data products from various analyses. We also have the biobakery-nextflow repo containing workflows for running metagenomic sequencing data on AWS and the engaging cluster, and we need something similar for running amplicon and eventually ONT sequencing data.
I have also recently been working on unifying the use of various tools (eg metaphlan, QIIME) with compute containers. Managing and documenting all of this is necessary. Maybe using the wiki functionality?
Bikeshedding the name
This is going to go beyond just auditing sequencing files, so maybe it needs a different name? Realistically, many of the functions may come from separate packages / repos, but we'll want a central location to start from.
DataAudit.jl
DataManagementVKCDataComputingResources
Feature ideas / needs
Sequencing files
hopper, ada
engaging
Metadata / airtable
Analysis
For use in analysis repos
Lab documentation
Web API
Others?!?
Final thoughts
Please respond to this issue with comments/suggestions. I will add them here, and eventually split these things off as issues / PRs.