Closed holtgrewe closed 3 years ago
RNA-Seq (separately from MTC)
DKFZ sequencing to SODAR is another. I get many projects (bulk mRNA-seq, WGS & WES, single-cell RNA, cancer or not) whose data has been sequenced there. The good thing is that DKFZ provides several metadata files with a relatively fixed format that describe the libraries.
Manuela & I are looking at it. The plan is to:
The user should then only create an investigation & sample Isa-Tab
Yes, that's an important issue. My problem with DKFZ data is that the read files do not follow the bcl2fastq convention sample_S[0-9]*_L00[0-9]_[IR][123]_001.fastq.gz but something else, and I had to rename them because cellranger doesn't accept other formats.
Yes, we also have renaming steps for the DKTK-Master, but they are hidden in other documentation steps. As the renaming comes up with all data from Heidelberg, it should probably go into cubi-tk.
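As a rough illustration of what such a renaming step could look like, here is a minimal Python sketch. It only covers the target side, i.e. the bcl2fastq pattern quoted above. The DKFZ naming scheme is not spelled out in this thread, so the sample name, sample number, lane and read are taken as plain arguments; extracting them from the DKFZ metadata files is left open. The function names `is_bcl2fastq_name` and `bcl2fastq_name` are hypothetical and not part of cubi-tk.

```python
import re

# Pattern quoted in the discussion:
#   sample_S[0-9]*_L00[0-9]_[IR][123]_001.fastq.gz
BCL2FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_S(?P<number>[0-9]*)_L00(?P<lane>[0-9])"
    r"_(?P<read>[IR][123])_001\.fastq\.gz$"
)


def is_bcl2fastq_name(filename: str) -> bool:
    """Return True if the file name already follows the bcl2fastq convention."""
    return BCL2FASTQ_RE.match(filename) is not None


def bcl2fastq_name(sample: str, sample_number: int, lane: int, read: str) -> str:
    """Build a bcl2fastq-style file name from its components.

    Hypothetical helper: how the components are obtained from the
    DKFZ metadata is an assumption left to the caller.
    """
    if not 0 <= lane <= 9:
        raise ValueError(f"lane must be a single digit, got {lane}")
    if re.fullmatch(r"[IR][123]", read) is None:
        raise ValueError(f"read must be one of I1-I3 or R1-R3, got {read!r}")
    return f"{sample}_S{sample_number}_L00{lane}_{read}_001.fastq.gz"


if __name__ == "__main__":
    new_name = bcl2fastq_name("tumor1", 1, 2, "R1")
    print(new_name, is_bcl2fastq_name(new_name))
```

A real implementation would additionally need a mapping from the DKFZ file names to these components and would probably create symlinks or copies rather than renaming the delivered files in place.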
Also, I get more and more panel sequencing data. And they come in different flavors: starting from fastq-files up to finished projects where only the vcf needs to go into cbioportal.
Improve usability and documentation of setup calls:
snappy-start-project
snappy-start-step
snappy-refresh-step
Has everything been mentioned in the comments now?
Is there a need to also document multi-assay projects? At least make sure that all these use cases also work for projects with multiple assays.
single-cell projects have subcases (RNAseq vs. ATACseq vs. Immune profiling), but that's not a separate issue
Is it possible to enumerate the major multi-assay projects? We will most probably end up with enabling generic support but having something concrete would be useful. From cancer we have:
for single-cell cancer there's no clear pattern yet, but it would be good to make sure that it's always possible to specify the relevant assay
Cancer:
For Cancer (but more mid-long term):
For metabolomics and proteomics, there are currently no commonly repeating specific use cases (such as cancer). Study objectives are rather broad. However, the methods (i.e. different mass spec technologies applied to cover different metabolites, pathways etc.) are getting more diverse, and they are more often applied in combination. So from my point of view, metabolomics and proteomics are more like generic use cases.
Use cases:
The potential here is for now in supporting different assay types rather than study types. Thus, in particular for metabolomics (and other multi-omics studies), multi-assay support becomes more relevant, i.e. either creating multi-assay ISA-tabs directly or adding assays later on, as well as assay selection in later processing (e.g. when annotating).
For proteomics, I was approached just last week about their plan to integrate bigger studies into SODAR (>1000 samples). I will discuss with them whether this is a regular thing and whether they would be able to provide standardized metadata. So we might actually flesh out a more specific use case sooner rather than later.
To elaborate on Mathias' comments: for bulk mRNA at least, I have received many generic studies, mostly small scale (up to 30 samples). For these studies, creating investigation & sample Isa-tab files is difficult, error-prone & time-consuming if we want to enforce F.A.I.R. data. In particular, the choice of ontologies to describe the following examples is quite ambiguous to me:
There are many other aspects of a F.A.I.R. description of the data which require consistent choices across different studies.
I don't think it is realistic to expect that the P.I. will be able to provide us with a F.A.I.R., Isa-tab ready description of her dataset. So ideally cubi-tk would have some guidelines & templates to cover experimental design, species & strain, developmental stage & age, cell lines & cell type, tissue & organ, disease, genetic modification (knock-in, knock-out, ...), perhaps others...
Disclaimer: maybe these templates already exist; I am not quite sure how to find or use them (sorry).
Currently we have:
What I see as missing:
Follow-up will be to create a ticket to document each use case.