Collect the major use cases of cubi-tk

holtgrewe commented 3 years ago

Currently we have:

(Germline) Exomes
Single-Cell (RNA-seq)

What I see as missing:

Germline Genomes
Molecular Tumor Conference (WES+RNA-seq)

Follow-up will be to create a ticket to document each use case.

january3 commented 3 years ago

RNA-Seq (separately from MTC)

ericblanc20 commented 3 years ago

DKFZ sequencing to SODAR is another. I get many projects (bulk mRNA-seq, WGS & WES, single-cell RNA, cancer or not) whose data have been sequenced there. The good thing is that DKFZ provides several metadata files in a relatively fixed format that describe the libraries.

Manuela & I are looking at it. The plan is to:

Use the metafile information automatically downloaded with the data to generate an assay ISA-Tab file,
Use the same metafile information to create a sample sheet (?), and
Upload the raw data (excluding the reads left unassigned by the demultiplexing step).

The user should then only have to create the investigation & sample ISA-Tab files.
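The plan above can be sketched roughly as follows. This is only a minimal illustration, not the DKFZ format or existing cubi-tk code: the column names `SAMPLE_NAME`, `LIBRARY_TYPE` and `FASTQ_FILE`, and the `Undetermined` marker for unassigned reads, are assumptions made up for the example and would have to be replaced with whatever the real metadata files contain.

```python
import csv
import io

# Hypothetical marker for reads left unassigned by demultiplexing (assumption).
UNASSIGNED_MARKER = "undetermined"


def read_metadata(handle):
    """Parse a tab-separated metadata file into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def is_unassigned(row):
    """Return True if the row describes unassigned reads (assumed naming)."""
    return UNASSIGNED_MARKER in row["FASTQ_FILE"].lower()


def sample_sheet_rows(rows):
    """Group FASTQ files by (sample, library type), skipping unassigned reads."""
    grouped = {}
    for row in rows:
        if is_unassigned(row):
            continue
        key = (row["SAMPLE_NAME"], row["LIBRARY_TYPE"])
        grouped.setdefault(key, []).append(row["FASTQ_FILE"])
    return [(sample, lib, sorted(files)) for (sample, lib), files in sorted(grouped.items())]


# Tiny in-memory example with made-up content.
EXAMPLE = (
    "SAMPLE_NAME\tLIBRARY_TYPE\tFASTQ_FILE\n"
    "A\tWES\ta_R2.fastq.gz\n"
    "A\tWES\ta_R1.fastq.gz\n"
    "NA\tWES\tUndetermined_R1.fastq.gz\n"
)
SHEET = sample_sheet_rows(read_metadata(io.StringIO(EXAMPLE)))
```

The same grouped rows could feed both the assay ISA-Tab generation and the upload step, so that the unassigned reads are filtered in one place only.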

bobermayer commented 3 years ago

Yes, that's an important issue. My problem with DKFZ data is that the read files do not follow the bcl2fastq convention sample_S[0-9]*_L00[0-9]_[IR][123]_001.fastq.gz but something else, and I had to rename them because cellranger doesn't accept other formats.
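A minimal sketch of what a renaming helper could check and produce, assuming only the bcl2fastq pattern quoted above. The function names are hypothetical and not part of cubi-tk, and how sample, lane and read are extracted from the DKFZ names is deliberately left out because that naming scheme is not described here.

```python
import re

# The bcl2fastq convention quoted above, with the dots escaped and the
# individual parts captured as named groups.
BCL2FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_S(?P<number>[0-9]+)_L00(?P<lane>[0-9])"
    r"_(?P<read>[IR][123])_001\.fastq\.gz$"
)

VALID_READS = {"R1", "R2", "R3", "I1", "I2", "I3"}


def is_bcl2fastq_name(filename):
    """Return True if the file name already follows the convention."""
    return BCL2FASTQ_RE.match(filename) is not None


def bcl2fastq_name(sample, number, lane, read):
    """Build a conforming file name from its parts (hypothetical helper)."""
    if read not in VALID_READS:
        raise ValueError(f"unexpected read label: {read!r}")
    if not 0 <= lane <= 9:
        raise ValueError(f"lane must be a single digit, got {lane!r}")
    return f"{sample}_S{number}_L00{lane}_{read}_001.fastq.gz"
```

A rename step would then only touch files for which `is_bcl2fastq_name` returns `False`, and could verify each new name with the same check before moving anything.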

mbenary commented 3 years ago

Yes, we also have renaming steps for the DKTK-Master, but they are hidden in other documentation steps. As the renaming comes up with all data from Heidelberg, it should probably go into cubi-tk.

mbenary commented 3 years ago

Also, I get more and more panel sequencing data, and they come in different flavors: from FASTQ files up to finished projects where only the VCF needs to go into cBioPortal.

eudesbarbosa commented 3 years ago

Improve usability and documentation of setup calls:

snappy-start-project
snappy-start-step
snappy-refresh-step

holtgrewe commented 3 years ago

Has everything been mentioned in the comments now?

bobermayer commented 3 years ago

Is there a need to also document multi-assay projects? At the least, we should make sure that all these use cases also work for projects with multiple assays.

bobermayer commented 3 years ago

Single-cell projects have subcases (RNA-seq vs. ATAC-seq vs. immune profiling), but that's not a separate issue.

holtgrewe commented 3 years ago

Is it possible to enumerate the major multi-assay projects? We will most probably end up enabling generic support, but having something concrete would be useful. From cancer we have:

exome T/N
RNA-seq T

bobermayer commented 3 years ago

For single-cell cancer there's no clear pattern yet, but it would be good to make sure that it's always possible to specify the relevant assay.

ericblanc20 commented 3 years ago

Cancer:

exome or WGS T/N
RNA-seq T
Possibly N & multiple Ts (DNA + RNA). Not often used, but might be more cases like that in the future.
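To make the multi-assay cases above a bit more concrete, here is a small sketch of how a project with several assays and an explicit assay selection could be modelled. The class and field names are hypothetical and do not reflect the current cubi-tk or SODAR data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Assay:
    """One assay of a project, e.g. exome on tumor and normal samples."""

    name: str
    measurement: str
    sample_types: Tuple[str, ...]


@dataclass
class Project:
    """A project that may carry one or several assays."""

    title: str
    assays: List[Assay] = field(default_factory=list)

    def select_assay(self, name: Optional[str] = None) -> Assay:
        """Return the requested assay; refuse to guess when there are several."""
        if name is None:
            if len(self.assays) == 1:
                return self.assays[0]
            raise ValueError("project has several assays, please name one")
        for assay in self.assays:
            if assay.name == name:
                return assay
        raise KeyError(name)


# The cancer case from the lists above: exome T/N plus RNA-seq T.
cancer = Project(
    title="cancer example",
    assays=[
        Assay("exome", "exome sequencing", ("T", "N")),
        Assay("rna", "RNA-seq", ("T",)),
    ],
)
```

The point of `select_assay` is that single-assay projects keep working without extra arguments, while multi-assay projects force the user to specify the relevant assay instead of silently picking the first one.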

mbenary commented 3 years ago

For Cancer (but more mid-long term):

adding support for proteomics data

mkuhring commented 3 years ago

For metabolomics and proteomics, there are currently no commonly repeating specific use cases (such as cancer). Study objectives are rather broad. However, methods (i.e. the different mass spec technologies applied to cover different metabolites, pathways etc.) are getting more diverse now, and they are more often applied in combination. So from my point of view, metabolomics and proteomics are more like generic use cases.

Use cases:

Metabolomics (mainly MS assays)
Proteomics (only MS assays, as far as I know)

The potential here is, for now, in supporting different assay types rather than study types. Thus, in particular for metabolomics (and other multi-omics studies), multi-assay support becomes more relevant, i.e. either creating multi-assay ISA-Tabs directly or adding assays later on, as well as assay selection in later processing (e.g. when annotating).

For proteomics, I was approached just last week about a plan to integrate bigger studies into SODAR (>1000 samples). I will discuss with them whether this is a regular thing and whether they would be able to provide standardized metadata. So we might actually flesh out a more specific use case sooner rather than later.

ericblanc20 commented 3 years ago

To elaborate on Mathias' comments: for bulk mRNA at least, I have received many generic studies, mostly small scale (up to 30 samples). For these studies, creating investigation & sample ISA-Tab files is difficult, error-prone & time-consuming if we want to enforce F.A.I.R. data. In particular, the choice of ontologies to describe the following examples is quite ambiguous to me:

sub-categories from experimental design EFO:0001426 or OBI:0500000
how to define a mouse strain: use OGG, Consomic Mouse Strain, specific strain?

There are many other aspects of a F.A.I.R. description of the data which require consistent choices across different studies.

I don't think it is realistic to expect that the P.I. will be able to provide us with a F.A.I.R., Isa-tab ready description of her dataset. So ideally cubi-tk would have some guidelines & templates to cover experimental design, species & strain, developmental stage & age, cell lines & cell type, tissue & organ, disease, genetic modification (knock-in, knock-out, ...), perhaps others...
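As a rough illustration of what such a template could enforce, the sketch below lists the categories named above and reports which ones a study description has left empty. It is only an assumption of how a checklist might look: the `not applicable` convention and the function name are made up, and no ontology terms are chosen here.

```python
# Categories taken from the comment above.
TEMPLATE_CATEGORIES = (
    "experimental design",
    "species & strain",
    "developmental stage & age",
    "cell lines & cell type",
    "tissue & organ",
    "disease",
    "genetic modification",
)

# Hypothetical convention: categories that do not apply to a study are
# filled in explicitly rather than left blank.
NOT_APPLICABLE = "not applicable"


def missing_categories(description):
    """Return the template categories that are absent or blank."""
    return [
        category
        for category in TEMPLATE_CATEGORIES
        if not str(description.get(category, "")).strip()
    ]


# Made-up example of a partially filled description.
example = {
    "experimental design": "case control",
    "species & strain": "Mus musculus",
    "disease": NOT_APPLICABLE,
}
todo = missing_categories(example)
```

Requiring an explicit `not applicable` instead of an empty field makes it visible that a category was considered, which is one way to keep choices consistent across studies.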

Disclaimer: maybe these templates already exist; I am not quite sure how to find or use them (sorry).

bihealth / cubi-tk

Collect the major use cases of cubi-tk #36