Repertoire metadata for Cell objects

bcorrie commented 3 years ago

Started this as a separate issue based on discussion here: https://github.com/airr-community/airr-standards/issues/417#issuecomment-733294309

> Actually @schristley, @bussec, @javh, would the above be correct? If you had a 10X single-cell study with both Cell data (cellranger count) and Rearrangement data (cellranger vdj), would you typically have different SampleProcessing objects for the two processes in a single Repertoire?

> No, I don't think so. The pcr_target_locus is for the VDJ protocol, not for gene expression (RNA-seq), which generally does not target any specific loci. Likewise, the SampleProcessing object is designed for VDJ experimental protocols. For RNA-seq, there is already a separate standard, MINSEQE.

When storing Cell objects associated with a Repertoire, what repertoire metadata needs to be stored and how do we do that? My question above was whether there should be a separate SampleProcessing for the Cells. Although I am far from being an expert, it seems like there might be some different CellProcessing, NucleicAcidProcessing, SequencingRun, RawSequenceData, or DataProcessing for the RNA-seq processing that generates Cell objects?

If nothing else, I would assume that there would be a different DataProcessing so we could capture that the Cells were produced with "software_versions" : "cellranger v4.0" etc and that "sequencing_kit" : "10X Chromium blah blah" etc?
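To make the idea concrete, here is a rough sketch of a Repertoire carrying two DataProcessing entries, with a Cell pointing at the one that produced it. The field names (`data_processing_id`, `software_versions`, `data_processing_protocols`) follow my reading of the AIRR schema and should be checked against it; the IDs and values are hypothetical placeholders, and `processing_for` is a helper invented for this example.

```python
# Hypothetical Repertoire fragment. Field names are assumed from the AIRR
# schema; all IDs and values are placeholders, not real study metadata.
repertoire = {
    "repertoire_id": "rep-1",
    "data_processing": [
        {
            "data_processing_id": "dp-vdj",
            "primary_annotation": True,
            "software_versions": "cellranger v4.0",
            "data_processing_protocols": "cellranger vdj",
        },
        {
            "data_processing_id": "dp-count",
            "primary_annotation": False,
            "software_versions": "cellranger v4.0",
            "data_processing_protocols": "cellranger count",
        },
    ],
}

# A Cell record that points back at the processing step that produced it.
cell = {
    "cell_id": "cell-0001",
    "repertoire_id": "rep-1",
    "data_processing_id": "dp-count",
}


def processing_for(cell, repertoire):
    """Return the DataProcessing entry the cell references, or None."""
    for dp in repertoire["data_processing"]:
        if dp["data_processing_id"] == cell["data_processing_id"]:
            return dp
    return None


dp = processing_for(cell, repertoire)
print(dp["data_processing_protocols"], "/", dp["software_versions"])
```

With something like this, the Cells from the gene expression run and the Rearrangements from the VDJ run could each be traced to their own processing record within the same Repertoire.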

Looking at MINSEQE, it seems (based on my limited experience) that MiAIRR covers much of what is in it. MINSEQE does not appear to be particularly precise: it suggests in words, rather than as a specification, what one should gather. But it does seem to map fairly well to the MiAIRR objects...

bcorrie commented 3 years ago

I suppose I am saying above that it seems like we might be able to use the MiAIRR fields to be MINSEQE compliant. And if we can, I assume we definitely should...

scharch commented 3 years ago

Hmmm, I don't think so. My understanding is that we expect transcriptomic data to be linked via DOI. In that case, can't we just say that we expect the DOI to resolve to a MINSEQE-compliant dataset that has all the appropriate sample and data processing information? Seems like an SEP...

bcorrie commented 3 years ago

@scharch am I missing something about MINSEQE?

For study/subject/sample info it says:

1) The description of the biological system, samples, and the experimental variables being studied:
“compound” and “dose” in dose-response experiments or “antibody” in ChIP-Seq experiments, the organism, tissue, and the treatment(s) applied.
4) General information about the experiment and sample-data relationships:
a summary of the experiment and its goals, contact information, any associated publication, and a table specifying sample-data relationships.

It then points to a PDF for more details, which then says:

1. The description of the biological system, samples, and the experimental variables being studied.
Essential sample annotation, including the experimental factors and their values, must be given. Experimental factors are the key experimental variables, e.g. “compound” and “dose” in dose-response experiments or “antibody” in ChIP-Seq experiments. In addition to experimental factor values, essential information about the biological system from which samples were taken must be given, e.g. the organism, strain or cultivar (if known and if appropriate), the organism part or tissue, and what treatment(s) was/were applied.

4. General information about the experiment and sample-data relationships 
General information about the overall study includes a summary of the experiment and its goals, contact information, and any associated publication. Part of this description should be a table specifying sample-data relationships, i.e. which sample has led to which raw data file or which data element in the processed data files.

That is it... I can't find anything more explicit, detailed, or precise. Those are a couple of loosely defined paragraphs that describe what I understand to be our relatively precise Study, Subject, and Sample objects. The same goes for how samples were prepared for sequencing:

5. Essential experimental and data processing protocols:
how the nucleic acid samples were isolated, purified and processed prior to sequencing, a summary of the instrumentation used, library preparation strategy, labelling and amplification methodologies, alignment algorithms and data filtering plus data processing & analysis protocols.

No spec for anything, just a paragraph... What am I missing?
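For comparison, the "table specifying sample-data relationships" that MINSEQE item 4 asks for could probably be derived from repertoire metadata. The sketch below assumes field names (`subject_id`, `sample_id`, `sequencing_files.filename`) as I understand them from the AIRR schema; the records are made-up placeholders and `sample_data_table` is a helper invented for this example.

```python
# Hypothetical repertoire metadata. Field names are assumed from the AIRR
# schema; IDs and filenames are placeholders.
repertoires = [
    {
        "repertoire_id": "rep-1",
        "subject": {"subject_id": "subj-A"},
        "sample": [
            {"sample_id": "s1",
             "sequencing_files": {"filename": "s1_R1.fastq.gz"}},
            {"sample_id": "s2",
             "sequencing_files": {"filename": "s2_R1.fastq.gz"}},
        ],
    },
]


def sample_data_table(repertoires):
    """Flatten repertoires into (repertoire, subject, sample, raw file) rows."""
    rows = []
    for rep in repertoires:
        subject_id = rep["subject"]["subject_id"]
        for sample in rep["sample"]:
            rows.append((
                rep["repertoire_id"],
                subject_id,
                sample["sample_id"],
                sample["sequencing_files"]["filename"],
            ))
    return rows


for row in sample_data_table(repertoires):
    print("\t".join(row))
```

Each row states which sample led to which raw data file, which is about all the MINSEQE text asks for on that point.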

scharch commented 3 years ago

@bcorrie I hadn't bothered to click through, so I don't know if you're missing anything, but probably not. Still an SEP - RNAseq is not AIRRseq...

schristley commented 3 years ago

> When storing Cell objects associated with a Repertoire, what repertoire metadata needs to be stored and how do we do that? My question above was whether there should be a separate SampleProcessing for the Cells. Although I am far from being an expert, it seems like there might be some different CellProcessing, NucleicAcidProcessing, SequencingRun, RawSequenceData, or DataProcessing for the RNA-seq processing that generates Cell objects?

I'm confused on whether you mean Cell objects or gene expression data from cellranger count, or both. You get Cell objects from the VDJ process with cellranger vdj. In this case, I'd expect that both Cell and Rearrangement data are described with a single SampleProcessing and a single DataProcessing. For my data, my thought was to use the same SampleProcessing and DataProcessing objects, i.e. not create a second one, if the 10x protocols also measured gene expression.

You also get Cell objects from cellranger count when processing gene expression data. I think using a separate SampleProcessing record to describe that "sounds" like an okay idea, but it might be confusing for users who expect AIRR SampleProcessing records to be about AIRR-seq experimental protocols. I don't think we should do that as a matter of course.

The larger question is how to handle multi-omic studies. In general, I don't think we should immediately assume that the AIRR data structures can be used for non-AIRR-seq data. If we wanted to use SampleProcessing to store experimental metadata for non-AIRR-seq protocols, my suggestion is not to use our Sample, CellProcessing, etc., objects as is. Instead, use OpenAPI's oneOf and then define new schema objects specialized to that omic. If we are lucky, we can point to a schema that a standards group has already defined for that omic data.
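As a rough illustration of the oneOf idea, the sketch below hand-rolls the "exactly one alternative must match" rule in plain Python. It is not a real OpenAPI validator, and the alternatives are reduced to sets of required keys. `AirrSeqProcessing`, `RnaSeqProcessing`, and `library_strategy` are names invented for this example, not part of the AIRR schema.

```python
# Two alternative processing schemas, each reduced to its required keys.
# The schema names and "library_strategy" are hypothetical.
SAMPLE_PROCESSING_ONE_OF = {
    "AirrSeqProcessing": {"sample_id", "pcr_target"},
    "RnaSeqProcessing": {"sample_id", "library_strategy"},
}


def matching_schemas(record):
    """Names of the alternatives whose required keys are all present."""
    return [
        name
        for name, required in SAMPLE_PROCESSING_ONE_OF.items()
        if required.issubset(record)
    ]


def validate_one_of(record):
    """oneOf semantics: exactly one alternative must match the record."""
    matches = matching_schemas(record)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one matching schema, got {matches}")
    return matches[0]


vdj = {"sample_id": "s1", "pcr_target": [{"pcr_target_locus": "IGH"}]}
gex = {"sample_id": "s1", "library_strategy": "RNA-seq"}
print(validate_one_of(vdj))
print(validate_one_of(gex))
```

A record that satisfies both alternatives, or neither, is rejected. That is the property that would keep AIRR-seq and non-AIRR-seq processing records from being mixed up in the same object.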

scharch commented 1 year ago

closing as SEP

airr-community / airr-standards

Repertoire metadata for Cell objects #485