Closed bcorrie closed 1 year ago
I suppose I am saying above that it seems like we might be able to use the MiAIRR fields to be MINSEQE compliant. And if we can, we definitely should I assume...
Hmmm, I don't think so. My understanding is that we expect transcriptomic data to be linked via DOI. In that case, can't we just say that we expect the DOI to resolve to a MINSEQE-compliant dataset that has all the appropriate sample and data processing information? Seems like an SEP...
@scharch am I missing something about MINSEQE?
For study/subject/sample info it says:
1) The description of the biological system, samples, and the experimental variables being studied:
“compound” and “dose” in dose-response experiments or “antibody” in ChIP-Seq experiments, the organism, tissue, and the treatment(s) applied.
4) General information about the experiment and sample-data relationships:
a summary of the experiment and its goals, contact information, any associated publication, and a table specifying sample-data relationships.
It then points to a PDF for more details, which then says:
1. The description of the biological system, samples, and the experimental variables being studied.
Essential sample annotation, including the experimental factors and their values, must be given. Experimental factors are the key experimental variables, e.g. “compound” and “dose” in dose-response experiments or “antibody” in ChIP-Seq experiments. In addition to experimental factor values, essential information about the biological system from which samples were taken must be given, e.g. the organism, strain or cultivar (if known and if appropriate), the organism part or tissue, and what treatment(s) was/were applied.
4. General information about the experiment and sample-data relationships
General information about the overall study includes a summary of the experiment and its goals, contact information, and any associated publication. Part of this description should be a table specifying sample-data relationships, i.e. which sample has led to which raw data file or which data element in the processed data files.
That is it... I can't find anything else more explicit, detailed, or precise. That is a couple of loosely defined paragraphs that describe what I understand to be our relatively precise `Study`, `Subject`, and `Sample` objects. The same is true for how samples were prepared for sequencing:
5. Essential experimental and data processing protocols:
how the nucleic acid samples were isolated, purified and processed prior to sequencing, a summary of the instrumentation used, library preparation strategy, labelling and amplification methodologies, alignment algorithms and data filtering plus data processing & analysis protocols.
No spec for anything, just a paragraph... What am I missing?
@bcorrie I hadn't bothered to click through, so I don't know if you're missing anything, but probably not. Still an SEP - RNAseq is not AIRRseq...
When storing `Cell` objects associated to a `Repertoire` what repertoire metadata needs to be stored and how do we do that? My question above was whether there should be a separate `SampleProcessing` for the `Cells`. Although far from being an expert, it seems like there might be some different `CellProcessing`, `NucleicAcidProcessing`, `SequencingRun`, `RawSequenceData`, or `DataProcessing` for RNA-seq processing that generates `Cell` objects?
I'm confused on whether you mean `Cell` objects or gene expression data from `cellranger count`, or both. You get `Cell` objects from the VDJ process with `cellranger vdj`. In this case, I'd expect that both `Cell` and `Rearrangement` data are described with a single `SampleProcessing` and a single `DataProcessing`. For my data, my thought was to use the same `SampleProcessing` and `DataProcessing` objects, i.e. not create a second one, if the 10x protocols also measured gene expression.
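A minimal sketch of that single-record arrangement, written as plain Python dicts. The identifier values (`rep-1`, `sp-1`, `dp-1`, and so on) are made up for illustration, and the field names are assumed to follow the AIRR schema's linking fields; check them against the schema version in use before relying on them.

```python
# Hypothetical repertoire with exactly one sample processing record and one
# data processing record, shared by both Cell and Rearrangement data.
repertoire = {
    "repertoire_id": "rep-1",
    "sample": [{"sample_processing_id": "sp-1"}],
    "data_processing": [{"data_processing_id": "dp-1"}],
}

# Records as they might come out of a VDJ run; all values are illustrative.
cells = [
    {"cell_id": "cell-1", "repertoire_id": "rep-1", "data_processing_id": "dp-1"},
]
rearrangements = [
    {
        "sequence_id": "seq-1",
        "cell_id": "cell-1",
        "repertoire_id": "rep-1",
        "data_processing_id": "dp-1",
    },
]


def shares_single_processing(repertoire, *record_sets):
    """Return True if the repertoire has one sample processing record and one
    data processing record, and every given record points at them."""
    sample_ids = {s["sample_processing_id"] for s in repertoire["sample"]}
    dp_ids = {d["data_processing_id"] for d in repertoire["data_processing"]}
    if len(sample_ids) != 1 or len(dp_ids) != 1:
        return False
    for records in record_sets:
        for record in records:
            if record["repertoire_id"] != repertoire["repertoire_id"]:
                return False
            if record["data_processing_id"] not in dp_ids:
                return False
    return True


print(shares_single_processing(repertoire, cells, rearrangements))
```

The check only looks at the linking identifiers, so it says nothing about whether the metadata inside the shared records is adequate for both data types.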
You also get `Cell` objects from `cellranger count` when processing gene expression data. I think using a separate `SampleProcessing` record to describe that "sounds" like an okay idea, but it might be confusing for users who expect AIRR `SampleProcessing` records to be about AIRR-seq experimental protocols. I don't think we should do that as a matter of course.
The larger question is how to handle multi-omic studies. In general, I don't think we should immediately assume that the AIRR data structures can be used for non-AIRR-seq data. If we wanted to use `SampleProcessing` as a way to store experimental metadata for non-AIRR-seq protocols, my suggestion is not to use our `Sample`, `CellProcessing`, etc., objects as is. Instead, use OpenAPI's `oneOf` and then describe new schema objects specialized to that omic. If we are lucky, we can point to a schema that a standards group has already defined for that omic data.
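To make the `oneOf` suggestion concrete, here is a rough sketch using a Python dict as a stand-in for the schema fragment, plus a tiny hand-rolled check of the "exactly one branch matches" rule. The branch titles `AIRRSeqProcessing` and `GeneExpressionProcessing` and the field `expression_protocol` are hypothetical names invented for this example, not part of any existing schema, and the matcher only checks required keys rather than doing full OpenAPI validation.

```python
# Stand-in for an OpenAPI schema fragment: a sample processing record must
# match exactly one of the listed branches. Branch names are hypothetical.
SAMPLE_PROCESSING_SCHEMA = {
    "oneOf": [
        {
            "title": "AIRRSeqProcessing",
            "required": ["sample_processing_id", "library_generation_method"],
        },
        {
            "title": "GeneExpressionProcessing",
            "required": ["sample_processing_id", "expression_protocol"],
        },
    ]
}


def matching_branches(record, schema):
    """List the titles of all branches whose required keys are present."""
    return [
        branch["title"]
        for branch in schema["oneOf"]
        if all(key in record for key in branch["required"])
    ]


def validate_one_of(record, schema):
    """Return the single matching branch title, or raise if zero or several match."""
    matches = matching_branches(record, schema)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one matching branch, got {matches}")
    return matches[0]


# Illustrative record for a non-AIRR-seq protocol.
gex_record = {"sample_processing_id": "sp-2", "expression_protocol": "placeholder"}
print(validate_one_of(gex_record, SAMPLE_PROCESSING_SCHEMA))
```

If a standards group already publishes a schema for the other omic, the second branch could in principle reference that schema instead of defining new fields here.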
closing as SEP
Started this as a separate issue based on discussion here: https://github.com/airr-community/airr-standards/issues/417#issuecomment-733294309
When storing `Cell` objects associated to a `Repertoire` what repertoire metadata needs to be stored and how do we do that? My question above was whether there should be a separate `SampleProcessing` for the `Cells`. Although far from being an expert, it seems like there might be some different `CellProcessing`, `NucleicAcidProcessing`, `SequencingRun`, `RawSequenceData`, or `DataProcessing` for RNA-seq processing that generates `Cell` objects?
If nothing else, I would assume that there would be a different `DataProcessing` so we could capture that the Cells were produced with "software_versions" : "cellranger v4.0" etc and that "sequencing_kit" : "10X Chromium blah blah" etc?
In looking at MINSEQE it seems like MiAIRR covers much of what is in MINSEQE - based on my limited experience. It appears that MINSEQE is not particularly precise in that it suggests in words (rather than as a specification) what one should gather, but it does seem to map fairly well to the MiAIRR objects...
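For the separate `DataProcessing` idea raised in this post, a rough sketch of what a second record for the cells might look like. The identifiers (`dp-airr`, `dp-cells`, `sp-airr`, `sp-cells`) are invented for illustration, the placeholder values are copied from the text above, and placing `sequencing_kit` on the sample-side record is an assumption rather than something settled in this thread.

```python
# Hypothetical repertoire with a second sample/data processing pair that
# describes how the Cell objects were produced.
repertoire = {
    "repertoire_id": "rep-1",
    "sample": [
        {"sample_processing_id": "sp-airr"},
        # Assumption: the kit is recorded with the sample-side metadata.
        {"sample_processing_id": "sp-cells", "sequencing_kit": "10X Chromium blah blah"},
    ],
    "data_processing": [
        {"data_processing_id": "dp-airr"},
        {"data_processing_id": "dp-cells", "software_versions": "cellranger v4.0"},
    ],
}


def find_by_id(records, key, value):
    """Return the first record whose `key` equals `value`; raise KeyError if absent."""
    for record in records:
        if record.get(key) == value:
            return record
    raise KeyError(value)


# A Cell record that points at the cell-specific data processing entry.
cell = {"cell_id": "cell-1", "repertoire_id": "rep-1", "data_processing_id": "dp-cells"}
cell_processing = find_by_id(
    repertoire["data_processing"], "data_processing_id", cell["data_processing_id"]
)
print(cell_processing["software_versions"])
```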