Support more than one bioassay per biomaterial for RNA-seq pipeline output loading

arteymix commented 7 months ago

It's currently not possible to import an experiment where there are multiple bioassays for the same biomaterial.

The simplest example of this scenario is a dataset with technical replicates where a given sample has been sequenced multiple times.

There's an inherent subject factor that has to be incorporated in the design. I doubt this is being done automatically.

The current workaround is to import the dataset with the -nomatch flag, which creates a distinct biomaterial for each bioassay, and then create a subject factor.
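
To make the workaround concrete, here is a minimal sketch of the resulting design. It is illustrative Python only, not Gemma code; the function name, the run/sample names and the dictionary layout are all assumptions made for the example.

```python
def split_replicates(bioassay_to_sample):
    """Give each bioassay its own biomaterial and record the shared subject.

    bioassay_to_sample maps a bioassay (sequencing run) name to the sample it
    was sequenced from. All names here are hypothetical.
    """
    design = {}
    for bioassay, sample in sorted(bioassay_to_sample.items()):
        design[bioassay] = {
            # One distinct biomaterial per bioassay, as with the workaround.
            "biomaterial": f"{sample}__{bioassay}",
            # The subject factor ties technical replicates back together.
            "subject": sample,
        }
    return design


runs = {"run1": "sampleA", "run2": "sampleA", "run3": "sampleB"}
design = split_replicates(runs)
```

Here run1 and run2 end up with different biomaterials but the same subject value, which is the information the design needs to account for the replicates.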

There are implications with single cell data because we intend to split a sample into sub-biomaterials. Those sub-biomaterials should share a subject factor.

ppavlidis commented 7 months ago

The title and description of this issue might be confusing since we do support having more than one bioassay per biomaterial.

The problem is importing RNA-seq data from our external pipeline in this situation where there are technical replicates that aren't being pooled. It's unusual to have technical replicates at all (we've seen this probably <10 times ever).

The current implementation in Gemma is geared towards multi-platform microarray data sets. Since GEO lacks any concept of a biomaterial, we have to infer the relationship from sample names (or other meta-data).

We don't like data to be in this state so we use (or create) virtual "combined platforms" for multi-chip microarray setups like Affymetrix HG-U133 A + B, which results in a one-to-one mapping between bioassays and biomaterials. It's a compromise trading off some potential loss of information against the complexity of maintaining a many-to-one relationship, which infects a lot of other code. But these are not technical replicates.
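
As a rough illustration of the "combined platform" idea (a sketch in Python, not how Gemma implements it; the data layout and the probe names are made up):

```python
def combine_platforms(bioassays):
    """Merge all bioassays of a biomaterial into one combined profile.

    Each bioassay is a dict with hypothetical keys 'biomaterial', 'platform'
    and 'values' (probe name -> expression value).
    """
    combined = {}
    for bioassay in bioassays:
        merged = combined.setdefault(bioassay["biomaterial"], {})
        for probe, value in bioassay["values"].items():
            # Prefix with the platform so probes from different chips
            # cannot collide in the combined platform.
            merged[f'{bioassay["platform"]}:{probe}'] = value
    return combined


bioassays = [
    {"biomaterial": "sample1", "platform": "HG-U133A", "values": {"probe_1": 5.2}},
    {"biomaterial": "sample1", "platform": "HG-U133B", "values": {"probe_1": 3.1}},
]
combined = combine_platforms(bioassays)
```

After combining, each biomaterial has exactly one (virtual) bioassay. The A and B chips measure different probes, so nothing is averaged or pooled the way technical replicates would be.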

For single cell data, it's different again but I agree there might be some relevance.

ppavlidis commented 6 months ago

We'll have to see whether addressing this is easier than my original reaction suggested, which was that it would be annoying/difficult/not worthwhile since we so rarely see technical replicates (until now, for some reason!).

In https://github.com/PavlidisLab/GemmaCuration/issues/500 we are trying to see if we can work around it, but if doing it "properly" (i.e., addressing this issue) isn't as bad as I think, we should consider it. I will try to have a look.

One of the key interfaces is ExpressionDataMatrix. Its comment says that technical replicates are not supported, but also that this is "possible" and that "the same BioMaterial can be used in multiple columns (supported implicitly)"; in other words, nothing is stopping it from happening.

So that's good news, but the upstream code may not be compatible with that.
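
To illustrate the kind of incompatibility I mean, here is a toy sketch in Python. It is not the actual ExpressionDataMatrix API; the column layout and function names are hypothetical.

```python
from collections import defaultdict

# Each column is a (bioassay, biomaterial) pair; sampleA was sequenced twice.
columns = [("run1", "sampleA"), ("run2", "sampleA"), ("run3", "sampleB")]


def columns_by_biomaterial(columns):
    """Replicate-aware lookup: biomaterial -> list of column indices."""
    groups = defaultdict(list)
    for index, (_bioassay, biomaterial) in enumerate(columns):
        groups[biomaterial].append(index)
    return dict(groups)


def naive_column_index(columns):
    """Lookup that assumes one column per biomaterial.

    With technical replicates, later columns silently overwrite earlier ones.
    """
    return {biomaterial: index for index, (_bioassay, biomaterial) in enumerate(columns)}


grouped = columns_by_biomaterial(columns)
naive = naive_column_index(columns)
```

The matrix itself holds three columns without complaint, but any code written like `naive_column_index` only ever sees two of them. That is the sort of upstream assumption we would have to hunt down.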

PavlidisLab / Gemma

Support more than one bioassay per biomaterial for RNA-seq pipeline output loading #1052