Open arteymix opened 7 months ago
The title and description of this issue might be confusing since we do support having more than one bioassay per biomaterial.
The problem is importing RNA-seq data from our external pipeline when there are technical replicates that aren't being pooled. It's unusual to have technical replicates at all (we've seen this probably <10 times ever).
The current implementation in Gemma is geared towards multi-platform microarray data sets. Since GEO lacks any concept of a biomaterial, we have to infer the relationship from sample names (or other metadata).
We don't like data to be in this state so we use (or create) virtual "combined platforms" for multi-chip microarray setups like Affymetrix HG-U133 A + B, which results in a one-to-one mapping between bioassays and biomaterials. It's a compromise trading off some potential loss of information against the complexity of maintaining a many-to-one relationship, which infects a lot of other code. But these are not technical replicates.
For single cell data, it's different again but I agree there might be some relevance.
We'll have to see whether addressing this is easier than I originally thought; my first reaction was that it would be annoying/difficult/not worthwhile since we so rarely see technical replicates (until now, for some reason!).
In https://github.com/PavlidisLab/GemmaCuration/issues/500 we are trying to see if we can work around it, but if doing it "properly" (i.e., addressing this issue) isn't as bad as I think, we should consider it. I will try to have a look.
One of the key interfaces is ExpressionDataMatrix, which has a comment saying that technical replicates are not supported, but that it's "possible" and that "the same BioMaterial can be used in multiple columns (supported implicitly)" - meaning there is nothing stopping it from happening.
So that's good news, but the upstream code may not be compatible with that.
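To make the "same BioMaterial in multiple columns" case concrete, here is a minimal Python sketch. It is not Gemma code: `find_technical_replicates` and the run/sample names are hypothetical, and the matrix columns are reduced to plain (bioassay, biomaterial) pairs. It groups columns by biomaterial and reports the biomaterials that have more than one bioassay, which is the kind of check upstream code assuming a one-to-one mapping might need.

```python
from collections import defaultdict


def find_technical_replicates(column_biomaterials):
    """Return {biomaterial: [bioassays]} for biomaterials with more than one column.

    `column_biomaterials` is a list of (bioassay, biomaterial) pairs, one per
    matrix column. Hypothetical helper, for illustration only.
    """
    by_biomaterial = defaultdict(list)
    for bioassay, biomaterial in column_biomaterials:
        by_biomaterial[biomaterial].append(bioassay)
    # Keep only the biomaterials that appear in several columns.
    return {bm: bas for bm, bas in by_biomaterial.items() if len(bas) > 1}


# sampleA was sequenced twice (technical replicates), sampleB once.
columns = [("run1", "sampleA"), ("run2", "sampleA"), ("run3", "sampleB")]
print(find_technical_replicates(columns))
```

An empty result would mean the data set is already one-to-one and the existing code paths should be unaffected.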
It's currently not possible to import an experiment where there are multiple bioassays for the same biomaterial.
The simplest example of this scenario is a dataset with technical replicates where a given sample has been sequenced multiple times.
There's an inherent subject factor that has to be incorporated in the design. I doubt this is being done automatically.
The current workaround is to import the dataset with the -nomatch flag, which will create a distinct biomaterial for each bioassay and create a subject factor.
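As a rough illustration of what the design looks like after this workaround, the Python sketch below gives every bioassay its own biomaterial and records the original sample as a subject factor value. This only shows the shape of the result; it does not describe what -nomatch does internally, and `apply_nomatch_workaround`, the naming scheme for the biomaterials and the sample names are all assumptions made for the example.

```python
def apply_nomatch_workaround(bioassay_to_sample):
    """Build a one-to-one design from a bioassay -> original sample mapping.

    Each bioassay gets a distinct (hypothetical) biomaterial name, and the
    original sample is kept as the value of a "subject" factor so that
    technical replicates can still be tied together in the analysis.
    """
    design = {}
    for bioassay, sample in bioassay_to_sample.items():
        design[bioassay] = {
            "biomaterial": f"{sample}__{bioassay}",  # distinct per bioassay
            "subject": sample,                       # shared by replicates
        }
    return design


design = apply_nomatch_workaround(
    {"run1": "sampleA", "run2": "sampleA", "run3": "sampleB"}
)
for bioassay, row in design.items():
    print(bioassay, row["biomaterial"], row["subject"])
```

Here run1 and run2 end up with different biomaterials but the same subject value, which is the relationship the subject factor has to carry.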
There are implications with single cell data because we intend to split a sample into sub-biomaterials. Those sub-biomaterials should share a subject factor.
Related to https://github.com/PavlidisLab/GemmaCuration/issues/500