The data in GEO has undergone some kind of gene-wise normalization, so each gene has a similar mean. As a result, the difference in the means of two genes is no longer a reliable indicator of their relative expression levels.
This primarily affects a subset of GEO submissions using Illumina BeadArrays (for which we never have raw data) and various microarrays used in one-channel mode for which we lack or don't use raw data (if it is even present). It arises when the submitter provided no alternative unnormalized QT within the SOFT file and didn't really know what they were doing (e.g. used canned software like GeneSpring that had a habit of doing this).
It is not terribly common in Gemma, but we have blacklisted some studies for this reason in the past and have generally avoided them.
It shows up in a couple of ways (with some variation):
The sample-sample correlations have a median of around zero, with many negative correlations, instead of being strongly positive (we'd generally expect values of at least 0.7 even for very noisy or variable samples).
The range of mean values across genes is very small and the distribution is symmetric around zero (e.g. -1 to 1 on a log scale), whereas we'd normally expect values from ~0-20 for one-channel data.
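The two symptoms above can be checked numerically. The following is a minimal sketch, not Gemma's actual QT checking code: the function names, the thresholds (`max_median_corr`, `max_mean_range`) and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def gene_centering_symptoms(expr):
    """Summarize the two symptoms for a genes x samples log-scale matrix."""
    expr = np.asarray(expr, dtype=float)
    # Pearson correlation between samples (columns are samples).
    corr = np.corrcoef(expr, rowvar=False)
    off_diag = corr[np.triu_indices_from(corr, k=1)]
    gene_means = expr.mean(axis=1)
    return {
        "median_sample_correlation": float(np.median(off_diag)),
        "gene_mean_range": float(gene_means.max() - gene_means.min()),
    }

def looks_gene_centered(expr, max_median_corr=0.2, max_mean_range=2.0):
    """Heuristic flag; both thresholds are illustrative assumptions."""
    s = gene_centering_symptoms(expr)
    return (s["median_sample_correlation"] < max_median_corr
            and s["gene_mean_range"] < max_mean_range)

# Simulated data: 500 genes, 12 samples, one-channel log2 scale.
rng = np.random.default_rng(0)
baseline = rng.uniform(4, 14, size=(500, 1))             # per-gene level
typical = baseline + rng.normal(0, 0.5, size=(500, 12))  # typical scaling
centered = typical - typical.mean(axis=1, keepdims=True) # gene-wise centering

print(gene_centering_symptoms(typical))
print(gene_centering_symptoms(centered))
```

In the simulated case, removing each gene's mean removes the shared between-gene signal that normally drives the high sample-sample correlations, which is why the median drops to around zero.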
As far as I can tell the impact of this is actually fairly minor, because within a gene things are okay. The differential expression analysis is fine.
But it does have some impact:
Most likely, the QT checking procedures are not going to like it.
The M-V plot is not interpretable as usual (and using voom would certainly be a bad idea, though we only do that for RNA-seq, so this isn't a big deal).
Outlier detection probably won't work correctly.
Fold-changes might be distorted from what we would normally expect. This could happen with other quantifications as well, but a log2 change of 1.0 might not correspond to a two-fold change in expression, as it would in data sets with more typical scaling.
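To make the fold-change point concrete, here is a small sketch for a single simulated gene. It assumes the gene-wise normalization may have included per-gene scaling to unit variance as well as centering; the source data does not tell us which, so treat both variants as hypothetical. The helper names are made up for this example.

```python
import numpy as np

def welch_t(a, b):
    """Welch t-statistic for two 1-D samples."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(va / a.size + vb / b.size)

def summarize(values, n_control=6):
    """Return (treated - control mean difference, Welch t) for one gene."""
    control, treated = values[:n_control], values[n_control:]
    return treated.mean() - control.mean(), welch_t(treated, control)

rng = np.random.default_rng(1)
control = rng.normal(8.0, 0.3, size=6)
treated = rng.normal(9.0, 0.3, size=6)   # roughly a two-fold increase (log2)
gene = np.concatenate([control, treated])

centered = gene - gene.mean()            # gene-wise centering only
scaled = centered / gene.std(ddof=1)     # centering plus unit-variance scaling

raw_diff, raw_t = summarize(gene)
cen_diff, cen_t = summarize(centered)
sca_diff, sca_t = summarize(scaled)

print(f"raw:      diff={raw_diff:.3f} t={raw_t:.3f}")
print(f"centered: diff={cen_diff:.3f} t={cen_t:.3f}")
print(f"scaled:   diff={sca_diff:.3f} t={sca_t:.3f}")
```

Centering alone leaves both the difference and the t-statistic unchanged. Scaling leaves the t-statistic unchanged but divides the difference by the gene's standard deviation, so it can no longer be read as a log2 fold change.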
To avoid having to blacklist these experiments while also limiting confusion, we could add a field to the QT indicating something like "gene-centered". This would alert curators that they shouldn't be bothered by the way the data looks and that the GEEQ score could take a hit (though the poor sample-sample correlations probably take care of that), and it would let the QT checker chill.
If "gene-centered" is too specific, we could instead consider a more generic way to flag that the QT is "non-standard" (by our standards).
This isn't super-important, but it always pains me when experiments we have curated muck up the works over what is mostly not a big problem.
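As a rough sketch of what such a flag could look like, the snippet below models the QT with a scaling field and lets the checks consult it. Everything here is hypothetical: `Scaling`, `QuantitationType`, `plan_checks` and the check names are stand-ins for illustration and do not correspond to Gemma's actual classes or checker.

```python
from dataclasses import dataclass
from enum import Enum

class Scaling(Enum):
    """Hypothetical values for the proposed QT flag."""
    STANDARD = "standard"
    GENE_CENTERED = "gene-centered"
    NON_STANDARD = "non-standard"

@dataclass
class QuantitationType:
    """Hypothetical stand-in for a QT record, not Gemma's class."""
    name: str
    scaling: Scaling = Scaling.STANDARD

# Hypothetical checks that assume typical between-gene scaling.
SCALE_DEPENDENT_CHECKS = ("value_range", "mean_variance_plot", "outlier_detection")
OTHER_CHECKS = ("missing_values",)

def plan_checks(qt):
    """Return {check name: 'run' or 'skip'} for a given QT."""
    flagged = qt.scaling is not Scaling.STANDARD
    plan = {name: "skip" if flagged else "run" for name in SCALE_DEPENDENT_CHECKS}
    plan.update({name: "run" for name in OTHER_CHECKS})
    return plan

qt = QuantitationType(name="VALUE", scaling=Scaling.GENE_CENTERED)
print(plan_checks(qt))
```

Using one field with a generic "non-standard" value alongside "gene-centered" would cover both options discussed above without needing a second flag.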
Examples: