PavlidisLab / Gemma

Genomics data re-analysis
Apache License 2.0
23 stars 6 forks source link

Addressing the "preferred QT" confusion #620

Open ppavlidis opened 1 year ago

ppavlidis commented 1 year ago

There is confusion about the semantics of "preferred" and there might be a basic error in how this works (probably my fault)

The desired behaviour is for preferred to just be a flag to indicate which "raw" expression data is intended to be used as the primary source for "processed" data. The only reason this exists is because some data sets (I'm looking at you, two-color microarrays) there are often multiple numeric quantitation types from GEO. GEO always (?) calls a column "VALUE" in the SOFT file to point us at it.

Then, when we make processed data, we really don't need to care if the processed data has the preferred QT flag. There is only one "expression matrix" for the processed data, so there is no need to distinguish among QTs.

I'm a little confused about what is going on. I don't know if all of the following are violated, but the following behaviour is needed:

I'm marking this as a bug though some of the above are enhancements. I'd say the most downstream actual bug is the thing about the multiQT badge.

arteymix commented 1 year ago

In our current data model, raw and processed data are treated as two separate relations: ExpressionExperiment.rawExpressionDataVectors and ExpressionExperiment.processedExpressionDataVectors. In addition, their QT are marked as preferred via two different corresponding boolean fields: isPreferred and isMaskPreferred. I learned that quite painfully when I had to generate the data download endpoints for the REST API.

In cases where you don't specify the QT, the preferred raw or preferred processed vectors should be used if available. There should never be more than one preferred of each type.

Here's the thing though, sometimes vectors are removed, but the QT attached to the EE is still there since it's a denormalization, so the multi-QT badge can be spurious. I notably fixed dangling QT in 1c9214d2ba7ac810cdda1f67db85f8baa01f2387 when we replace raw vectors.

I think the multi-QT badge is only applicable to raw data.

Most of our code work under the assumption that there is one preferred processed QT (via isMaskPreferred) and most methods that do not expect the QT will retrieve it this way.

To capture your idea of a single processed QT that is by-default preferred per dataset, we would have to get rid of isMaskPreferred and make two subclasses of QuantitationType for raw and processed vectors and use those subclasses in the vectors.

The new data model would be:

class ExpressionExperiment extends BioAssaySet {
    List<QuantitationType> quantitationTypes; // mixture of raw and processed QTs (denormalization
    Set<RawExpressionDataVector> rawExpressionDataVectors; // groupable by QT, one is preferred
    Set<ProcessedExpressionDataVector> processedExpressionDataVectors; // only one
}

class QuantitationType {
    ScaleType scale
    Representation representation
}

class RawQuantitationType extends QuantitationType {
    boolean isPreferred // (at most one per ExpressionExperiment)
}

class ProcessedQuantitationType extends QuantitationType {
    // nothing, just basic QT stuff
}

class DataVector<T extends QuantitationType> {
    byte[] data
    T quatitationType
}

class RawExpressionDataVector extends DataVector<RawQuantitationType> {

}

class ProcessedExpressionDataVector extends DataVector<ProcessedQuantitationType> {

}

That would require an additional class column in the database to distinguish between raw and processed QT type, which is actually a pretty good thing because we are essentially guessing right now.

If we can confirm that there is only one processed QT & set of vectors per datasets, we could start deprecating isMaskPreferred right away and start simplifying some of our logic.

ppavlidis commented 1 year ago

Sounds good, I forgot about the masked-preferred thing -- it is not exposed in the web interface and this probably a big cause of the confusion.

You can test the multiqt behaviour in the web interface, if you make the processed QT and a raw QT preferred the badge shows. Anyway it will be moot. We might need to do some fixing in the database, I'm worried there are datasets where there is no raw preferred QT right now.

Also remember that QTs are used for entities other than expressiondatavectors, so the classes need to support that.

arteymix commented 1 year ago

Deprecated in f634ff20b243373b71424f795da2a24a7221d6b8. There's going to be a lot of work to make DataVector parametrized on by a QT, so we should aim at fixing this in the 1.31 series.

ppavlidis commented 1 year ago

Here is an example of something that is going wrong and which I find confusing - I don't know how to fix it.

https://gemma.msl.ubc.ca/expressionExperiment/showExpressionExperiment.html?shortName=GSE11135

This data set was long flagged as unusable but it turns out the CEL files are available and it finally worked.

However, postprocessing fails because there are now two preferred QTs. However, one of those is a raw data type, and one is the processed data. Butmaking the processed data non-preferred doesn't fix it

2023-04-03 17:24:26,680 ERROR 29856 [main] ubic.gemma.core.util.AbstractCLI.addErrorObject(411) | ExpressionExperiment Id=7201 Name=The MILE (Microarray Innovations In LEukemia) study pre-phase (GSE11135): Cannot pass vectors from more than one quantitation t\
ype
java.lang.IllegalArgumentException: Cannot pass vectors from more than one quantitation type
        at ubic.gemma.core.datastructure.matrix.BaseExpressionDataMatrix.selectVectors(BaseExpressionDataMatrix.java:387) ~[gemma-core-1.29.8.jar:?]
        at ubic.gemma.core.datastructure.matrix.ExpressionDataDoubleMatrix.<init>(ExpressionDataDoubleMatrix.java:77) ~[gemma-core-1.29.8.jar:?]
        at ubic.gemma.core.analysis.preprocess.PreprocessorServiceImpl.getCorrectedData(PreprocessorServiceImpl.java:316) ~[gemma-core-1.29.8.jar:?]
        at ubic.gemma.core.analysis.preprocess.PreprocessorServiceImpl.batchCorrect(PreprocessorServiceImpl.java:168) ~[gemma-core-1.29.8.jar:?]
        at ubic.gemma.core.analysis.preprocess.PreprocessorServiceImpl.process(PreprocessorServiceImpl.java:126) ~[gemma-core-1.29.8.jar:?]
        at ubic.gemma.core.analysis.preprocess.PreprocessorServiceImpl.process(PreprocessorServiceImpl.java:116) ~[gemma-core-1.29.8.jar:?]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_352]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_352]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_352]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_352]
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:96) ~[spring-tx-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:260) ~[spring-tx-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:94) ~[spring-tx-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.retry.interceptor.RetryOperationsInterceptor$1.doWithRetry(RetryOperationsInterceptor.java:69) ~[spring-retry-1.0.3.RELEASE.jar:?]
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:255) ~[spring-retry-1.0.3.RELEASE.jar:?]
        at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:148) ~[spring-retry-1.0.3.RELEASE.jar:?]
        at org.springframework.retry.interceptor.RetryOperationsInterceptor.invoke(RetryOperationsInterceptor.java:90) ~[spring-retry-1.0.3.RELEASE.jar:?]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at com.sun.proxy.$Proxy161.process(Unknown Source) ~[?:?]
        at ubic.gemma.core.loader.expression.DataUpdater.postprocess(DataUpdater.java:1147) ~[gemma-core-1.29.8.jar:?]
        at ubic.gemma.core.loader.expression.DataUpdater.reprocessAffyDataFromCel(DataUpdater.java:526) ~[gemma-core-1.29.8.jar:?]
        at ubic.gemma.core.loader.expression.DataUpdater$$FastClassBySpringCGLIB$$389cfc7.invoke(<generated>) ~[spring-core-3.2.18.RELEASE.jar:?]
        at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) ~[spring-core-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:701) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:633) ~[spring-aop-3.2.18.RELEASE.jar:3.2.18.RELEASE]
        at ubic.gemma.core.loader.expression.DataUpdater$$EnhancerBySpringCGLIB$$a592aafb.reprocessAffyDataFromCel(<generated>) ~[spring-core-3.2.18.RELEASE.jar:?]
        at ubic.gemma.core.apps.AffyDataFromCelCli.doWork(AffyDataFromCelCli.java:163) ~[gemma-cli-1.29.8.jar:?]
        at ubic.gemma.core.util.AbstractCLI.executeCommand(AbstractCLI.java:141) ~[gemma-cli-1.29.8.jar:?]
        at ubic.gemma.core.apps.GemmaCLI.main(GemmaCLI.java:139) ~[gemma-cli-1.29.8.jar:?]
ppavlidis commented 1 year ago

By the way, as I mentioned I was worried that changing the QT class hierarchy will be complicated, so it seems easier if we just clarify the name/meaning of "masked preferred" and expose it to the web interface, and firm up the logic so that we don't have situations like the one I just encountered. Something to consider.

arteymix commented 2 weeks ago

To add on top of this, for single-cell, we will need a third subclass with an additional field for indicating which set of vectors is preferred.

In the single-cell branch, I already reworked the raw and processed vectors hierarchy so we can treat batch expression data differently from single-cell one.