cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0
660 stars 522 forks source link

Combination of categorical filter and filering on "unknown" in numerical produces discrepancy #11204

Open alisman opened 4 days ago

alisman commented 4 days ago

curl 'http://localhost:8082/api/column-store/filtered-samples/fetch' \ -H 'Accept: application/json' \ -H 'Accept-Language: en-US,en;q=0.9' \ -H 'Cache-Control: no-cache' \ -H 'Connection: keep-alive' \ -H 'Content-Type: application/json' \ -H 'Cookie: _ga=GA1.1.1887007066.1710956751; _ga_5260NDGD6Z=GS1.1.1721930557.11.1.1721930574.0.0.0' \ -H 'Origin: http://localhost:8082' \ -H 'Pragma: no-cache' \ -H 'Referer: http://localhost:8082/study/summary?id=genie_public' \ -H 'Sec-Fetch-Dest: empty' \ -H 'Sec-Fetch-Mode: cors' \ -H 'Sec-Fetch-Site: same-origin' \ -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36' \ -H 'sec-ch-ua: "Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"' \ -H 'sec-ch-ua-mobile: ?0' \ -H 'sec-ch-ua-platform: "macOS"' \ --data-raw '{"clinicalDataFilters":[{"attributeId":"CANCER_TYPE_DETAILED","values":[{"value":"BREAST INVASIVE DUCTAL CARCINOMA"}]},{"attributeId":"AGE_AT_SEQ_REPORT","values":[{"start":65,"end":70},{"start":70,"end":75},{"start":75,"end":80}]}],"studyIds":["genie_public"],"alterationFilter":{"copyNumberAlterationEventTypes":{"AMP":true,"HOMDEL":true},"mutationEventTypes":{"any":true},"structuralVariants":null,"includeDriver":true,"includeVUS":true,"includeUnknownOncogenicity":true,"includeUnknownTier":true,"includeGermline":true,"includeSomatic":true,"includeUnknownStatus":true,"tiersBooleanMap":{}}}'

onursumer commented 3 days ago

This curl doesn't seem to produce any discrepancy. I think it is missing the UNKNOWN filter.

Updating the filters to something like this reproduces the issue:

{"clinicalDataFilters":[{"attributeId":"CANCER_TYPE_DETAILED","values":[{"value":"BREAST INVASIVE DUCTAL CARCINOMA"}]},{"attributeId":"AGE_AT_SEQ_REPORT","values":[{"start":80,"end":85},{"start":85},{"value":"UNKNOWN"}]}],"studyIds":["genie_public"],"alterationFilter":{"copyNumberAlterationEventTypes":{"AMP":true,"HOMDEL":true},"mutationEventTypes":{"any":true},"structuralVariants":null,"includeDriver":true,"includeVUS":true,"includeUnknownOncogenicity":true,"includeUnknownTier":true,"includeGermline":true,"includeSomatic":true,"includeUnknownStatus":true,"tiersBooleanMap":{}}}
alisman commented 3 days ago

@onursumer any idea why it happens?

onursumer commented 3 days ago

Still looking into it. It happens when UNKNOWN is selected together with numerical values. It doesn't happen only UNKNOWN is selected.

onursumer commented 2 days ago

@alisman Looks like we are not properly handling the case where numerical and categorical filters are together in the same ClinicalDataFilter.

We have a method to check whether a filter is catergorical or not, but it only checks the first filter in the list. https://github.com/cBioPortal/cbioportal/blob/79dc173f271037a9c617ea78ef660fa82f2767df/src/main/resources/org/cbioportal/persistence/mybatisclickhouse/StudyViewFilterMapper.xml#L280 https://github.com/cBioPortal/cbioportal/blob/ceed01f3b907158109c09b7b78b7a1600df3e24b/src/main/java/org/cbioportal/persistence/helper/StudyViewFilterHelper.java#L84-L87

So, for example, we treat the clinical data filter below as a numerical filter because the first filter is numerical.

image

The SQL dealing with numerical filters only checks for NA and ignores any other value.

https://github.com/cBioPortal/cbioportal/blob/79dc173f271037a9c617ea78ef660fa82f2767df/src/main/resources/org/cbioportal/persistence/mybatisclickhouse/StudyViewFilterMapper.xml#L358-L364

So, for this specific example we always ignore UNKNOWN. I guess we need to somehow improve the SQL to handle both numerical and categorical filters at the same time.

onursumer commented 2 days ago

Link for the above filter: http://localhost:8082/study/summary?id=genie_public#filterJson={"clinicalDataFilters":[{"attributeId":"AGE_AT_SEQ_REPORT","values":[{"start":80,"end":85},{"start":85},{"value":"UNKNOWN"}]}],"studyIds":["genie_public"],"alterationFilter":{"copyNumberAlterationEventTypes":{"AMP":true,"HOMDEL":true},"mutationEventTypes":{"any":true},"structuralVariants":null,"includeDriver":true,"includeVUS":true,"includeUnknownOncogenicity":true,"includeUnknownTier":true,"includeGermline":true,"includeSomatic":true,"includeUnknownStatus":true,"tiersBooleanMap":{}}}

image image

The filtering result is identical this one (because we ignore UNKNOWN): http://localhost:8082/study/summary?id=genie_public#filterJson={"clinicalDataFilters":[{"attributeId":"AGE_AT_SEQ_REPORT","values":[{"start":80,"end":85},{"start":85}]}],"studyIds":["genie_public"],"alterationFilter":{"copyNumberAlterationEventTypes":{"AMP":true,"HOMDEL":true},"mutationEventTypes":{"any":true},"structuralVariants":null,"includeDriver":true,"includeVUS":true,"includeUnknownOncogenicity":true,"includeUnknownTier":true,"includeGermline":true,"includeSomatic":true,"includeUnknownStatus":true,"tiersBooleanMap":{}}}

image image