cBioPortal / rfc80-team

repository to hold issues for the rfc80 development / deployment team
0 stars 0 forks source link

Validate current clickhouse query #11

Open alisman opened 1 month ago

alisman commented 1 month ago

@uklineale @haynescd

Here's are a few examples from the public portal of clinical data which is declared type=numeric but has non-numerical values

image.png

You can obtain this list with a query like this. There is also clinical_sample table to look through. We need to figure out:

SELECT DISTINCT ATTR_VALUE FROM clinical_patient targ
                  JOIN clinical_attribute_meta cam on targ.ATTR_ID = cam.ATTR_ID
WHERE DATATYPE = 'number'
  AND NOT REGEXP_LIKE(ATTR_VALUE, '^-?[0-9.]+$')
UNION
SELECT DISTINCT ATTR_VALUE FROM clinical_sample targ
                                    JOIN clinical_attribute_meta cam on targ.ATTR_ID = cam.ATTR_ID
WHERE DATATYPE = 'number'
  AND NOT REGEXP_LIKE(ATTR_VALUE, '^-?[0-9.]+$')