cBioPortal / datahub

A centralized location for storing curated data from cBioPortal

Data issue for study: coadread_cass_2020 #2054

Open saxenanurag opened 2 weeks ago

saxenanurag commented 2 weeks ago

I am trying to import coadread_cass_2020 into a private installation of cBioPortal and am getting this error:

ERROR: data_clinical_patient.txt: lines [74, 108]: columns [18, 19]: Value of numeric attribute is not a real number; values encountered: ['<0.5', '<2.0']

I downloaded the files directly from cbioportal.org as well and got the same error.

alexsigaras commented 1 week ago

Thanks @saxenanurag. I can confirm we are having the same issue on our end. The error refers to values in the CEA Biomarker and CA19-9 Antigen columns.

The issue is that <0.5 is a STRING, not a NUMBER. The datatype could be changed on line 3 of the file for the respective columns.

Looking at https://www.cbioportal.org/study/clinicalData?id=coadread_cass_2020, it seems that the data are imported with the < and > symbols, so perhaps a fix would be to change the datatype of the problematic columns in data_clinical_patient.txt from NUMBER to STRING.

Kindly let us know if you would like us to open a PR instead.
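
For anyone applying the NUMBER-to-STRING workaround locally, here is a minimal Python sketch of the edit. It assumes the datatype row is line 3 of data_clinical_patient.txt, that cells are tab-separated, and that the column numbers in the validator message ([18, 19]) are 1-based. The helper name retype_columns and the sample rows (including the attribute IDs CEA and CA19_9) are made up for illustration and are not taken from the study files.

```python
def retype_columns(lines, columns, new_type="STRING", type_row=3):
    """Return a copy of `lines` with the datatype cell replaced for the given columns.

    `columns` holds 1-based column numbers (assumed to match the validator message).
    `type_row` is the 1-based line that holds the datatypes (assumed to be line 3).
    """
    out = list(lines)
    row = out[type_row - 1]
    newline = "\n" if row.endswith("\n") else ""
    cells = row.rstrip("\r\n").split("\t")
    for col in columns:
        cell = cells[col - 1]
        # The first cell of a metadata row carries the leading '#'; keep it.
        prefix = "#" if cell.startswith("#") else ""
        cells[col - 1] = prefix + new_type
    out[type_row - 1] = "\t".join(cells) + newline
    return out


# Hypothetical three-column excerpt; the real file has many more columns.
sample = [
    "#Patient Identifier\tCEA Biomarker\tCA19-9 Antigen\n",
    "#Patient identifier\tCEA level\tCA19-9 level\n",
    "#STRING\tNUMBER\tNUMBER\n",
    "#1\t1\t1\n",
    "PATIENT_ID\tCEA\tCA19_9\n",
    "P-001\t<0.5\t<2.0\n",
]
fixed = retype_columns(sample, columns=[2, 3])
print(fixed[2], end="")
```

To apply this to the real file, read it with readlines(), call retype_columns(lines, columns=[18, 19]), and write the result back. Check line 3 by eye first to confirm that those positions really are the two biomarker columns.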

rmadupuri commented 1 week ago

Hi @saxenanurag @alexsigaras, thank you for bringing this issue to our attention. I have updated the validator to accept >, <, and float values as numbers (see PR #58). However, this update will only be available in the next release. In the meantime, please feel free to change the column type to STRING on your side to bypass the validator check.
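
To make the described behavior concrete, here is a rough Python sketch of a numeric check that tolerates a leading < or >. It is not the code from PR #58, and the function name is_numeric_value is hypothetical. It only illustrates one way such a rule could work: strip a single comparison prefix, then require the remainder to parse as a finite float. Missing-value placeholders, which a real validator would likely handle separately, are not covered.

```python
import math


def is_numeric_value(value):
    """Return True for a finite float, optionally prefixed by one '<' or '>'."""
    text = value.strip()
    # Drop at most one leading comparison symbol, e.g. '<0.5' -> '0.5'.
    if text[:1] in ("<", ">"):
        text = text[1:].strip()
    try:
        number = float(text)
    except ValueError:
        return False
    # Reject 'nan' and 'inf', which float() would otherwise accept.
    return math.isfinite(number)


# The two values from the error message, plus one that should still fail.
for candidate in ["<0.5", "<2.0", "not_a_number"]:
    print(candidate, is_numeric_value(candidate))
```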

alexsigaras commented 3 days ago

Thank you for your response, @rmadupuri. Indeed, as suggested, changing NUMBER to STRING solves the issue, but your solution above is a much better approach. I suggest keeping this open until the PR is part of a release.