cBioPortal / datahub

A centralized location for storing curated data from cBioPortal

coadread_tcga & ov_tcga: errors in mass spectrometry data #32

Closed: sandertan closed this issue 7 years ago

sandertan commented 7 years ago

To circumvent this problem, remove the protein_quantification files from these studies.

jjgao commented 7 years ago

@pambot could you check if the mass spec data in the datahub is the same as the ones you sent to us?

pambot commented 7 years ago

@jjgao @zheins @sandertan I found the problem. It was totally my fault, one of those off-by-one things that I didn't catch because there were no errors and the output didn't seem off to me at the time. I made the following changes: a) corrected the concatenation problem that was causing the weird data, b) removed duplicate columns by NaN-averaging them (averaging the duplicates while ignoring missing values), and c) kept missing data as NaN instead of 0. The data has been sent to Zack through Slack. Sorry about everything!
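
For readers wondering what "NaN-averaging" duplicate columns could look like, here is a minimal sketch. It is not the script used for these studies. It assumes the table is a pandas DataFrame with samples as columns; the function name `nan_average_duplicate_columns` and the sample and gene labels are hypothetical.

```python
import numpy as np
import pandas as pd

def nan_average_duplicate_columns(df):
    """Collapse columns sharing a label into one column, averaging while ignoring NaN."""
    # Transpose so duplicate column labels become duplicate index labels,
    # group on them, and let mean() skip NaN (its default behaviour).
    # sort=False keeps the first-seen order of the labels.
    return df.T.groupby(level=0, sort=False).mean().T

# Hypothetical data: SAMPLE_1 appears twice.
df = pd.DataFrame(
    [[1.0, 3.0, 5.0],
     [np.nan, 2.0, 6.0],
     [np.nan, np.nan, 7.0]],
    index=["GENE_A", "GENE_B", "GENE_C"],
    columns=["SAMPLE_1", "SAMPLE_1", "SAMPLE_2"],
)
merged = nan_average_duplicate_columns(df)
print(merged)
```

In this sketch GENE_C has no measurement in either copy of SAMPLE_1, so the merged value stays NaN rather than becoming 0, which matches change c) above.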

zheins commented 7 years ago

Github autoclosed the issue when I merged the PR. @sandertan if you'd like to review the data before we close this that would be great.

sandertan commented 7 years ago

Thanks @pambot ! I'll test it @zheins .

oplantalech commented 7 years ago

I reviewed the data:

When I try to validate the studies, I still find three errors:

There are still some errors, but we're almost there!

oplantalech commented 7 years ago

The errors in data_protein_quantification and data_protein_quantification_Zscores also happen in the mass spectrometry data in brca_tcga (I opened an issue for this specific study here). In the case of the file data_protein_quantification_Zscores, the problem happens when the second-to-last column has NaN as its value. Could this be an error in a script that parses the data, or is data missing? @zheins
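
One way to list the affected rows is sketched below. It assumes a tab-delimited table with an identifier in the first column; the column names, the sample contents and the helper `rows_with_nan_in_second_to_last` are hypothetical and not taken from the real study files.

```python
import io
import pandas as pd

def rows_with_nan_in_second_to_last(df):
    """Return the rows whose second-to-last column is missing (NaN)."""
    return df[df.iloc[:, -2].isna()]

# Hypothetical tab-delimited sample standing in for a Z-scores file.
sample = (
    "ID\tSAMPLE_1\tSAMPLE_2\tSAMPLE_3\n"
    "GENE_A\t0.5\t1.2\t0.7\n"
    "GENE_B\t0.1\tNaN\t0.3\n"
)
table = pd.read_csv(io.StringIO(sample), sep="\t", index_col=0)
suspect = rows_with_nan_in_second_to_last(table)
print(suspect.index.tolist())
```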

zheins commented 7 years ago

@oplantalech Found where the issues were stemming from - I think the files should be correct. Any NA or NaN values should be accurate.

oplantalech commented 7 years ago

@zheins Just for the record, I tested the studies again and they pass the validation and load correctly. Great job!