coadread_tcga & ov_tcga: errors in mass spectrometry data

sandertan commented 7 years ago

To circumvent this problem, remove the protein_quantification files from these studies.

[x] coadread_tcga: Both data_protein_quantification and data_protein_quantification_Zscores have duplicate column names, with different values in the columns (some are 0, while others are >20). This does not look correct. The values look too different to be technical replicates.


[x] ov_tcga data_protein_quantification.txt: A lot of data is missing, which does not seem correct. Where data is missing, the field should contain NA instead of being empty.


[x] ov_tcga data_protein_quantification_Zscores.txt: The last column is missing data in the lower 75% of its rows.
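The checks described in the list above (duplicate column names, empty fields) can be reproduced with a short script. This is only a sketch, assuming the files are tab-delimited with a single header row; `inspect_matrix` and the sample IDs `S1`/`S2` are hypothetical names made up for illustration, not part of the datahub or the validator.

```python
import csv
import io
from collections import Counter


def inspect_matrix(text, delimiter="\t"):
    """Return (duplicated column names, empty-cell count per column index)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    # Column names that occur more than once in the header.
    duplicates = sorted(name for name, n in Counter(header).items() if n > 1)
    # Count empty cells by column index (names may be duplicated, so they
    # cannot serve as keys); short rows count as missing trailing cells.
    empty = Counter()
    for row in body:
        for i in range(len(header)):
            if i >= len(row) or row[i].strip() == "":
                empty[i] += 1
    return duplicates, dict(empty)


# Hypothetical miniature file: S1 appears twice, GENE_B has two empty fields.
sample = (
    "Composite.Element.REF\tS1\tS1\tS2\n"
    "GENE_A\t0\t21.5\t1.2\n"
    "GENE_B\t0.4\t\t\n"
)
duplicates, empty = inspect_matrix(sample)
print(duplicates)
print(empty)
```

Counting by column index rather than by name keeps the two `S1` columns apart, which matters here because the duplicated columns carry different values.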

jjgao commented 7 years ago

@pambot could you check if the mass spec data in the datahub is the same as the ones you sent to us?

pambot commented 7 years ago

@jjgao @zheins @sandertan I found the problem. It was totally my fault: one of those off-by-one errors that I didn't catch because nothing failed and the output didn't seem off to me at the time. I made the following changes: a) corrected the concatenation problem that was causing the weird data, b) removed duplicate columns by NaN-averaging them (averaging only the values that are present), and c) kept missing data as NaN instead of 0. The data has been sent to Zack through Slack. Sorry about everything!
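For readers unfamiliar with the term, here is a minimal sketch of what NaN-averaging duplicate columns could look like. It is not the script that was actually used for the fix; `nan_mean`, `merge_duplicate_columns`, and the sample IDs are hypothetical, and the values are assumed to be already parsed to floats with missing entries represented as NaN.

```python
import math


def nan_mean(values):
    """Mean of the non-NaN values, or NaN if every value is missing."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan


def merge_duplicate_columns(header, rows):
    """Collapse columns that share a name into one column by NaN-averaging."""
    positions = {}  # column name -> list of column indices, first-seen order
    for i, name in enumerate(header):
        positions.setdefault(name, []).append(i)
    merged_rows = [
        [nan_mean([row[i] for i in idx]) for idx in positions.values()]
        for row in rows
    ]
    return list(positions), merged_rows


# Hypothetical data: S1 is duplicated; the second row has missing values.
header = ["S1", "S1", "S2"]
rows = [[1.0, 3.0, 5.0], [math.nan, 4.0, math.nan]]
new_header, new_rows = merge_duplicate_columns(header, rows)
print(new_header)
print(new_rows)
```

A missing value stays NaN only when no duplicate supplies a measurement, which is consistent with point c) above: absent data is not turned into 0.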

zheins commented 7 years ago

GitHub auto-closed the issue when I merged the PR. @sandertan if you'd like to review the data before we close this, that would be great.

sandertan commented 7 years ago

Thanks @pambot ! I'll test it @zheins .

oplantalech commented 7 years ago

I reviewed the data:

coadread_tcga: data_protein_quantification and data_protein_quantification_Zscores do not contain duplicate columns anymore.
ov_tcga data_protein_quantification.txt: there are still some empty fields in the last column.
ov_tcga data_protein_quantification_Zscores.txt: the last column has more data than before but it still contains quite a lot of empty fields. I guess they should be filled with NaN, but I'm not sure.

When I try to validate the studies, I still find three errors:

The column name Hugo_Symbol must be replaced with Composite.Element.REF in data_protein_quantification.txt and data_protein_quantification_Zscores of both studies.
Not all lines of data_protein_quantification_Zscores of coadread_tcga have the same number of columns.
Last column of data_protein_quantification of both studies has empty fields. It seems that for all those lines, a value for a certain sample is missing.
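The three errors above can be illustrated with a small stand-alone check. This is a sketch only and not the cBioPortal validator; `check_rows` is a hypothetical helper, the sample lines are made up, and tab-delimited input with one header line is assumed.

```python
def check_rows(lines, expected_first="Composite.Element.REF", delimiter="\t"):
    """Return a list of (line_number, message) problems found in the file."""
    problems = []
    header = lines[0].rstrip("\n").split(delimiter)
    # Error 1: the first header cell must carry the expected column name.
    if header[0] != expected_first:
        problems.append(
            (1, f"first column is {header[0]!r}, expected {expected_first!r}")
        )
    for number, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split(delimiter)
        # Error 2: every line must have as many fields as the header.
        if len(fields) != len(header):
            problems.append(
                (number, f"{len(fields)} columns, expected {len(header)}")
            )
        # Error 3: the last field is present but empty.
        elif fields[-1] == "":
            problems.append((number, "empty value in last column"))
    return problems


# Hypothetical file exhibiting all three problems.
lines = [
    "Hugo_Symbol\tS1\tS2\n",
    "GENE_A\t0.1\t0.2\n",
    "GENE_B\t0.3\n",
    "GENE_C\t0.4\t\n",
]
problems = check_rows(lines)
for problem in problems:
    print(problem)
```

Only the trailing newline is stripped before splitting, so a trailing tab still yields an empty last field instead of being silently dropped.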

There are still some errors, but we're almost there!

oplantalech commented 7 years ago

The errors in data_protein_quantification and data_protein_quantification_Zscores also occur in the mass spectrometry data in brca_tcga (I opened an issue for this specific study here). In data_protein_quantification_Zscores, the problem appears when the second-to-last column has NaN as its value. Could this be a bug in a script that parses the data, or is the data actually missing? @zheins [screenshot 2017-06-22 at 17 12 24]

zheins commented 7 years ago

@oplantalech Found where the issues were stemming from - I think the files should be correct. Any NA or NaN values should be accurate.

oplantalech commented 7 years ago

@zheins Just for the record, I tested the studies again and they pass the validation and load correctly. Great job!

cBioPortal / datahub

coadread_tcga & ov_tcga: errors in mass spectrometry data #32