Closed sandertan closed 7 years ago
@pambot could you check if the mass spec data in the datahub is the same as the ones you sent to us?
@jjgao @zheins @sandertan I found the problem. It was totally my fault, one of those off-by-one errors that I didn't catch because nothing failed and the output didn't seem off to me at the time. I made the following changes: a) corrected the concatenation problem that was causing the weird data, b) removed duplicate columns by NaN-averaging them, and c) kept missing data as NaN instead of 0. The data has been sent to Zack through Slack. Sorry about everything!
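For readers following along, here is a minimal sketch of what change (b), NaN-averaging duplicate columns, could look like. It is not the script used for this fix; the function names `nan_mean` and `merge_duplicate_columns` and the list-of-rows layout are assumptions made for illustration. Missing values stay NaN rather than becoming 0, in line with change (c).

```python
import math


def nan_mean(values):
    """Mean of the non-NaN values; NaN if every value is missing."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan


def merge_duplicate_columns(header, rows):
    """Collapse columns that share a name into one NaN-aware mean column.

    header: list of column names (may contain duplicates)
    rows:   list of rows, each a list of floats aligned with header
    (Hypothetical helper, not taken from the actual pipeline.)
    """
    positions = {}  # column name -> column indices, in first-seen order
    for idx, name in enumerate(header):
        positions.setdefault(name, []).append(idx)

    new_header = list(positions)
    new_rows = [
        [nan_mean([row[i] for i in idxs]) for idxs in positions.values()]
        for row in rows
    ]
    return new_header, new_rows
```

A value that is missing in one duplicate is taken from the other, and a value missing in all duplicates remains NaN.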
GitHub auto-closed the issue when I merged the PR. @sandertan if you'd like to review the data before we close this, that would be great.
Thanks @pambot ! I'll test it @zheins .
I reviewed the data:
- coadread_tcga: data_protein_quantification and data_protein_quantification_Zscores do not contain duplicate columns anymore.
- ov_tcga data_protein_quantification.txt: there are still some empty fields in the last column.
- ov_tcga data_protein_quantification_Zscores.txt: the last column has more data than before, but it still contains quite a lot of empty fields. I guess they should be filled with NaN, but I'm not sure.
When I try to validate the studies, I still find three errors:
- Hugo_Symbol must be replaced with Composite.Element.REF in data_protein_quantification.txt and data_protein_quantification_Zscores of both studies.
- data_protein_quantification_Zscores of coadread_tcga have the same number of columns.
- data_protein_quantification of both studies has empty fields. It seems that for all those lines, a value for a certain sample is missing.
There are still some errors, but we're almost there!
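The kinds of problems reported in this thread (duplicate column names, lines whose column count differs from the header, empty fields) can be spotted with a quick standalone check. The sketch below is not the cBioPortal validator; `check_table` is a hypothetical helper that only assumes the files are tab-separated with a single header line.

```python
import csv
import io
from collections import Counter


def check_table(text):
    """Return a list of human-readable problems found in tab-separated text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t",
                           quoting=csv.QUOTE_NONE))
    if not rows:
        return ["file is empty"]

    problems = []
    header, body = rows[0], rows[1:]

    # Duplicate column names in the header.
    for name, count in Counter(header).items():
        if count > 1:
            problems.append(f"duplicate column name: {name} ({count} times)")

    # Data lines are numbered from 2 because line 1 is the header.
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: {len(row)} columns, header has {len(header)}"
            )
        for col, value in enumerate(row, start=1):
            if value.strip() == "":
                problems.append(f"line {line_no}: empty field in column {col}")
    return problems
```

Running it on the contents of a data file, for example `check_table(open(path).read())`, gives a list that is empty when none of these three problems is present.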
The errors in data_protein_quantification and data_protein_quantification_Zscores also happen in the mass spectrometry data in brca_tcga (I opened an issue for this specific study here). In the case of the file data_protein_quantification_Zscores, the problem happens when the second-to-last column has NaN as value. Could this be an error of a script that parses the data or is data missing? @zheins
@oplantalech Found where the issues were stemming from - I think the files should be correct. Any NA or NaN values should be accurate.
@zheins Just for the record, I tested the studies again and they pass the validation and load correctly. Great job!
To circumvent this problem, remove the protein_quantification files from these studies.
- coadread_tcga: Both data_protein_quantification and data_protein_quantification_Zscores have duplicate column names, with different values in the columns (some even 0, and others >20). This does not look correct. They look a bit too different to be technical replicates.
- ov_tcga data_protein_quantification.txt: There's a lot of missing data, which does not seem correct. When data is missing, these values should contain NA instead of being empty.
- ov_tcga data_protein_quantification_Zscores.txt: The last column misses the lower 75% of data.
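As a rough illustration of the "NA instead of empty" convention mentioned above, the sketch below rewrites one tab-separated line so that empty fields become NA. `fill_missing` is a hypothetical helper. Padding short lines on the right is an assumption (that trailing values were dropped), and it only makes sense if the data really is missing rather than lost by a parsing bug.

```python
def fill_missing(line, n_columns, placeholder="NA"):
    """Replace empty fields in a tab-separated line with a placeholder.

    n_columns is the number of columns in the header. Lines with fewer
    fields are padded on the right, assuming trailing values were dropped.
    """
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (n_columns - len(fields))  # no-op if the line is long enough
    return "\t".join(f if f.strip() else placeholder for f in fields)
```

For example, `fill_missing("TP53\t0.5\t\n", 3)` returns the line with NA in the third field.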