Closed jjacobson95 closed 6 months ago
I'm not sure why this isn't aligned with the schema. I pulled the data from PharmacoGx, so that's the source. The drug data inherently comes from a different place than the omics data.
Sorry, I guess technically it's aligned with the schema, just not mappable. So I can't determine which drug comes from which source (DepMap, Sanger, CCLE, etc.). Not sure if it's an issue for the end users or not, but they won't be able to separate experiments out by source. Same with the drug figures: I won't be able to color by source.
The experiment should map to a sample ID, which comes from a source. The experiment source is different. The name of the study is preserved in the experiments file. If you provide a specific use case where the mapping fails, I can be more specific.
The sample IDs overlap if they come from multiple sources, though. So in the screenshots above, improve_sample_id 1 maps to 5 different sources. For example, which source does the improve_sample_id with a fit_auc value of .9987 come from?
That's because source and study should not be part of the sample schema at all, and I removed them. Sample 1 maps to five different other_id values, each of which has an other_id_source.
Okay, how could I tell which other_id or other_id_source the improve_sample_id with a fit_auc value of .9987 maps to?
You join the experiments file with the samples file on improve_sample_id.
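A minimal sketch of that join in pandas, assuming both files are loaded as DataFrames. The column names improve_sample_id, fit_auc, other_id, and other_id_source come from this thread; the `study` column name and every value below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the real files have more columns and rows.
experiments = pd.DataFrame({
    "improve_sample_id": [1, 2],
    "fit_auc": [0.9987, 0.75],
    "study": ["study_a", "study_b"],  # assumed column name
})
samples = pd.DataFrame({
    "improve_sample_id": [1, 1, 2],
    "other_id": ["id_a", "id_b", "id_c"],
    "other_id_source": ["source_x", "source_y", "source_x"],
})

# Left join on the shared key. Each experiment row is repeated once
# per matching samples row, because the samples file is in long format.
joined = experiments.merge(samples, on="improve_sample_id", how="left")
print(joined)
```

With these toy rows, the experiment for improve_sample_id 1 appears twice in `joined`, once per other_id.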
But isn't it ambiguous if improve_sample_id 1 maps to multiple values depending on the row in the samples file?
I don't see how. It's the same sample, just different 'other_id' values. If you remove the two columns (other_id, other_id_source) and drop duplicates, you'll get a single mapping. This is a known feature/bug of long data, which is the format we have gone with, mainly to save space from missing values: https://www.thedataschool.com.au/mipadmin/the-shape-of-data-long-vs-wide/
this is by design
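The drop-and-deduplicate step can be sketched in pandas roughly like this. The `common_name` column and all values are hypothetical stand-ins; only improve_sample_id, fit_auc, other_id, and other_id_source are names taken from this thread.

```python
import pandas as pd

# Hypothetical long-format samples table: one row per (sample, other_id).
samples = pd.DataFrame({
    "improve_sample_id": [1, 1, 1, 2],
    "common_name": ["cell_line_a", "cell_line_a", "cell_line_a", "cell_line_b"],  # assumed column
    "other_id": ["id_a", "id_b", "id_c", "id_d"],
    "other_id_source": ["source_x", "source_y", "source_z", "source_x"],
})
experiments = pd.DataFrame({
    "improve_sample_id": [1, 1, 2],
    "fit_auc": [0.9987, 0.42, 0.75],
})

# Remove the two long-format columns, then collapse repeated rows so
# each improve_sample_id appears exactly once.
unique_samples = (
    samples.drop(columns=["other_id", "other_id_source"])
    .drop_duplicates()
)

# validate="many_to_one" makes pandas raise if a sample ID is still
# duplicated on the right-hand side.
joined = experiments.merge(
    unique_samples, on="improve_sample_id", how="left", validate="many_to_one"
)
print(joined)
```

After deduplication the join keeps one row per experiment. It no longer carries other_id_source, so it does not by itself say which source a given experiment row came from.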
In this experiments file, we expect to have numerous sources, depending on which datasets the experiments are derived from. The samples file correctly contains these sources; however, they can't be mapped to the experiments or drugs files by any ID.
Experiments file:
Samples File:
*Note: the column names in the samples file have been slightly modified manually (other_id_source was duplicated in the source column).