PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

broad_sanger experiments file sources are all pharmacoGX #123

Closed jjacobson95 closed 6 months ago

jjacobson95 commented 6 months ago

In this experiments file, we expect to have numerous sources, depending on which datasets the experiments are derived from. The samples file correctly contains these sources, but they can't be mapped to the experiments / drugs files by any id.

Experiments file:

[Screenshot 2024-03-29 at 1:16:37 PM]

Samples File:

[Screenshot 2024-03-29 at 1:17:20 PM]

*Note: the column names in the samples file have been slightly modified manually (other_id_source was duplicated in the source column).

sgosline commented 6 months ago

I'm not sure why this isn't aligned with the schema. I pulled the data from PharmacoGX, so that's the source. The drug data inherently comes from a different place than the omics data.

jjacobson95 commented 6 months ago

Sorry, I guess technically it's aligned with the schema, just not mappable. So I can't determine which drug comes from which source (DepMap, Sanger, CCLE, etc). Not sure if it's an issue for the end users or not, but they won't be able to separate experiments out by source. Same with the drug figures: I won't be able to color by source.

sgosline commented 6 months ago

The experiment should map to a sample id, which comes from a source. The experiment source is different. The name of the study is preserved in the experiments file. If you provide a specific use case where the mapping fails, I can be more specific.

jjacobson95 commented 6 months ago

The sample ids overlap if they come from multiple sources, though. So in the screenshots above, improve_sample_id 1 maps to 5 different sources. For example, which source does the improve_sample_id with a fit_auc value of .9987 come from?

sgosline commented 6 months ago

That's because source and study should not be part of the sample schema at all, and I removed them. Sample 1 maps to five different other_id values, each of which has an other_id_source.

jjacobson95 commented 6 months ago

Okay, how could I tell which other_id or other_id_source the improve_sample_id with a fit_auc value of .9987 maps to?

sgosline commented 6 months ago

You join the experiments file with the samples file on improve_sample_id.
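A minimal sketch of that join, assuming pandas and two toy tables standing in for the real files. The column names improve_sample_id, other_id, other_id_source, source and fit_auc come from this thread; the study column name, the other_id values, the study labels and every number except .9987 are made up for illustration. It shows that the join succeeds, and also why the question about ambiguity arises: each experiment row is repeated once per other_id row of its sample.

```python
import pandas as pd

# Toy stand-in for the samples file (long format: one row per other_id).
# The other_id values here are hypothetical placeholders.
samples = pd.DataFrame({
    "improve_sample_id": [1, 1, 1, 2],
    "other_id": ["ID-A", "ID-B", "ID-C", "ID-D"],
    "other_id_source": ["DepMap", "Sanger", "CCLE", "Sanger"],
})

# Toy stand-in for the experiments file; study names are hypothetical.
experiments = pd.DataFrame({
    "improve_sample_id": [1, 2],
    "study": ["study_x", "study_y"],
    "source": ["pharmacoGX", "pharmacoGX"],
    "fit_auc": [0.9987, 0.5],
})

# Left join on the shared key. Sample 1 has three rows in `samples`,
# so its single experiment row appears three times in the result.
joined = experiments.merge(samples, on="improve_sample_id", how="left")
print(joined)
```

The experiment with fit_auc .9987 ends up paired with every other_id_source recorded for sample 1, so the join tells you which identifiers the sample has, not which one a given measurement "came from".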

jjacobson95 commented 6 months ago

But isn't it ambiguous if improve_sample_id 1 maps to multiple values depending on the row in the samples file?

sgosline commented 6 months ago

I don't see how - it's the same sample, just different 'other_id' values. If you remove the two columns (other_id, other_id_source) and drop duplicates, you'll get a single mapping. This is a known feature/bug of long data, which is the format we have gone with, mainly to save space from missing values: https://www.thedataschool.com.au/mipadmin/the-shape-of-data-long-vs-wide/
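The drop-and-deduplicate step described above can be sketched as follows, again assuming pandas and toy data. The cell_line column and all values are hypothetical; only improve_sample_id, other_id, other_id_source and fit_auc are names taken from this thread. The last step, collecting the sources per sample into a list, is an extra illustration and not something proposed in the discussion.

```python
import pandas as pd

# Toy long-format samples table; `cell_line` is a hypothetical extra column.
samples = pd.DataFrame({
    "improve_sample_id": [1, 1, 1, 2],
    "cell_line": ["line_1", "line_1", "line_1", "line_2"],
    "other_id": ["ID-A", "ID-B", "ID-C", "ID-D"],
    "other_id_source": ["DepMap", "Sanger", "CCLE", "Sanger"],
})
experiments = pd.DataFrame({
    "improve_sample_id": [1, 2],
    "fit_auc": [0.9987, 0.5],
})

# Drop the two per-identifier columns, then collapse the repeated rows
# so that each improve_sample_id appears exactly once.
unique_samples = (
    samples.drop(columns=["other_id", "other_id_source"]).drop_duplicates()
)

# validate="many_to_one" makes pandas raise if the right-hand key is not unique.
one_to_one = experiments.merge(
    unique_samples, on="improve_sample_id", how="left", validate="many_to_one"
)

# Optional: keep every source for a sample as a list instead of separate rows.
sources_by_sample = samples.groupby("improve_sample_id")["other_id_source"].agg(list)
print(one_to_one)
print(sources_by_sample)
```

After deduplication the merge returns one row per experiment, which is the "single mapping" referred to above; the per-sample list of sources is one way to carry the other_id_source information along without multiplying rows.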

sgosline commented 6 months ago

This is by design.