MPNST Data issue - non-overlapping improve_sample_ids

PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

MPNST Data issue - non-overlapping improve_sample_ids #206

Closed by jjacobson95 2 months ago

jjacobson95 commented 2 months ago

I am currently blocked on part of the transfer learning analysis with the MPNST data. The train/test split fails because the input dataframe is empty, due to the following issue.

There are no overlapping improve_sample_ids between mpnst_transcriptomics and mpnst_experiments, i.e., no experiments/drugs map to transcriptomics.

This can be reproduced by pulling the latest data (0.1.40) and checking the intersection between these data types.

This issue does not occur in the other datasets.
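A minimal sketch of the intersection check described above, using only the Python standard library. The toy tables stand in for mpnst_transcriptomics.csv and mpnst_experiments.tsv. The ID values and every column name other than improve_sample_id are invented for illustration, and `sample_ids` is a hypothetical helper, not part of coderdata.

```python
import csv
import io


def sample_ids(handle, column="improve_sample_id", delimiter=","):
    """Collect the distinct non-empty values of one column from a delimited file."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row[column] for row in reader if row.get(column)}


# Toy stand-ins for the real files (assumed layout, invented values).
transcriptomics = io.StringIO(
    "entrez_id,transcriptomics,improve_sample_id\n"
    "1,0.5,5001\n"
    "2,1.2,5002\n"
)
experiments = io.StringIO(
    "source\timprove_sample_id\timprove_drug_id\n"
    "MPNST\t6001\tSMI_1\n"
    "MPNST\t6002\tSMI_2\n"
)

rna_ids = sample_ids(transcriptomics)
exp_ids = sample_ids(experiments, delimiter="\t")

# An empty intersection reproduces the symptom: nothing to join on.
shared = rna_ids & exp_ids
print(f"{len(shared)} shared improve_sample_ids")
```

With the real data, the two `io.StringIO` objects would be replaced by open file handles to the downloaded files.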

sgosline commented 2 months ago

Jeremy, are you able to fix this and push a new version?

sgosline commented 2 months ago

I can't repro it on my end so please go ahead and fix so you can unblock yourself.

jjacobson95 commented 2 months ago

Haven't worked on the MPNST dataset before so it might take a bit to track down, but yes, will do.

sgosline commented 2 months ago

OK. The other option is to wait until I can get to it, in which case you can hold off on charging until I fix it.

pip install coderdata
coderdata download --prefix mpnst
cat mpnst_samples.csv | cut -f 8 -d , | sort | uniq
gunzip mpnst_transcriptomics.csv.gz; cat mpnst_transcriptomics.csv | cut -f 3 -d , | sort | uniq

The samples overlap.

jjacobson95 commented 2 months ago

I can work on this or continue with the cross analysis for the other comparisons. I think/hope this was the only data related issue, but there are other bugs I'm working through with the transfer learning code.

Just let me know what I should prioritize.

I see no overlap here:

cat mpnst_transcriptomics.csv | cut -f 3 -d , |sort | uniq
cat mpnst_experiments.tsv | cut -f 2 | sort | uniq

sgosline commented 2 months ago

Yes. The transcriptomics was performed on the tumor samples and the drug experiments on the PDX-MT samples, so the two data types carry different improve_sample_ids. You have to match patient data by common name.
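One way to sketch the common-name matching, assuming the samples table has `common_name` and `improve_sample_id` columns. The column names, the `model_type` labels, and all values below are assumptions for illustration, and `ids_by_common_name` and `map_ids` are hypothetical helpers rather than coderdata functions.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for mpnst_samples.csv (assumed columns, invented values).
samples = io.StringIO(
    "common_name,model_type,improve_sample_id\n"
    "patient_A,tumor,5001\n"
    "patient_A,PDX-MT,6001\n"
    "patient_B,tumor,5002\n"
    "patient_B,PDX-MT,6002\n"
)


def ids_by_common_name(handle):
    """Group improve_sample_ids by the patient-level common_name."""
    groups = defaultdict(set)
    for row in csv.DictReader(handle):
        groups[row["common_name"]].add(row["improve_sample_id"])
    return groups


def map_ids(groups, source_ids, target_ids):
    """Map each source ID to the target IDs that share its common_name."""
    mapping = {}
    for ids in groups.values():
        targets = sorted(ids & target_ids)
        if not targets:
            continue  # this patient has no sample in the target data type
        for source_id in ids & source_ids:
            mapping[source_id] = targets
    return mapping


groups = ids_by_common_name(samples)
experiment_ids = {"6001", "6002"}        # IDs seen in the experiments file
transcriptomics_ids = {"5001", "5002"}   # IDs seen in the transcriptomics file

exp_to_rna = map_ids(groups, experiment_ids, transcriptomics_ids)
print(exp_to_rna)
```

The mapping keeps a list of targets per experiment sample because a patient may have more than one profiled tumor sample; how to pick among them is left open here.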

jjacobson95 commented 2 months ago

Ah, I see, so this is the correct behavior then. There is no information or code on how this was handled in the transfer learning pipeline, so I'll work on building this mapping into the code.