PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.
BSD 3-Clause "New" or "Revised" License

Inconsistency in copy_call values #174

Closed nkoussa closed 1 month ago

nkoussa commented 1 month ago

It seems like there are inconsistencies in the naming of the copy calls between datasets.

If I run:

    dataset = coderdata.join_datasets("broad_sanger", "beataml", "hcmi")
    df_long = dataset.copy_number
    df_long.copy_call.unique()

I get:

    array(['amp', 'deep del', 'gain', 'diploid', 'het loss', 'Neutral', 'Loss', 'Gain', 'Deletion', 'Amplification'], dtype=object)

Not a big deal to handle, but just wanted to flag that for you guys!

sgosline commented 1 month ago

This shouldn't pass validation; we'll take a look!

sgosline commented 1 month ago

Oohh, I remember this. The capitalized terms are from the Sanger Institute, and I wanted to retain them rather than lose them, but you're right, they are different. I will map as follows:

Amplification -> amp
Deletion -> deep del
Loss -> het loss
Gain -> gain
Neutral -> diploid

[Attached image: Screenshot 2024-05-10 at 7 54 21 AM]
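
For anyone who needs a workaround before the fix lands, the mapping proposed above can be sketched in a few lines of Python. This is only an illustration, not the code used in the coderdata package: the names `COPY_CALL_MAP` and `harmonize_copy_call` are hypothetical, and the `observed` list simply reuses the ten values reported earlier in this issue.

```python
# Mapping proposed in this thread: capitalized Sanger labels -> lowercase labels.
COPY_CALL_MAP = {
    "Amplification": "amp",
    "Deletion": "deep del",
    "Loss": "het loss",
    "Gain": "gain",
    "Neutral": "diploid",
}


def harmonize_copy_call(label: str) -> str:
    """Hypothetical helper: translate a capitalized label, leave others unchanged."""
    return COPY_CALL_MAP.get(label, label)


# The ten distinct copy_call values reported in the issue.
observed = [
    "amp", "deep del", "gain", "diploid", "het loss",
    "Neutral", "Loss", "Gain", "Deletion", "Amplification",
]

# After harmonizing, only the five lowercase labels should remain.
harmonized = sorted({harmonize_copy_call(value) for value in observed})
print(harmonized)  # ['amp', 'deep del', 'diploid', 'gain', 'het loss']
```

Assuming `df_long` is a pandas DataFrame as in the original report, something like `df_long["copy_call"] = df_long["copy_call"].replace(COPY_CALL_MAP)` should apply the same mapping to the whole column.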