PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

NCI60 has more than 60 cell lines in our current data #231

Closed jjacobson95 closed 1 month ago

jjacobson95 commented 1 month ago

Based on unique improve_sample_ids, our NCI60 experiments data has 92 unique samples / cell lines. As far as I am aware, the NCI60 should have 60.

Reproduce:

Download the data on the command line:

coderdata download --prefix broad_sanger
coderdata download --prefix genes

View the data in Python:

import coderdata as cd
bs = cd.DatasetLoader("broad_sanger")
# Unique sample IDs recorded for the NCI60 study
nci60_ids = bs.experiments[bs.experiments.study == "NCI60"].improve_sample_id.unique()
print(len(nci60_ids))  # print the count explicitly so it also shows up when run as a script
print(nci60_ids)

sgosline commented 1 month ago

There should be 71. One is a NaN; the rest are matching errors. I'd call this a low-priority bug at this point.
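For anyone re-checking the count, a minimal sketch of why the NaN inflates it by one: in pandas, `Series.unique()` keeps NaN as its own value, while `Series.nunique()` drops it by default. The DataFrame below is a made-up stand-in for `bs.experiments`, not real coderdata content.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bs.experiments; all values are invented for illustration.
experiments = pd.DataFrame({
    "study": ["NCI60", "NCI60", "NCI60", "NCI60", "CCLE"],
    "improve_sample_id": [101.0, 102.0, 102.0, np.nan, 103.0],
})

# Sample IDs for the NCI60 rows only
nci60 = experiments.loc[experiments["study"] == "NCI60", "improve_sample_id"]

count_with_nan = len(nci60.unique())  # unique() reports NaN as a distinct value
count_without_nan = nci60.nunique()   # nunique() excludes NaN by default
n_missing = int(nci60.isna().sum())   # number of rows with no sample ID

print(count_with_nan, count_without_nan, n_missing)
```

On the real data, the same three numbers would separate the missing ID from the genuine mismatches.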

jonesse3 commented 1 month ago

I agree. Since I've been looking at NCI-60, I observed this as well when looking at the October 2024 release. In the general comments here, they do mention other cell lines:

There are 60 cell lines in the current NCI60 cell line screen. There are 11 other cell lines that were part of the NCI60 screen in the past. These 71 cell lines comprise most of the public data. This data release also includes other cell lines which have been assayed at least once using the same protocols as the NCI60 cell line screen.

I refer to this table to get the 60 cell lines and the corresponding Cellosaurus IDs.

sgosline commented 1 month ago

I counted data for 163 cell lines in the October 2024 dataset; 82 of them have improve_sample_ids. I propose using those 82. It seems silly to throw out data when we have valid identifiers. People can filter later on if they want.
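Filtering later could look roughly like the sketch below. It assumes the user has their own set of `improve_sample_id` values for the panel they care about; `core_ids`, `restrict_to_panel`, and the toy DataFrame are hypothetical and not part of coderdata.

```python
import pandas as pd

# Toy stand-in for the NCI60 rows of bs.experiments; IDs are invented.
experiments = pd.DataFrame({
    "study": ["NCI60"] * 4,
    "improve_sample_id": [1, 2, 3, 4],
})

# Hypothetical set of sample IDs for the panel a user wants to keep
core_ids = {1, 3}

def restrict_to_panel(df, panel_ids):
    """Return only the rows whose improve_sample_id is in panel_ids."""
    return df[df["improve_sample_id"].isin(panel_ids)].reset_index(drop=True)

core_only = restrict_to_panel(experiments, core_ids)
print(core_only)
```

Keeping all rows with valid identifiers in the package and leaving this step to the user means no data is lost for people who want the wider set.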