Broad_Sanger Samples File has duplicate improve_sample_ids and maybe more issues

PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

BSD 3-Clause "New" or "Revised" License


Broad_Sanger Samples File has duplicate improve_sample_ids and maybe more issues #164

Closed jjacobson95 closed 1 week ago

jjacobson95 commented 1 month ago

Duplicate improve_sample_ids can be seen in the screenshot below. These should be unique ids.

Note that CVCL_1588 appears at least 4 times. Overall there are ~42,000 rows but only 904 unique improve_sample_ids. Previously, the count of unique samples in broad_sanger was 1,861.
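For anyone who wants to reproduce the count, here is a minimal sketch of one way to list the improve_sample_ids that appear on more than one row. The CSV text, the identifier values, and the helper name `duplicate_id_counts` are made up for illustration; only the column names come from this thread, and the real samples file may be laid out differently.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the samples file; values are invented for illustration.
SAMPLE_CSV = """improve_sample_id,other_id,other_name
1,ID_A,name_a
1,ID_A,name_a_alias
1,ID_A2,name_a
2,ID_B,name_b
"""


def duplicate_id_counts(csv_text, id_column="improve_sample_id"):
    """Return {id: row_count} for every id that appears on more than one row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[id_column] for row in reader)
    return {sample_id: n for sample_id, n in counts.items() if n > 1}


duplicates = duplicate_id_counts(SAMPLE_CSV)
print(duplicates)
```

With pandas, `DataFrame.duplicated` or `Series.value_counts` on the id column should give the same information.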

sgosline commented 1 month ago

You can run a distinct/unique on the output, but I'm guessing this is because the file lists every possible combination of cell line other_id and other_name. We might want to melt these two columns into one?
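A rough sketch of what the melt could look like, assuming the duplication really does come from crossing every other_id with every other_name for a sample. The identifier values and the helper name `melt_identifiers` are hypothetical; this is plain Python for illustration, not the coderdata build code.

```python
from itertools import product

# Hypothetical sample with 2 ids and 3 names; the wide file would hold all 6 combinations.
other_ids = ["ID_A", "ID_A2"]
other_names = ["name_a", "name_a_alias", "name_a_old"]
wide_rows = [
    {"improve_sample_id": "1", "other_id": oid, "other_name": name}
    for oid, name in product(other_ids, other_names)
]


def melt_identifiers(rows, id_column="improve_sample_id",
                     value_columns=("other_id", "other_name")):
    """Melt the value columns into (id, source_column, value) records, keeping each once."""
    seen = set()
    melted = []
    for row in rows:
        for column in value_columns:
            record = (row[id_column], column, row[column])
            if record not in seen:  # the "distinct" step
                seen.add(record)
                melted.append(record)
    return melted


long_rows = melt_identifiers(wide_rows)
print(len(wide_rows), "wide rows ->", len(long_rows), "long rows")
```

The long form grows with the sum of ids and names per sample rather than their product. In pandas the equivalent would be roughly `DataFrame.melt` followed by `DataFrame.drop_duplicates`.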

sgosline commented 3 weeks ago

I'm noticing from the HCMI samples that the data are limited because they weren't properly tidied.

Here is some more information on tidy data frames and the rationale behind them: https://r4ds.had.co.nz/tidy-data.html. Again, if there are duplicate rows then keep this open, otherwise please close.