Closed jjacobson95 closed 1 week ago
You can run a distinct/unique on the output, but I'm guessing this is because the file lists every possible combination of cell line other_id and other_name. We might want to melt these two columns into one?
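For reference, here is a rough sketch of what melting plus distinct could look like in pandas. The rows are invented for illustration, and the `id_source` and `other_value` column names are my own placeholders; only `improve_sample_id`, `other_id`, and `other_name` come from the file discussed here.

```python
import pandas as pd

# Toy rows, invented for illustration. Sample 1 has two ids and two names,
# so the wide layout repeats it once per (other_id, other_name) combination.
samples = pd.DataFrame({
    "improve_sample_id": [1, 1, 1, 1, 2],
    "other_id": ["CVCL_1588", "CVCL_1588", "ID_X", "ID_X", "ID_Y"],
    "other_name": ["NameA", "NameB", "NameA", "NameB", "NameC"],
})

# Melt the two identifier columns into a single column of values,
# recording which original column each value came from.
long = samples.melt(
    id_vars="improve_sample_id",
    value_vars=["other_id", "other_name"],
    var_name="id_source",       # placeholder name
    value_name="other_value",   # placeholder name
)

# Drop the repeats created by the cross product of ids and names.
tidy = long.drop_duplicates().reset_index(drop=True)

print(tidy)
```

After the melt, each sample contributes one row per distinct identifier or name rather than one row per combination, so the row count grows additively instead of multiplicatively.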
I'm noticing from the HCMI samples that the data are limited because they weren't properly tidied.
Here is some more information on tidy data frames and the rationale behind them: https://r4ds.had.co.nz/tidy-data.html. Again, if there are duplicate rows then keep this open, otherwise please close.
Duplicate improve_sample_ids can be seen in the screenshot below. These should be unique ids.
Please note that CVCL_1588 appears at least 4 times. Overall there are ~42,000 rows with 904 improve_sample_ids. Previously the count for unique samples in broad_sanger was 1,861.
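A quick way to reproduce this kind of count is sketched below. The data frame is a made-up stand-in rather than the real samples file; only the `improve_sample_id` and `other_id` column names are taken from this thread.

```python
import pandas as pd

# Made-up stand-in for the samples file: one id repeated four times.
samples = pd.DataFrame({
    "improve_sample_id": [1, 1, 1, 1, 2, 3],
    "other_id": ["CVCL_1588"] * 4 + ["ID_Y", "ID_Z"],
})

# Total rows versus distinct sample ids.
n_rows = len(samples)
n_unique = samples["improve_sample_id"].nunique()

# Ids that occur on more than one row, with their row counts.
rows_per_id = samples.groupby("improve_sample_id").size()
repeated = rows_per_id[rows_per_id > 1]

print(f"{n_rows} rows, {n_unique} unique improve_sample_ids")
print(repeated)
```

If `n_rows` and `n_unique` differ, `repeated` lists the ids that need a closer look.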