UCLouvain-CBIO / depmap

Cancer Dependency Map package
https://uclouvain-cbio.github.io/depmap/

ignore.case=TRUE leads to incorrect 20Q4 copyNumber data #61

Closed by allisonvuong 3 years ago

allisonvuong commented 3 years ago

Hi,

I think there may be a bug in the copy number data for the 20Q4 release (Bioc-devel depmap_1.5.1). It looks like one of the tags on the depmap ExperimentHub records is CopyNumberVariationData, for example:

> tbl <- AnnotationHub:::.db_index_load(ExperimentHub::ExperimentHub())
> tbl[3413]
    EH3964
"metadata_20Q4\rBroad Institute\rHomo sapiens\r9606\r\rMetadata for cell lines in the 20Q4 DepMap release, for 0 genes, 1812 cell lines, 35 primary diseases and 39 lineages.\r1\rTheo Killian <theodore.killian@kuleuven.vib.be>\r2020-11-25\rdepmap\rc(\"ExperimentHub\", \"ExperimentData\", \"ReproducibleResearch\", \"RepositoryData\", \"AssayDomainData\", \"CopyNumberVariationData\", \"DiseaseModel\", \"CancerData\", \"BreastCancerData\", \"ColonCancerData\", \"KidneyCancerData\", \"LeukemiaCancerData\", \"LungCancerData\", \"OvarianCancerData\", \"ProstateCancerData\", \"OrganismData\", \"Homo_sapiens_Data\", \"PackageTypeData\", \"SpecimenSource\", \"CellCulture\", \"Genome\", \"Proteome\", \"StemCell\", \"Tissue\")\rtibble\rdepmap/metadata_20Q4.rda\rhttps://ndownloader.figshare.com/files/25494443\rCSV"

Thus, when depmap_data_loading calls AnnotationHub::query, which uses grepl to search for copyNumber, the search is case-insensitive, so it matches the CopyNumberVariationData tag and picks up all depmap entries. Because the last result happens to be the metadata, calling depmap::depmap_copyNumber() returns the metadata to the user instead of the copy number data.
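The matching behaviour can be reproduced offline with base R alone. This is a minimal sketch, not the package's actual code: the two index strings and their names (EH_copyNumber, EH_metadata) are hypothetical, abbreviated stand-ins for real index entries, and taking the last hit with tail() is only an assumption meant to mirror the "last result" behaviour described above.

```r
# Hypothetical, abbreviated stand-ins for two ExperimentHub index entries.
# Both carry the shared tag "CopyNumberVariationData"; only the first one
# contains the lowercase-initial string "copyNumber".
index <- c(
  EH_copyNumber = "copyNumber_20Q4 ... depmap ... CopyNumberVariationData",
  EH_metadata   = "metadata_20Q4 ... depmap ... CopyNumberVariationData"
)

pattern <- "copyNumber"

# Case-insensitive search: "copyNumber" also matches the "CopyNumber..." tag,
# so every entry is a hit.
hits_insensitive <- grepl(pattern, index, ignore.case = TRUE)

# Case-sensitive search: only the entry that really contains "copyNumber".
hits_sensitive <- grepl(pattern, index, ignore.case = FALSE)

# Assumption for illustration: the loader keeps the last matching record.
last_insensitive <- tail(names(index)[hits_insensitive], 1)
last_sensitive   <- tail(names(index)[hits_sensitive], 1)

print(last_insensitive)
print(last_sensitive)
```

With the case-insensitive search, the last hit is the metadata stand-in; with the case-sensitive search, it is the copy number stand-in.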

MRE:

name <- "copyNumber"
eh <- ExperimentHub::ExperimentHub()
eh1 <- AnnotationHub::query(eh, c("depmap", name), ignore.case=TRUE) # Default ignore.case=TRUE; 48 records
eh2 <- AnnotationHub::query(eh, c("depmap", name), ignore.case=FALSE) # 8 records
depmap::depmap_copyNumber() # Returns the metadata instead of the copy number data

Best, Allison

allisonvuong commented 3 years ago

Hi @lgatto, @tfkillian what do you guys think about PR #62 as a solution to this bug?

tfkillian commented 3 years ago

Indeed, Allison, you are correct. I will have to modify the data loading function to prevent this.

allisonvuong commented 3 years ago

Ok, great! Please also see the mutationCalls bug as indicated in Pull Request #62.

tfkillian commented 3 years ago

For the time being, while I work on a fix for the loading functions, you can load specific datasets by their EH number, like so:

library(ExperimentHub)
eh <- ExperimentHub()
query(eh, "depmap")
copyNumber <- eh[["EH3961"]]

The EH numbers for the latest datasets won't change for at least another 6 weeks, until the next DepMap update in late February. I have also attached a list of datasets with their EH numbers.

depmap_datasets_list.xlsx