bhklab / CoreGx

Shared code for both PharmacoGx and RadioGx
https://bhklab.github.io/CoreGx/
GNU General Public License v3.0

`subset,LongTable-method` corrupts referential integrity of summary assays #149

Closed ChristopherEeles closed 2 years ago

ChristopherEeles commented 2 years ago

Minimal reprex:

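library(CoreGx)

# Load the example dataset
data(nci_TRE_small)
# Add a summary assay: mean viability per drug combination and cell line
nci_TRE_small$sens_summary <- nci_TRE_small |>
    aggregate("sensitivity", mean(viability), by=c("drug1id", "drug2id", "cellid"))
# Subset to the first five drugs, then access the summary assay
sub_nci <- subset(nci_TRE_small, drug1id %in% unique(drug1id)[1:5])
sub_nci$sens_summary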
## Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  :
##   Join results in 3382 rows; more than 2844 = nrow(x)+nrow(i). Check for duplicate key values in i
##   each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j
##   for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with
##   allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow
##   and data.table issue tracker for advice.

This seems to be caused by a mistake when reindexing the table after the subset: the index still contains assayIDs for sens_summary on rows of sensitivity that have no values.
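As background on the error message itself, here is a minimal sketch in plain data.table rather than CoreGx internals. The tables `index` and `summary_assay` and their columns are hypothetical stand-ins. By default, data.table refuses a join when duplicated key values in `i` would produce more rows than nrow(x) + nrow(i); passing allow.cartesian=TRUE opts in explicitly.

```r
library(data.table)

# Hypothetical stand-ins for an index and a summary assay; not CoreGx's tables.
index <- data.table(assayKey = rep(1:2, each = 3), rowKey = 1:6)
summary_assay <- data.table(
    assayKey = rep(1:2, each = 3),
    mean_viability = rep(c(0.9, 0.5), each = 3)
)

# Each table has 6 rows, but every duplicated key in `i` matches 3 rows in `x`,
# so the join would return 18 rows, more than nrow(x) + nrow(i) = 12.
joined <- tryCatch(
    index[summary_assay, on = "assayKey"],
    error = function(e) e
)
inherits(joined, "error")

# Opting in returns the full many-to-many result instead of an error.
cartesian <- index[summary_assay, on = "assayKey", allow.cartesian = TRUE]
nrow(cartesian)
```

In the reprex, hitting this guard suggests that keys which should be unique after the subset are duplicated, which is consistent with an indexing problem rather than a genuinely intended many-to-many join.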

ChristopherEeles commented 2 years ago

I have confirmed that the issue occurs in `reindex,LongTable-method`, since the same subset works when reindexing is skipped with reindex=FALSE:

data(nci_TRE_small)
nci_TRE_small$sens_summary <- nci_TRE_small |>
    aggregate("sensitivity", mean(viability), by=c("drug1id", "drug2id", "cellid"))
sub_nci <- subset(nci_TRE_small, drug1id %in% unique(drug1id)[1:5], reindex=FALSE)
sub_nci$sens_summary

ChristopherEeles commented 2 years ago

Furthermore, the root cause appears to be the assignment of summary assays with `assay<-,LongTable-method`, which currently adds assayKey values to the assayIndex even where the assay being summarized over has NA values. This is problematic because NAs are stored for all row/colKey combinations.
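To illustrate the failure mode described here, a toy sketch in plain data.table follows. The table layout and the names `assay_index` and `summary_keys` are hypothetical; this does not reproduce CoreGx's internals or the eventual fix. It only shows that assigning a summary key to every row/colKey combination also tags combinations where the source assay is NA, while guarding on the source key avoids that.

```r
library(data.table)

# Toy index: one row per rowKey/colKey combination. `sensitivity` is NA where
# no measurement exists. Hypothetical layout, not CoreGx's assayIndex.
assay_index <- data.table(
    rowKey = c(1L, 1L, 2L, 2L),
    colKey = c(1L, 2L, 1L, 2L),
    sensitivity = c(1L, NA, 2L, 3L)
)

# Toy summary keyed by rowKey only: one summary row per rowKey.
summary_keys <- data.table(rowKey = c(1L, 2L), sens_summary = c(1L, 2L))

# Naive assignment: every row/colKey combination receives a summary key,
# including the one with no sensitivity value.
naive <- merge(assay_index, summary_keys, by = "rowKey")

# Guarded assignment: drop the summary key where the source assay is NA.
guarded <- copy(naive)[is.na(sensitivity), sens_summary := NA_integer_]

# Count summary keys that point at rows with no source data.
dangling_naive <- naive[is.na(sensitivity) & !is.na(sens_summary), .N]
dangling_guarded <- guarded[is.na(sensitivity) & !is.na(sens_summary), .N]
```

Here `dangling_naive` counts index entries that reference the summary without any underlying sensitivity value, which is the kind of entry a later subset and reindex could mishandle.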