bhklab / PharmacoGx

R package to analyze large-scale pharmacogenomic datasets.
http://pharmacodb.pmgenomics.ca
GNU General Public License v3.0
66 stars 26 forks source link

Problem with intersectPSets #96

Open bhaibeka opened 3 years ago

bhaibeka commented 3 years ago

PharmacoGx version 2.4.0

CCLE vs GDSC1 (similar to Haibe-Kains Nature 2013)

ccle <- PharmacoGx::downloadPSet("CCLE_2015") gdsc1 <- PharmacoGx::downloadPSet("GDSC_2020(v1-8.2)")

ccle.gdsc1 <- PharmacoGx::intersectPSet(pSets=list("CCLE"=ccle, "GDSC1"=gdsc1))

returns

Intersecting large PSets may take a long time ... Error in which.min(abs(as.numeric(doses[[i]]) - min.dose)):max(which(abs(as.numeric(doses[[i]]) - : argument of length 0 In addition: Warning messages: 1: In min(as.numeric(doses[[i]]), na.rm = TRUE) : no non-missing arguments to min; returning Inf 2: In max(as.numeric(doses[[i]]), na.rm = TRUE) : no non-missing arguments to max; returning -Inf 3: In min(abs(as.numeric(doses[[i]]) - max.dose), na.rm = TRUE) : no non-missing arguments to min; returning Inf 4: In max(which(abs(as.numeric(doses[[i]]) - max.dose) == min(abs(as.numeric(doses[[i]]) - : no non-missing arguments to max; returning -Inf

bhaibeka commented 3 years ago

looks like the intersection on concentrations causes the error

ChristopherEeles commented 3 years ago

I am currently working on updating this method. Expect a fix shortly.

Best, Chris

p-smirnov commented 3 years ago

@bhaibeka I agree that it is a bug, in that it should fail more gracefully than this. The method fundamentally does not work on datasets with replicates, since we need 1-1 mapping of experiments across studies.

The only way I can think of salvaging is to first "collapse replicates", and then allow intersection. What are your thoughts on this?

bhaibeka commented 3 years ago

I would collapse the replicate and add a clear warning to inform the users

ChristopherEeles commented 3 years ago

So with the introduction of the LongTable as the class which stores sensitivity profiles, I hadn't intended to subset on dose (concentration) at all in intersect-methods. They would instead subset on features, samples and compounds (at least at the PSet level). The user could then extract the LongTable and select whichever dose they like with the very straightforward subset methods.

I can think of lots of situations where the dose is non even comparable. For example, if two PSets use different dose units. If the magnitude of those units differs (i.e., micromolar vs molar). Do we have standards in place around dose formatting?

Do you think preserving the intersection on dose is important despite these limitations? Or is it fine to have that as a secondary operation done by the user on the LongTable.

From a design perspective, I think the problem with a lot of our methods is that they try to do everything. Programming best practices indicate a function should do one operation and users are responsible for composing them into a pipeline. This is even easier now in R 4.1 with the base R pipe |>.

ChristopherEeles commented 3 years ago

So what I imagine would look like: intersect(PSet1, PSet) |> subsetLongTable(.(dose1 == dose2, , by=.(drug1id, drug2id)), )

p-smirnov commented 3 years ago

Im not exactly sure I understand what this subsetLongTable function is doing in this case.

I am 100% for making the collapsing of replicates within a PSet and the intersection of PSets into two (or more) different steps, because I think their is a lot of choices on how to collapse replicates that are possible

ChristopherEeles commented 3 years ago

In this case, subsetLongTable is a wrapper function around the subset method for a LongTable that works on a PSet. So you can subset a LongTable inside a PSet without extracting it.

If you mean the groupby, I'm selecting instances where the doses are equal for each drug combination.

ChristopherEeles commented 3 years ago
setMethod('subsetLongTable', signature(x='PSet'), function(x, subset, select) { 
    LT <- subset(sensitivitySlot(x), substitute(subset), substitute(select))
    sensitivitySlot(x) <- LT
    harmonize(x) # drop drugs from the rest of the PSet if they are lost in the subset
    return(x)
} 

Or something like that.

ChristopherEeles commented 3 years ago

As for collapsing replicates. There is already a method like that for MutliAssayExperiment. For LongTable we would do it through an assay aggregate and let the backpressure take care of the rest (i.e., when you subset an assay, if you lose a drug or sample, it is automatically dropped from the LongTable row- or colData.