Closed ChristopherEeles closed 5 months ago
More context:
The data.frame I am using as input:
> str(mathew_griner)
Classes ‘data.table’ and 'data.frame': 16776 obs. of 27 variables:
$ Barcode : int 1 1 1 1 1 1 1 1 1 1 ...
$ Col : int 1 2 3 4 5 6 1 2 3 4 ...
$ Row : int 1 1 1 1 1 1 2 2 2 2 ...
$ viability : num 14.5 28.7 48.7 87.8 95.5 ...
$ Replicate : int 1 1 1 1 1 1 1 1 1 1 ...
$ Concentration : num 1250 1250 1250 1250 1250 ...
$ Concentration_2 : num 25 6.25 1.5625 0.3906 0.0977 ...
$ Plate : int 241 241 241 241 241 241 241 241 241 241 ...
$ Size : int 6 6 6 6 6 6 6 6 6 6 ...
$ QCScore : chr "null" "null" "null" "null" ...
$ RowSid : chr "NCGC00181170-01" "NCGC00181170-01" "NCGC00181170-01" "NCGC00181170-01" ...
$ DrugName : chr "Bendamustine" "Bendamustine" "Bendamustine" "Bendamustine" ...
$ drug_moa : chr "Antimetabolite" "Antimetabolite" "Antimetabolite" "Antimetabolite" ...
$ ColSid : chr "NCGC00187912" "NCGC00187912" "NCGC00187912" "NCGC00187912" ...
$ DrugName_2 : chr "PCI-32765" "PCI-32765" "PCI-32765" "PCI-32765" ...
$ drug_moa_2 : chr "Btk/Lck/Lyn inhibitor" "Btk/Lck/Lyn inhibitor" "Btk/Lck/Lyn inhibitor" "Btk/Lck/Lyn inhibitor" ...
$ RowIC50 : int 0 0 0 0 0 0 0 0 0 0 ...
$ ColIC50 : int 0 0 0 0 0 0 0 0 0 0 ...
$ RowConcUnit : chr "uM" "uM" "uM" "uM" ...
$ ColConcUnit : chr "uM" "uM" "uM" "uM" ...
$ Gnumber : chr "1" "1" "1" "1" ...
$ Gnumber_2 : chr "1" "1" "1" "1" ...
$ Duration : int 72 72 72 72 72 72 72 72 72 72 ...
$ CellLineName : chr "TMD8" "TMD8" "TMD8" "TMD8" ...
$ clid : chr "1" "1" "1" "1" ...
$ Tissue : chr "Lymphoma" "Lymphoma" "Lymphoma" "Lymphoma" ...
$ ReferenceDivisionTime: int 24 24 24 24 24 24 24 24 24 24 ...
> head(mathew_griner)
Barcode Col Row viability Replicate Concentration Concentration_2 Plate Size
1: 1 1 1 14.47499 1 1250 25.0000 241 6
2: 1 2 1 28.67605 1 1250 6.2500 241 6
3: 1 3 1 48.73285 1 1250 1.5625 241 6
4: 1 4 1 87.83152 1 1250 0.3906 241 6
5: 1 5 1 95.49822 1 1250 0.0977 241 6
6: 1 6 1 107.08472 1 1250 0.0000 241 6
QCScore RowSid DrugName drug_moa ColSid DrugName_2
1: null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912 PCI-32765
2: null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912 PCI-32765
3: null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912 PCI-32765
4: null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912 PCI-32765
5: null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912 PCI-32765
6: null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912 PCI-32765
drug_moa_2 RowIC50 ColIC50 RowConcUnit ColConcUnit Gnumber
1: Btk/Lck/Lyn inhibitor 0 0 uM uM 1
2: Btk/Lck/Lyn inhibitor 0 0 uM uM 1
3: Btk/Lck/Lyn inhibitor 0 0 uM uM 1
4: Btk/Lck/Lyn inhibitor 0 0 uM uM 1
5: Btk/Lck/Lyn inhibitor 0 0 uM uM 1
6: Btk/Lck/Lyn inhibitor 0 0 uM uM 1
Gnumber_2 Duration CellLineName clid Tissue ReferenceDivisionTime
1: 1 72 TMD8 1 Lymphoma 24
2: 1 72 TMD8 1 Lymphoma 24
3: 1 72 TMD8 1 Lymphoma 24
4: 1 72 TMD8 1 Lymphoma 24
5: 1 72 TMD8 1 Lymphoma 24
6: 1 72 TMD8 1 Lymphoma 24
@ChristopherEeles, can you grant access to https://github.com/bhklab/gDR-integration for @gladkia and @bczech? I bet it's the data format issue but we will have a look :).
Hi @ChristopherEeles, the best way is to use the pipeline function instead of running each helper function separately. Please try running the `runDrugResponseProcessingPipeline` function.
Hi @gladkia,
I have invited you and @bczech to the Genentech team within our organization. You should have access to the gDR-integration repo now.
@bczech, I will try running end-to-end now and let you know if it helps.
Best, Chris
Hi @ChristopherEeles ,
Did you have a chance to check the solution by using `runDrugResponseProcessingPipeline`?
Best, Bartek
Hi @bczech,
I was able to get a drug combination `SummarizedExperiment` produced via `runDrugResponseProcessingPipeline`. However, all the curves were constant fit, which leads me to believe I have not mapped the identifiers correctly.
If you could provide a conceptual description of your data model, that would be very helpful. Specifically, I would like to know the exact definition of a bar-code, since we will likely never have these identifiers in the published drug combination studies we are analyzing with PharmacoGx.
For example, I imagine it is a primary key used to uniquely identify some subset of more concrete identifiers (maybe drugs + concentrations + sample, or drugs + concentrations + replicates, or maybe you have some well/plate information in there as well).
With this information I can use a group by over those columns in our data sets to generate a bar-code even when it is not present.
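The group-by idea above can be sketched in base R. This is only an illustration on made-up rows: the choice of key columns (`CellLineName`, `Plate`, `Replicate`) is an assumption about what identifies one physical plate, not the gDR definition, and the `wells` table is hypothetical.

```r
# Toy stand-in for a published combination data set that has no barcodes.
# Column names mirror the str() output in this thread; the values are made up.
wells <- data.frame(
  CellLineName = c("TMD8", "TMD8", "TMD8", "TMD8"),
  Plate        = c(241L, 241L, 242L, 242L),
  Replicate    = c(1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# ASSUMPTION: these columns together identify one physical plate.
plate_cols <- c("CellLineName", "Plate", "Replicate")

# Build one key string per row, then number the distinct keys
# in order of first appearance to get a surrogate Barcode.
plate_key <- do.call(paste, c(wells[plate_cols], sep = "|"))
wells$Barcode <- match(plate_key, unique(plate_key))

print(wells)
```

With a data.table, grouping on the same columns and assigning `.GRP` would give the same numbering. Whichever columns are chosen, the surrogate is only as meaningful as the assumption that they separate plates.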
I also don't think it's ideal to couple object creation with data modelling and would much prefer if I could use your helper functions to convert a `PharmacoSet` to a gDR `SummarizedExperiment` without applying curve fits, etc. Towards this end, I need to have a deeper understanding of your conceptual data model for the object (data schema).
I am planning to dig into the `drugDoseResponsePipeline` function to gain a deeper understanding next week and will reverse engineer it if necessary, but any additional information you can provide would be appreciated.
Best, Chris
Also, if you have a less trivial example data set than the one in the vignette that could also be helpful. I imagine you have some additional data for testing/validating your methods internally?
Hey @ChristopherEeles. The Barcode is the unique ID of the plate used in the HTS experiment in the EnVision machine. We use it to distinguish plates.
The input data we use is a set of three file types: manifest, template, and raw data. Here you can find some example (synthetic) data: https://github.com/gdrplatform/gDRtestData/tree/master/gDRtestData/inst/testdata/raw_synthetic_data/small
As the first step, we use the `gDRimport` package to import the data, and then we use `gDRcore` for processing them.
Regarding the documentation of our data model, @MarcHafner will send you an e-mail with the docs.
Best, Bartek
Hi @bczech,
Thanks for getting back to me. I received the email from Marc and will dig into the gDR code again starting tomorrow morning.
Best, Chris
Hi @ChristopherEeles ,
Have you managed to process the data using the pipeline function?
Best, Bartek
Hi gDR team,
I was able to fit the Mathews Griner drug combination dataset (6 x 6) into a gDR `SummarizedExperiment` and can run the pipeline up to curve fitting. However, when I try to fit curves I get the error from the issue title.
Debugging through the code, it appears that for this data set the nested confounders include both `Concentration` and `Concentration_2`. This is not the case for the gDR example data (from the vignette), where all the experiments have a fixed dose of drug 2 (i.e., `Concentration_2` is constant).
It seems like this case has been purposefully intercepted in the `fit_SE` function. I am curious whether full drug combination data sets are indeed not supported, or whether I have violated an assumption about the data format when creating the object.
For context about the data:
- `Concentration == 0` inside `create_SE`
- `DrugName` and `drug_moa` for both drugs to "untreated" when both concentrations are zero
Any help clarifying how the experimental design fits into your object/data model would be helpful.
Best, Chris
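The difference between the two designs described above (a full dose matrix versus a fixed dose of drug 2) can be checked directly on the long-format table before building the object. The sketch below uses made-up rows and hypothetical names (`combo`, `n_nonzero`, `design`). It makes no claim about what `fit_SE` actually tests; it only counts distinct non-zero doses per drug pair.

```r
# Toy data: pair A + B is a full 3 x 3 matrix, pair C + D keeps drug 2 at one dose.
# Values are made up; column names follow the str() output in this thread.
combo <- rbind(
  data.frame(DrugName = "A", DrugName_2 = "B",
             expand.grid(Concentration = c(0, 1, 10),
                         Concentration_2 = c(0, 0.5, 5)),
             stringsAsFactors = FALSE),
  data.frame(DrugName = "C", DrugName_2 = "D",
             Concentration = c(0, 1, 10), Concentration_2 = 2,
             stringsAsFactors = FALSE)
)

# Number of distinct non-zero doses in a vector.
n_nonzero <- function(x) length(unique(x[x > 0]))

# Count doses of each drug within every drug pair.
pair <- paste(combo$DrugName, combo$DrugName_2, sep = " + ")
n_conc_1 <- tapply(combo$Concentration, pair, n_nonzero)
n_conc_2 <- tapply(combo$Concentration_2, pair, n_nonzero)

design <- data.frame(
  pair     = names(n_conc_1),
  n_conc_1 = as.vector(n_conc_1),
  n_conc_2 = as.vector(n_conc_2)
)

# ASSUMPTION: a "full matrix" pair is one where both drugs vary over several doses.
design$full_matrix <- design$n_conc_1 > 1 & design$n_conc_2 > 1

print(design)
```

Run on the real table, a summary like this would show which drug pairs vary both concentrations and so fall into the case that the example data never exercises.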
Additional context
My code for downloading, preprocessing and modelling this data set is available here (the repo is private, but I already invited Marc and Allison; let me know who else needs access): https://github.com/bhklab/gDR-integration
References:
Mathews Griner, L. A., Guha, R., Shinn, P., Young, R. M., Keller, J. M., Liu, D., Goldlust, I. S., Yasgar, A., McKnight, C., Boxer, M. B., Duveau, D. Y., Jiang, J.-K., Michael, S., Mierzwa, T., Huang, W., Walsh, M. J., Mott, B. T., Patel, P., Leister, W., … Thomas, C. J. (2014). High-throughput combinatorial screening identifies drugs that cooperate with ibrutinib to kill activated B-cell–like diffuse large B-cell lymphoma cells. Proceedings of the National Academy of Sciences, 111(6), 2349–2354. https://doi.org/10.1073/pnas.1311846111 (The data set used in the original `synergyfinder` publication)