"gDR does not yet support multiple series_identifiers, feature coming soon"

ChristopherEeles commented 1 year ago

Hi gDR team,

I was able to fit the Mathews Griner drug combination dataset (6 x 6) into a gDR SummarizedExperiment and can run the pipeline up to curve fitting.

However, when I try to fit curves I get the error from the issue title:

se <- create_and_normalize_SE(mathew_griner, nested_confounder="Barcode", readout="viability")
se_avg <- average_SE(se)
fit_SE(se_avg)
## Error in { : 
##  task 1 failed - "gDR does not yet support multiple series_identifiers, feature coming soon"

Debugging through the code, it appears that for this data set the nested confounders include both Concentration and Concentration_2. This is not the case for the gDR example data (from the vignette), where all the experiments have a fixed dose of drug 2 (i.e., Concentration_2 is constant).
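A quick way to confirm which dose columns vary within a block, independent of gDR, is to count distinct values per block. This is only a minimal base R sketch on toy data that reuses the column names from this issue; the helper `count_varying_doses` is hypothetical and is not part of gDR.

```r
# Toy data: block 1 is a full 3 x 3 combination matrix,
# block 2 has a fixed dose of drug 2 (a single dose series).
toy <- rbind(
  data.frame(Barcode = 1L, expand.grid(Concentration = c(10, 1, 0),
                                        Concentration_2 = c(5, 0.5, 0))),
  data.frame(Barcode = 2L, Concentration = c(10, 1, 0), Concentration_2 = 5)
)

# Hypothetical helper: number of distinct values of each dose column per block.
count_varying_doses <- function(df, by = "Barcode",
                                dose_cols = c("Concentration", "Concentration_2")) {
  n_unique <- function(x) length(unique(x))
  aggregate(df[dose_cols], by = df[by], FUN = n_unique)
}

dose_counts <- count_varying_doses(toy)

# A block in which more than one dose column takes several values is a
# combination matrix rather than a single dose series.
dose_cols <- c("Concentration", "Concentration_2")
dose_counts$is_matrix <- unname(rowSums(dose_counts[dose_cols] > 1) > 1)

print(dose_counts)
```

In this toy example only block 1 is flagged, which mirrors the difference between the 6 x 6 data and the fixed-dose vignette data described above.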

It seems like this case has been purposefully intercepted in the fit_SE function. Are full drug combination data sets indeed not supported yet, or have I violated an assumption about the data format when creating the object?

For context about the data:

The experiment is a 6 x 6 drug combination design, where each drug undergoes 4-fold dilutions from high to low over the first 5 steps and the final step is a concentration of 0.
- As a result, the mono-therapy dose series are the 6th row and column, and the viability at index (6, 6) is the untreated control for each 6 x 6 block
- The viability values are already adjusted for time zero
The controls in this data set are matched to each drug pair, but it seems that the untreated controls and mono-therapy dose series are not identified by Concentration == 0 inside create_SE.
- Based on the example data, I set the DrugName and drug_moa for both drugs to "untreated" when both concentrations are zero
I am not clear on what "Barcode" should represent conceptually in your data model.
- I have mapped it to the Block ID for each 6 x 6 experiment
- If you could clarify how this column should be created (as it will be absent from all our data), it would be appreciated
- For example, I assume it should be a "key" for some unique combination of other identifiers in the data, but I am unclear on which ones
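The block layout described above can be sketched in base R as follows. This is an illustration under assumptions, not code from the repo: the top doses (1250 and 25) are taken from the data printed later in this thread, and the helper `dilution_series` and the `treatment` column are hypothetical names.

```r
# Five 4-fold dilutions starting from the top dose, then a zero-dose step.
dilution_series <- function(top, fold = 4, n_steps = 5) {
  c(top / fold^(seq_len(n_steps) - 1), 0)
}

conc_1 <- dilution_series(1250)  # drug 1, indexed by Row
conc_2 <- dilution_series(25)    # drug 2, indexed by Col

# Full 6 x 6 block: every pairing of the two dose series.
block <- expand.grid(Col = 1:6, Row = 1:6)
block$Concentration   <- conc_1[block$Row]
block$Concentration_2 <- conc_2[block$Col]

# Classify each well by which drugs are present.
block$treatment <- with(block, ifelse(
  Concentration == 0 & Concentration_2 == 0, "untreated",
  ifelse(Concentration == 0 | Concentration_2 == 0, "monotherapy", "combination")
))

print(table(block$treatment))
```

Each block then contains 25 combination wells, 10 mono-therapy wells (the 6th row and column without the corner), and one untreated well at (6, 6).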

Any help clarifying how the experimental design fits into your object/data model would be helpful.

Best, Chris

Additional context

My code for downloading, preprocessing and modelling this data set is available here (the repo is private, but I already invited Marc and Allison; let me know who else needs access): https://github.com/bhklab/gDR-integration

References:

Mathews Griner, L. A., Guha, R., Shinn, P., Young, R. M., Keller, J. M., Liu, D., Goldlust, I. S., Yasgar, A., McKnight, C., Boxer, M. B., Duveau, D. Y., Jiang, J.-K., Michael, S., Mierzwa, T., Huang, W., Walsh, M. J., Mott, B. T., Patel, P., Leister, W., … Thomas, C. J. (2014). High-throughput combinatorial screening identifies drugs that cooperate with ibrutinib to kill activated B-cell–like diffuse large B-cell lymphoma cells. Proceedings of the National Academy of Sciences, 111(6), 2349–2354. https://doi.org/10.1073/pnas.1311846111 (The data set used in the original synergyfinder publication)

ChristopherEeles commented 1 year ago

More context:

The data.frame I am using as input:

> str(mathew_griner)
Classes ‘data.table’ and 'data.frame':  16776 obs. of  27 variables:
 $ Barcode              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Col                  : int  1 2 3 4 5 6 1 2 3 4 ...
 $ Row                  : int  1 1 1 1 1 1 2 2 2 2 ...
 $ viability            : num  14.5 28.7 48.7 87.8 95.5 ...
 $ Replicate            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Concentration        : num  1250 1250 1250 1250 1250 ...
 $ Concentration_2      : num  25 6.25 1.5625 0.3906 0.0977 ...
 $ Plate                : int  241 241 241 241 241 241 241 241 241 241 ...
 $ Size                 : int  6 6 6 6 6 6 6 6 6 6 ...
 $ QCScore              : chr  "null" "null" "null" "null" ...
 $ RowSid               : chr  "NCGC00181170-01" "NCGC00181170-01" "NCGC00181170-01" "NCGC00181170-01" ...
 $ DrugName             : chr  "Bendamustine" "Bendamustine" "Bendamustine" "Bendamustine" ...
 $ drug_moa             : chr  "Antimetabolite" "Antimetabolite" "Antimetabolite" "Antimetabolite" ...
 $ ColSid               : chr  "NCGC00187912" "NCGC00187912" "NCGC00187912" "NCGC00187912" ...
 $ DrugName_2           : chr  "PCI-32765" "PCI-32765" "PCI-32765" "PCI-32765" ...
 $ drug_moa_2           : chr  "Btk/Lck/Lyn inhibitor" "Btk/Lck/Lyn inhibitor" "Btk/Lck/Lyn inhibitor" "Btk/Lck/Lyn inhibitor" ...
 $ RowIC50              : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ColIC50              : int  0 0 0 0 0 0 0 0 0 0 ...
 $ RowConcUnit          : chr  "uM" "uM" "uM" "uM" ...
 $ ColConcUnit          : chr  "uM" "uM" "uM" "uM" ...
 $ Gnumber              : chr  "1" "1" "1" "1" ...
 $ Gnumber_2            : chr  "1" "1" "1" "1" ...
 $ Duration             : int  72 72 72 72 72 72 72 72 72 72 ...
 $ CellLineName         : chr  "TMD8" "TMD8" "TMD8" "TMD8" ...
 $ clid                 : chr  "1" "1" "1" "1" ...
 $ Tissue               : chr  "Lymphoma" "Lymphoma" "Lymphoma" "Lymphoma" ...
 $ ReferenceDivisionTime: int  24 24 24 24 24 24 24 24 24 24 ...

> head(mathew_griner)
   Barcode Col Row viability Replicate Concentration Concentration_2 Plate Size
1:       1   1   1  14.47499         1          1250         25.0000   241    6
2:       1   2   1  28.67605         1          1250          6.2500   241    6
3:       1   3   1  48.73285         1          1250          1.5625   241    6
4:       1   4   1  87.83152         1          1250          0.3906   241    6
5:       1   5   1  95.49822         1          1250          0.0977   241    6
6:       1   6   1 107.08472         1          1250          0.0000   241    6
   QCScore          RowSid     DrugName       drug_moa       ColSid DrugName_2
1:    null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912  PCI-32765
2:    null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912  PCI-32765
3:    null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912  PCI-32765
4:    null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912  PCI-32765
5:    null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912  PCI-32765
6:    null NCGC00181170-01 Bendamustine Antimetabolite NCGC00187912  PCI-32765
              drug_moa_2 RowIC50 ColIC50 RowConcUnit ColConcUnit Gnumber
1: Btk/Lck/Lyn inhibitor       0       0          uM          uM       1
2: Btk/Lck/Lyn inhibitor       0       0          uM          uM       1
3: Btk/Lck/Lyn inhibitor       0       0          uM          uM       1
4: Btk/Lck/Lyn inhibitor       0       0          uM          uM       1
5: Btk/Lck/Lyn inhibitor       0       0          uM          uM       1
6: Btk/Lck/Lyn inhibitor       0       0          uM          uM       1
   Gnumber_2 Duration CellLineName clid   Tissue ReferenceDivisionTime
1:         1       72         TMD8    1 Lymphoma                    24
2:         1       72         TMD8    1 Lymphoma                    24
3:         1       72         TMD8    1 Lymphoma                    24
4:         1       72         TMD8    1 Lymphoma                    24
5:         1       72         TMD8    1 Lymphoma                    24
6:         1       72         TMD8    1 Lymphoma                    24

gladkia commented 1 year ago

@ChristopherEeles, can you grant access to https://github.com/bhklab/gDR-integration for @gladkia and @bczech? I bet it's the data format issue but we will have a look :).

bczech commented 1 year ago

Hi @ChristopherEeles, the best way is to use the pipeline function instead of running each helper function separately. Please try running the runDrugResponseProcessingPipeline function.

ChristopherEeles commented 1 year ago

Hi @gladkia,

I have invited you and @bczech to the Genentech team within our organization. You should have access to the gDR-integration repo now.

@bczech, I will try running end-to-end now and let you know if it helps.

Best, Chris

bczech commented 1 year ago

Hi @ChristopherEeles,

Did you have a chance to check the solution by using runDrugResponseProcessingPipeline?

Best, Bartek

ChristopherEeles commented 1 year ago

Hi @bczech,

I was able to produce a drug combination SummarizedExperiment via runDrugResponseProcessingPipeline. However, all the curves were constant fits, which leads me to believe I have not mapped the identifiers correctly.

If you could provide a conceptual description of your data model, that would be very helpful. Specifically, I would like to know the exact definition of a bar-code, since we will likely never have this identifier in the published drug combination studies we are analyzing with PharmacoGx.

For example, I imagine it is a primary key used to uniquely identify some subset of more concrete identifiers (maybe drugs + concentrations + sample, or drugs + concentrations + replicates, or maybe you have some well/plate information in there as well).

With this information I can use a group by over those columns in our data sets to generate a bar-code even when it is not present.
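As a rough sketch of that group-by idea in base R: the helper `make_surrogate_barcode` is hypothetical, and the key columns used here are purely illustrative, since which columns actually define a bar-code is the open question in this comment.

```r
# Hypothetical helper: one integer ID per unique combination of key columns.
# Assumes the tab separator never occurs inside the key values themselves.
make_surrogate_barcode <- function(df, key_cols) {
  key <- do.call(paste, c(df[key_cols], sep = "\t"))
  match(key, unique(key))
}

# Toy rows reusing column names from the input table shown earlier.
toy <- data.frame(
  Plate        = c(241, 241, 241, 242, 242),
  DrugName     = c("A", "A", "B", "A", "A"),
  DrugName_2   = c("X", "X", "X", "X", "X"),
  CellLineName = "TMD8"
)

# Illustrative choice of key columns; swap in the real definition once known.
key_cols <- c("Plate", "DrugName", "DrugName_2", "CellLineName")
toy$Barcode <- make_surrogate_barcode(toy, key_cols)

print(toy)
```

IDs are assigned in order of first appearance, so rows sharing the same key always receive the same surrogate bar-code.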

I also don't think it's ideal to couple object creation with data modelling and would much prefer if I could use your helper functions to convert a PharmacoSet to a gDR SummarizedExperiment without applying curve fits, etc. Towards this end, I need to have a deeper understanding of your conceptual data model for the object (data schema).

I am planning to dig into the drugDoseResponsePipeline function to gain a deeper understanding next week and will reverse engineer it if necessary, but any additional information you can provide would be appreciated.

Best, Chris

ChristopherEeles commented 1 year ago

Also, if you have a less trivial example data set than the one in the vignette, that could also be helpful. I imagine you have some additional data for testing/validating your methods internally?

bczech commented 1 year ago

Hey @ChristopherEeles. The Barcode is the unique ID of the plate used in the HTS experiment in the EnVision machine. We use it to distinguish plates. The input data we use is a set of three file types: manifest, template, and raw data. Here you can find some example (synthetic) data: https://github.com/gdrplatform/gDRtestData/tree/master/gDRtestData/inst/testdata/raw_synthetic_data/small. As the first step, we use the gDRimport package to import the data, and then we use gDRcore to process it.

Regarding the documentation of our data model, @MarcHafner will send you an e-mail with the docs.

Best, Bartek

ChristopherEeles commented 1 year ago

Hi @bczech,

Thanks for getting back to me. I received the email from Marc and will dig into the gDR code again starting tomorrow morning.

Best, Chris

bczech commented 1 year ago

Hi @ChristopherEeles,

Have you managed to process the data using the pipeline function?

Best, Bartek

gdrplatform / gDRcore

"gDR does not yet support multiple series_identifiers, feature coming soon" #16