broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Perturbation Metadata File - Perturbation ID and MOA #5

Closed gwaybio closed 4 years ago

gwaybio commented 4 years ago

We have additional information for each compound assayed in the Drug Repurposing Hub Cell Painting Dataset.

There are at least four files on AWS that could each serve as a reference for compound metadata.

| File Name | Columns |
| --- | --- |
| pert_info.txt | pert_id, pert_iname, pert_type, moa |
| pert_iname_moa.txt | pert_iname, moa, source, url, support, num_sources |
| pert_id_to_iname.txt | pert_id, pert_iname, pert_type |
| pert_iname_moa_aggregated.txt | pert_iname, moa, pert_iname_modified |

Below I summarize each of the files.

pert_info.txt

image

pert_iname_moa.txt

image

pert_id_to_iname.txt

image

pert_iname_moa_aggregated.txt

image

Confirmed pert_info.txt subset

image

gwaybio commented 4 years ago

We now need to decide how to report compound metadata. In the past, I've used pert_info.txt (e.g. broadinstitute/cell-health#111). For completeness, my vote would be to combine pert_info.txt with the following columns from the other files: pert_iname_modified, source, url, support, and num_sources.

We can also add metadata for all compounds and add a boolean column (maybe cell_painted_A549) to denote which compounds we screened.

Also definitely open to suggestions and discussion!

gwaybio commented 4 years ago

cc @shntnu

shntnu commented 4 years ago

I'm looking up annotations on the Broad cluster

The annotations are going to be under the batch 2016_04_01_a549_48hr_batch1, but there are multiple folders corresponding to this batch because of the different ways we processed the data.

Only two of those folders, 2016_04_01_a549_48hr_batch1 and 2016_04_01_a549_48hr_batch1_cmap_style, have annotation files. I compare the relevant files below; none of the diffs produce any output, so the files are identical across the two folders.

metadata$ cd /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/metadata
metadata$
metadata$ diff \
> ./2016_04_01_a549_48hr_batch1/barcode_platemap.csv \
> ./2016_04_01_a549_48hr_batch1_cmap_style/barcode_platemap.csv
metadata$
metadata$ diff \
> ./2016_04_01_a549_48hr_batch1/cell_painting_dataset_moa.csv \
> ./2016_04_01_a549_48hr_batch1_cmap_style/cell_painting_dataset_moa.csv
metadata$
metadata$ diff \
> ./2016_04_01_a549_48hr_batch1_cmap_style/cell_painting_dataset_cmap_annotations_moa.csv \
> ./2016_04_01_a549_48hr_batch1/cell_painting_dataset_cmap_annotations_moa.csv
metadata$

The README has an explanation of what's what

2016_04_01_a549_48hr_batch1$ cd /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/metadata/2016_04_01_a549_48hr_batch1
2016_04_01_a549_48hr_batch1$ cat README
cell_painting_dataset_compound_list_anot.tsv    Annotation provided by Steven Corsello <corsello@broadinstitute.org>
cell_painting_dataset_moa.csv   Annotation obtained by querying Repurposing Hub (2016-12-08)
cpd_moa_and_gene_target_n90.txt Annotation provided by Lev Litichevskiy <lev@broadinstitute.org>
cell_painting_dataset_cmap_annotations_moa.csv Annotations copied from /cmap/projects/M1/annotation/pert_info_a2.txt

Note that these files were also shared with the PERISCOPE team here

Looking at the run log for this dataset here, it looks like we annotated the profiles using cell_painting_dataset_moa.csv, but then when running CMap's tools we decided to use the latter.

Side note: this is how cell_painting_dataset_cmap_annotations_moa.csv was created.

And this is how cell_painting_dataset_moa.csv was created i.e. by querying api.clue.io

So now the only thing to resolve is the difference between cell_painting_dataset_cmap_annotations_moa.csv and cell_painting_dataset_moa.csv. More on this next.

shntnu commented 4 years ago

Here is how they differ

First note that cell_painting_dataset_cmap_annotations_moa has only a subset of the annotations, so we can already conclude that between these two, cell_painting_dataset_moa is the more complete set.

na.omit(cell_painting_dataset_moa) %>% nrow()
#> [1] 1549
na.omit(cell_painting_dataset_cmap_annotations_moa) %>% nrow()
#> [1] 605
inner_join(na.omit(cell_painting_dataset_moa), na.omit(cell_painting_dataset_cmap_annotations_moa), by = c("pert_id", "pert_iname")) %>% nrow()
#> [1] 605
inner_join(na.omit(cell_painting_dataset_moa) %>% mutate(moa = tolower(moa)), na.omit(cell_painting_dataset_cmap_annotations_moa) %>% mutate(moa = tolower(moa)), by = c("pert_id", "pert_iname"))  %>% filter(moa.x != moa.y) %>% nrow()
#> [1] 146

Created on 2020-03-16 by the reprex package (v0.3.0)

inner_join(na.omit(cell_painting_dataset_moa) %>% mutate(moa = tolower(moa)), na.omit(cell_painting_dataset_cmap_annotations_moa) %>% mutate(moa = tolower(moa)), by = c("pert_id", "pert_iname"))  %>% filter(moa.x != moa.y) %>% knitr::kable() 

The two files disagree on 146/605 compounds. Some of these differences are trivial (e.g. quinapril and ramipril). But most are cases where only the first MOA listed was kept.

I verified that for all compounds where the MOAs agree, there was only one MOA listed per compound.

> inner_join(na.omit(cell_painting_dataset_moa) %>% mutate(moa = tolower(moa)), na.omit(cell_painting_dataset_cmap_annotations_moa) %>% mutate(moa = tolower(moa)), by = c("pert_id", "pert_iname"))  %>% filter(moa.x == moa.y) %>% pull(moa.x) %>% map_lgl(~str_detect(.x, "\\|")) %>% any()
[1] FALSE

For Broad's internal reference: we have a conversation about this in CMap's Slack channel here

pert_id pert_iname moa.x moa.y
BRD-K25650355 physostigmine cholinesterase inhibitor acetylcholinesterase inhibitor
BRD-K26657438 imiquimod interferon inducer|tlr agonist interferon inducer
BRD-K28542495 benzydamine prostanoid receptor antagonist membrane integrity inhibitor
BRD-K29359156 ebselen cyclooxygenase inhibitor|glutathione peroxidase agonist|h+/k+-atpase inhibitor|nitric oxide synthase inhibitor cyclooxygenase inhibitor
BRD-K29582115 ziprasidone dopamine receptor antagonist|serotonin receptor antagonist dopamine receptor antagonist
BRD-K29653726 topiramate carbonic anhydrase inhibitor|glutamate receptor antagonist|kainate receptor antagonist carbonic anhydrase inhibitor
BRD-K30480208 torasemide electrolyte reabsorption inhibitor|thromboxane receptor antagonist electrolyte reabsorption inhibitor
BRD-K15916496 clotrimazole cytochrome p450 inhibitor|imidazoline ligand cytochrome p450 inhibitor
BRD-K16444452 ibudilast leukotriene receptor antagonist|phosphodiesterase inhibitor leukotriene receptor antagonist
BRD-A17883755 lenalidomide anticancer agent antineoplastic
BRD-K99964838 bosutinib abl kinase inhibitor|bcr-abl kinase inhibitor|src inhibitor abl inhibitor
BRD-K12184916 NVP-BEZ235 mtor inhibitor|pi3k inhibitor mtor inhibitor
BRD-K32744045 disulfiram aldehyde dehydrogenase inhibitor|dna methyltransferase inhibitor|trpv agonist aldehyde dehydrogenase inhibitor
BRD-K33211335 dextromethorphan glutamate receptor antagonist|sigma receptor agonist glutamate receptor antagonist
BRD-A33447119 oxfendazole anthelmintic agent anthelmintic
BRD-A19195498 trimipramine norepinephrine reputake inhibitor|tricyclic antidepressant norepinephrine reuptake inhibitor
BRD-K19687926 lapatinib egfr inhibitor|erbb2 inhibitor egfr inhibitor
BRD-K25433859 maprotiline norepinephrine reputake inhibitor|tricyclic antidepressant norepinephrine reuptake inhibitor
BRD-K07881437 danusertib aurora kinase inhibitor|growth factor receptor inhibitor aurora kinase inhibitor
BRD-K08206212 entecavir dna replication inhibitor|reverse transcriptase inhibitor dna replication inhibitor
BRD-K10843433 phenylbutazone cyclooxygenase inhibitor|prostanoid receptor antagonist cyclooxygenase inhibitor
BRD-K10670311 sulfasalazine antirheumatic drug|nfkb pathway inhibitor antirheumatic
BRD-K80738081 resveratrol cytochrome p450 inhibitor|sirt activator cytochrome p450 inhibitor
BRD-K18895904 olanzapine dopamine receptor antagonist|serotonin receptor antagonist dopamine receptor antagonist
BRD-A18917088 estradiol estrogen receptor agonist contraceptive agent
BRD-K19462402 buflomedil adrenergic receptor antagonist|calcium channel blocker adrenergic receptor antagonist
BRD-A21858158 praziquantel anthelmintic agent anthelmintic
BRD-A24191444 ifenprodil adrenergic receptor antagonist|glutamate receptor antagonist adrenergic receptor antagonist
BRD-K23984367 sorafenib flt3 inhibitor|kit inhibitor|pdgfr tyrosine kinase receptor inhibitor|raf inhibitor|ret tyrosine kinase inhibitor|vegfr inhibitor flt3 inhibitor
BRD-K24576554 AT-9283 aurora kinase inhibitor|jak inhibitor abl inhibitor
BRD-A42759514 ornidazole antiprotozoal agent antiprotozoal
BRD-K44442813 pidotimod interferon receptor agonist|interleukin receptor agonist interferon receptor agonist
BRD-A44448661 pentobarbital barbiturate antiepileptic|gaba receptor modulator barbiturate antiepileptic
BRD-K51350053 toremifene estrogen receptor antagonist|selective estrogen receptor modulator (serm) estrogen receptor antagonist
BRD-A51382177 fosinopril angiotensin converting enzyme inhibitor ace inhibitor
BRD-K52989797 clomipramine serotonin transporter (sert) inhibitor serotonin transporter inhibitor (sert)
BRD-K53737926 amitriptyline norepinephrine inhibitor|norepinephrine reuptake inhibitor|serotonin receptor antagonist|serotonin reuptake inhibitor norepinephrine inhibitor
BRD-K54759182 dosulepin norepinephrine reuptake inhibitor|serotonin reuptake inhibitor|tricyclic antidepressant norepinephrine reuptake inhibitor
BRD-A56359832 zileuton leukotriene inhibitor|lipoxygenase inhibitor leukotriene inhibitor
BRD-K67868012 PI-103 mtor inhibitor|pi3k inhibitor mtor inhibitor
BRD-K51033547 tramadol norepinephrine reputake inhibitor|opioid receptor agonist|serotonin reuptake inhibitor norepinephrine reuptake inhibitor
BRD-A68009927 daunorubicin rna synthesis inhibitor|topoisomerase inhibitor rna synthesis inhibitor
BRD-K35960502 niclosamide dna replication inhibitor|stat inhibitor dna replication inhibitor
BRD-K36927236 glyburide atp channel blocker|insulin secretagogue|sulfonylurea sulfonylurea
BRD-K92731339 perindopril angiotensin converting enzyme inhibitor ace inhibitor
BRD-A89585551 mefloquine adenosine receptor antagonist|hemoglobin antagonist adenosine receptor antagonist
BRD-K90789829 nefazodone adrenergic inhibitor|norepinephrine reuptake inhibitor|serotonin receptor antagonist|serotonin reuptake inhibitor adrenergic inhibitor
BRD-K31342827 GF109203X pkc inhibitor cdk inhibitor
BRD-K28428262 brivanib fgfr inhibitor|vegfr inhibitor fgfr inhibitor
BRD-K32107296 temozolomide dna alkylating drug dna alkylating agent
BRD-A34255068 rolipram phosphodiesterase inhibitor interleukin receptor antagonist
BRD-A13084692 troglitazone insulin sensitizer|ppar receptor agonist insulin sensitizer
BRD-K13646352 midostaurin flt3 inhibitor|kit inhibitor|pkc inhibitor flt3 inhibitor
BRD-K13819402 desoxypeganine acetylcholinesterase inhibitor|monoamine oxidase inhibitor acetylcholinesterase inhibitor
BRD-K01638814 rilmenidine adrenergic receptor agonist|imidazoline receptor agonist adrenergic receptor agonist
BRD-K06902185 minoxidil katp activator|kir6 channel (katp) activator|vasodilator katp activator
BRD-K01815685 indole-3-carbinol aryl hydrocarbon receptor agonist|indoleamine 2,3-dioxygenase inhibitor aryl hydrocarbon receptor agonist
BRD-K02265150 amoxapine norepinephrine reputake inhibitor norepinephrine reuptake inhibitor
BRD-A02367930 ethinyl-estradiol estrogen receptor agonist|estrogenic component in oral contraceptives dna directed dna polymerase stimulant
BRD-K02404261 caffeine adenosine receptor antagonist|phosphodiesterase inhibitor adenosine receptor antagonist
BRD-A03216249 mepivacaine potassium channel blocker|sodium channel blocker potassium channel blocker
BRD-K03384561 roquinimex angiogenesis inhibitor|tnf production inhibitor angiogenesis inhibitor
BRD-K37289225 clozapine dopamine receptor antagonist|serotonin receptor antagonist dopamine receptor antagonist
BRD-K37846922 3,3'-diindolylmethane chk inhibitor|cytochrome p450 activator|indoleamine 2,3-dioxygenase inhibitor chk inhibitor
BRD-K38436528 imipramine norepinephrine reputake inhibitor|serotonin reuptake inhibitor norepinephrine reuptake inhibitor
BRD-K41260949 divalproex-sodium gaba receptor agonist hdac inhibitor
BRD-A43082555 loxoprofen cyclooxygenase inhibitor|prostanoid receptor antagonist cyclooxygenase inhibitor
BRD-A44090213 indoprofen cyclooxygenase inhibitor|prostanoid receptor antagonist cyclooxygenase inhibitor
BRD-A43882281 pinacidil atp channel activator|potassium channel activator atp channel activator
BRD-K55966568 orantinib fgfr inhibitor|pdgfr tyrosine kinase receptor inhibitor|vegfr inhibitor fgfr inhibitor
BRD-A58955223 sulforaphane anticancer agent|aryl hydrocarbon receptor antagonist antineoplastic
BRD-A39390670 rabeprazole atpase inhibitor|gastrin inhibitor atpase inhibitor
BRD-K41170226 deoxycholic-acid biliverdin reductase a activator|g protein-coupled receptor agonist biliverdin reductase a activator
BRD-A55913614 primaquine antimalarial agent|dna inhibitor antimalarial
BRD-K57545991 enalapril angiotensin converting enzyme inhibitor ace inhibitor
BRD-K59273480 propentofylline adenosine reuptake inhibitor|phosphodiesterase inhibitor adenosine reuptake inhibitor
BRD-K59369769 tozasertib aurora kinase inhibitor|bcr-abl kinase inhibitor|flt3 inhibitor|jak inhibitor aurora kinase inhibitor
BRD-K60237333 niacin nad precursor with lipid lowering effects|vitamin b nad precursor with lipid lowering effects
BRD-K51677086 erythromycin-ethylsuccinate cytochrome p450 inhibitor|protein synthesis inhibitor cytochrome p450 inhibitor
BRD-A51714012 venlafaxine adrenergic inhibitor|norepinephrine reuptake inhibitor|serotonin reuptake inhibitor adrenergic inhibitor
BRD-K53318339 vinpocetine phosphodiesterase inhibitor|sodium channel blocker phosphodiesterase inhibitor
BRD-K53857191 risperidone dopamine receptor antagonist|serotonin receptor antagonist dopamine receptor antagonist
BRD-K54416256 methimazole antithyroid agent antithyroid
BRD-K07572174 curcumin cyclooxygenase inhibitor|histone acetyltransferase inhibitor|lipoxygenase inhibitor|nfkb pathway inhibitor cyclooxygenase inhibitor
BRD-A48430263 pioglitazone insulin sensitizer|ppar receptor agonist insulin sensitizer
BRD-A50311610 meclizine constitutive androstane receptor (car) agonist car agonist
BRD-A64977602 mirtazapine adrenergic receptor antagonist|serotonin receptor antagonist adrenergic receptor antagonist
BRD-K63828191 raloxifene estrogen receptor antagonist|selective estrogen receptor modulator (serm) estrogen receptor antagonist
BRD-K67277431 picotamide thromboxane receptor antagonist|thromboxane synthase inhibitor thromboxane receptor antagonist
BRD-A48237631 mitomycin-c dna alkylating agent|dna synthesis inhibitor dna alkylating agent
BRD-K48300629 zonisamide sodium channel blocker|t-type calcium channel blocker sodium channel blocker
BRD-K49328571 dasatinib bcr-abl kinase inhibitor|ephrin inhibitor|kit inhibitor|pdgfr tyrosine kinase receptor inhibitor|src inhibitor|tyrosine kinase inhibitor bcr-abl kinase inhibitor
BRD-K49865102 PD-0325901 mek inhibitor map kinase inhibitor
BRD-K50422030 clomethiazole gaba receptor antagonist|gaba receptor modulator gaba receptor antagonist
BRD-A50675702 fipronil chloride channel blocker|gaba gated chloride channel blocker chloride channel blocker
BRD-K50398167 meclofenamic-acid cyclooxygenase inhibitor|prostanoid receptor antagonist cyclooxygenase inhibitor
BRD-A83081521 finasteride 5 alpha reductase inhibitor 5-alpha reductase inhibitor
BRD-K86930074 cediranib kit inhibitor|vegfr inhibitor kit inhibitor
BRD-A45889380 mepacrine cytokine production inhibitor|nfkb pathway inhibitor|tp53 activator cytokine production inhibitor
BRD-K47869605 podophyllotoxin microtubule inhibitor|tubulin inhibitor microtubule inhibitor
BRD-K81528515 nilotinib abl kinase inhibitor|bcr-abl kinase inhibitor abl inhibitor
BRD-K81473089 tacrine acetylcholinesterase inhibitor acetylcholine release stimulant
BRD-K76908866 CP-724714 egfr inhibitor|receptor tyrosine protein kinase inhibitor egfr inhibitor
BRD-K68103045 CGS-20625 benzodiazepine receptor agonist|gaba benzodiazepine site receptor partial agonist benzodiazepine receptor agonist
BRD-K70358946 aripiprazole serotonin receptor agonist|serotonin receptor antagonist serotonin receptor agonist
BRD-K70778732 trazodone adrenergic receptor antagonist|serotonin receptor antagonist|serotonin reuptake inhibitor adrenergic receptor antagonist
BRD-K72222507 quinapril angiotensin converting enzyme inhibitor ace inhibitor
BRD-K68488863 ENMD-2076 aurora kinase inhibitor|flt3 inhibitor|vegfr inhibitor aurora kinase inhibitor
BRD-K70557564 zosuquidar p glycoprotein inhibitor p-glycoprotein inhibitor
BRD-K71035033 masitinib kit inhibitor|pdgfr tyrosine kinase receptor inhibitor|src inhibitor kit inhibitor
BRD-A68281735 REV-5901 leukotriene receptor antagonist|lipoxygenase inhibitor leukotriene receptor antagonist
BRD-K68867920 quetiapine dopamine receptor antagonist|serotonin receptor antagonist dopamine receptor antagonist
BRD-K35189033 levonorgestrel estrogen receptor agonist|glucocorticoid receptor antagonist|progesterone receptor agonist|progesterone receptor antagonist estrogen receptor agonist
BRD-A70083328 secnidazole acetylcholinesterase inhibitor|microtubule inhibitor acetylcholinesterase inhibitor
BRD-K71103788 duloxetine norepinephrine reuptake inhibitor|serotonin reuptake inhibitor norepinephrine reuptake inhibitor
BRD-K74305673 IKK-2-inhibitor-V ikk inhibitor|nfkb pathway inhibitor ikk inhibitor
BRD-A75172220 hydrocortisone glucocorticoid receptor agonist corticosteroid agonist
BRD-K78692225 leflunomide dihydroorotate dehydrogenase inhibitor|pdgfr tyrosine kinase receptor inhibitor dihydroorotate dehydrogenase inhibitor
BRD-K78431006 crizotinib alk tyrosine kinase receptor inhibitor alk inhibitor
BRD-K91301684 noscapine bradykinin receptor antagonist|tubulin inhibitor bradykinin receptor antagonist
BRD-K74514084 pazopanib kit inhibitor|pdgfr tyrosine kinase receptor inhibitor|vegfr inhibitor kit inhibitor
BRD-A75479906 rimantadine antiviral|rna synthesis inhibitor antiviral
BRD-K75641298 metoclopramide dopamine receptor antagonist|serotonin receptor antagonist dopamine receptor antagonist
BRD-K78126613 menadione mitochondrial dna polymerase inhibitor|phosphatase inhibitor mitochondrial dna polymerase inhibitor
BRD-K79131256 albendazole tubulin inhibitor anthelmintic
BRD-K89348303 ramipril angiotensin converting enzyme inhibitor ace inhibitor
BRD-K89162000 tandutinib flt3 inhibitor|kit inhibitor|pdgfr tyrosine kinase receptor inhibitor flt3 inhibitor
BRD-K91315211 betahistine histamine receptor agonist|histamine receptor antagonist histamine receptor agonist
BRD-K91601245 mercaptopurine immunosuppressant|protein synthesis inhibitor|purine antagonist immunosuppressant
BRD-A91699651 chloroquine antimalarial agent antimalarial
BRD-K99749624 linifanib pdgfr tyrosine kinase receptor inhibitor|vegfr inhibitor pdgfr receptor inhibitor
BRD-K92723993 imatinib bcr-abl kinase inhibitor|kit inhibitor|pdgfr tyrosine kinase receptor inhibitor bcr-abl kinase inhibitor
BRD-K93880783 stavudine dna directed dna polymerase inhibitor|reverse transcriptase inhibitor dna directed dna polymerase inhibitor
BRD-K95763993 trapidil pdgfr tyrosine kinase receptor inhibitor pdgfr receptor inhibitor
BRD-K96319534 phentermine dopamine uptake inhibitor|serotonin reuptake inhibitor dopamine uptake inhibitor
BRD-K93034159 cladribine adenosine deaminase inhibitor|ribonucleoside reductase inhibitor adenosine deaminase inhibitor
BRD-K92428153 mycophenolate-mofetil dehydrogenase inhibitor|inositol monophosphatase inhibitor dehydrogenase inhibitor
BRD-A92537424 danazol estrogen receptor antagonist|progesterone receptor agonist estrogen receptor antagonist
BRD-K92984783 melperone dopamine receptor antagonist|serotonin receptor antagonist dopamine receptor antagonist
BRD-K93460210 lamotrigine serotonin receptor antagonist|sodium channel blocker serotonin receptor antagonist
BRD-A93424738 dexamethasone-acetate glucocorticoid receptor agonist corticosteroid agonist
BRD-K93754473 tamoxifen estrogen receptor antagonist|selective estrogen receptor modulator (serm) estrogen receptor antagonist
BRD-K94830329 ataluren cftr channel agonist|dystrophin stimulant cftr channel agonist
BRD-K63750851 mycophenolic-acid dehydrogenase inhibitor|inositol monophosphatase inhibitor dehydrogenase inhibitor
BRD-K97440753 dihydroergocristine adrenergic receptor antagonist|prolactin inhibitor adrenergic receptor antagonist
BRD-A97437073 rosiglitazone insulin sensitizer|ppar receptor agonist insulin sensitizer

gwaybio commented 4 years ago

Interesting, it looks like moa.y (column moa in cell_painting_dataset_cmap_annotations_moa.csv, the second table in the join above) is truncated at the first | and also remapped based on some dictionary somewhere.

| Recode | Example |
| --- | --- |
| by pipe | dopamine receptor antagonist <pipe> serotonin receptor antagonist becomes dopamine receptor antagonist |
| some internal dictionary | glucocorticoid receptor agonist becomes corticosteroid agonist |
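
The two kinds of recoding in the table above can be told apart mechanically. Here is a minimal Python sketch, assuming a case-insensitive comparison as in the R snippets earlier in the thread. The function name `classify_moa_difference` and its labels are hypothetical, and the example rows are copied from the discrepancy table above.

```python
def classify_moa_difference(moa_full: str, moa_short: str) -> str:
    """Label how a shortened MOA annotation relates to the full one.

    moa_full  -- pipe-delimited annotation (moa.x in the table above)
    moa_short -- annotation from the other file (moa.y in the table above)
    """
    full_parts = [part.strip().lower() for part in moa_full.split("|")]
    short = moa_short.strip().lower()

    if "|".join(full_parts) == short:
        return "identical"
    if len(full_parts) > 1 and full_parts[0] == short:
        return "first_moa_only"      # truncated at the first pipe
    if short in full_parts:
        return "other_listed_moa"    # a later entry was kept instead
    return "recoded"                 # a different term entirely


examples = [
    # ziprasidone
    ("dopamine receptor antagonist|serotonin receptor antagonist",
     "dopamine receptor antagonist"),
    # hydrocortisone
    ("glucocorticoid receptor agonist", "corticosteroid agonist"),
    # glyburide
    ("atp channel blocker|insulin secretagogue|sulfonylurea", "sulfonylurea"),
]

for moa_full, moa_short in examples:
    print(classify_moa_difference(moa_full, moa_short), "<-", moa_short)
```

Counting the labels over all 146 disagreeing rows would show how many are pure truncations and how many need the (unknown) dictionary to explain.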

Before we decide on a master reference list, it would be good to understand some decisions that went into compiling this list. A couple of questions below:

My vote would be to include all information, rather than recoding. Or, possibly, do both. We can include a column moa_recode_simple or something like that, provide instructions on how the column was formed, and include both columns. An additional thing to consider depends on the meaning of the pipes (|). We could split on the pipes and duplicate compound rows to make the table longer.

e.g.

| Compound | moa | moa_recode_simple |
| --- | --- | --- |
| Magic Compound A | dopamine receptor antagonist <pipe> serotonin receptor antagonist | dopamine receptor antagonist |

becomes:

| Compound | moa | moa_recode_simple | moa_expand |
| --- | --- | --- | --- |
| Magic Compound A | dopamine receptor antagonist <pipe> serotonin receptor antagonist | dopamine receptor antagonist | dopamine receptor antagonist |
| Magic Compound A | dopamine receptor antagonist <pipe> serotonin receptor antagonist | dopamine receptor antagonist | serotonin receptor antagonist |
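
The row duplication shown above could be sketched as follows. This is only an illustration in plain Python: `expand_moa_rows` is a hypothetical helper, and the column names are the ones proposed in this comment, not an agreed schema.

```python
def expand_moa_rows(rows, column="moa", new_column="moa_expand", sep="|"):
    """Return one output row per delimited value found in `column`."""
    expanded = []
    for row in rows:
        for value in row[column].split(sep):
            new_row = dict(row)              # copy so input rows stay untouched
            new_row[new_column] = value.strip()
            expanded.append(new_row)
    return expanded


rows = [
    {
        "Compound": "Magic Compound A",
        "moa": "dopamine receptor antagonist|serotonin receptor antagonist",
        "moa_recode_simple": "dopamine receptor antagonist",
    }
]

for row in expand_moa_rows(rows):
    print(row["Compound"], "->", row["moa_expand"])
```

Keeping the original `moa` column on every expanded row means the long table can always be collapsed back to one row per compound.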

These things have probably already been thought deeply about. If a decision has already been made, then please disregard these musings! It will still be very good to zero in on a master list, so that many downstream projects can benefit

gwaybio commented 4 years ago

Noting an important distinction here after zooming with @shntnu - the pipes (|) are just standard delimiters. They most likely just mean that the compound has all of the listed MOAs. It is also likely that the order of the entries between pipes is not relevant.

Something else to possibly consider is strength of assignment - i.e. how confident are we that a compound is correctly annotated with a specific MOA? (Maybe this is too complicated and not really necessary, though.)

I am set to meet with the CLUE team over virtual office hours at 1pm tomorrow https://clue.io/office-hours

shntnu commented 4 years ago

All this comes down to resolving a question (Greg does not have access) I had back in Oct 2016: how do we explain the differences between the annotations that Steven Corsello gave us cell_painting_dataset_compound_list_anot.tsv (*) and those that were generated by querying api.clue.io cell_painting_dataset_moa.csv.

I think the best path forward is to get the latest MOA annotations from CMap (and just double check that they have been "approved" by Steven Corsello) @gwaygenomics

(*) -- Steven's note when he sent me this file (Steven Corsello, Aug 31st, 2016 at 9:39 AM): I joined in the compound names and MoAs. This required some manual curation as a few of the perts were not actually in the REP library (some were added to the L1000 plates for other reasons). Looks like we have a few MoAs with multiple compounds. See you in a bit!

shntnu commented 4 years ago

@gwaygenomics note that I have updated my comments above https://github.com/broadinstitute/lincs-cell-painting/issues/5#issuecomment-599520284 and https://github.com/broadinstitute/lincs-cell-painting/issues/5#issuecomment-599523129

gwaybio commented 4 years ago

Thanks @shntnu - this additional context is helpful.

I had a productive conversation with @tnat1031 at CLUE office hours this afternoon. I will summarize our conversation below (most of this is probably obvious 😄):

image

Next Steps

I think the next steps are as follows:

  1. Download the two files from CLUE and add them to this repo somewhere like metadata/
  2. Copy their access policy in the metadata/ folder README (pasted below)
  3. Include version information (date and MD5 check) in that README
  4. Add in a jupyter notebook file to merge the two files, stripping the headers, and making a single unified resource.
  5. Add in a jupyter notebook to explore the data - ask questions like (how many compounds per MOA, how many targets, basically any interesting metadata question)
  6. Settle on a processing pipeline to get the cell painting profiles online! (to be described in a separate issue)

DO I NEED TO REGISTER TO ACCESS INFORMATION FROM THE REPURPOSING HUB? No, the annotations provided in the hub are freely available for research use by any organization. The information in the Repurposing Hub may not be repackaged or redistributed for commercial purposes without permission.
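
Step 3 in the list above (recording a date and MD5 check for each download) could be done with the Python standard library. A minimal sketch; the helper names, the tab-separated README layout, the demo file name, and the date placeholder are all assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def readme_version_line(path, download_date):
    """Format one tab-separated README entry: file name, date, MD5."""
    return f"{Path(path).name}\t{download_date}\t{md5_of_file(path)}"


# Demo on a throwaway file; in practice `path` would be a downloaded CLUE file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "example_download.txt"   # hypothetical name
    demo_path.write_bytes(b"pert_iname\tmoa\n")
    demo_line = readme_version_line(demo_path, "YYYY-MM-DD")  # date placeholder
    print(demo_line)
```

Re-running `md5_of_file` on a fresh download and comparing against the README entry is a quick way to notice when CLUE has published a new export.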

shntnu commented 4 years ago

@gwaygenomics This notebook is not related to the MOA question but will be pertinent https://rpubs.com/shantanu/repurp-annotations which is the same as this notebook

shntnu commented 4 years ago

A relevant conversation thread:

Forwarded Conversation Subject: Repurposing annotations – duplicates?

From: Shantanu Singh

Hi Steven, I hope you are well. JT and I were using the repurposing collection
annotations for one of our projects, and I noticed that a small handful of
drugs have duplicate entries (same pert_ids, but different drug names, and
their MOA and target annotations don't always match). Here's a
summary: http://rpubs.com/shantanu/repurp-annotations

Do you have any advice on how to handle this? Is it reasonable to merge all
the annotations for a pert_id, as I have done in the notebook? Thanks, Shantanu

From: Steven Corsello

Hi Shantanu, Thanks for looking into this. Most of these issues are related to
incorrect structure/stereochemistry (or a mixture treated as a salt form) for
a registered compound. Some have been fixed (by Josh, cc'd) but we have not
generated a fresh export of the compound IDs from CBIP for one year. The core
Broad ID changes when the structure is curated. We're planning a database
update for this fall that will also add ~800 new compounds to the library. For
now, I suggest trusting our assigned name and metadata over the structure/core
ID alone for these compounds. Josh, could you please check to see if all of
these structures are fixed? Best, Steven

From: Joshua Sacher 

Hi Shantanu, Thanks for checking through the data set and providing the
relevant code. This is actually a surprisingly small list compared to what
needs to be done, so I'm happy to go through it and update any data you need
by hand. The more interesting/difficult list is when 1 name has more than 1
Core ID (see Actinomycin-d, for instance). There will be ~400
curations/changes in the next update to begin to address this. What
information would be helpful for your experiments? The correct Broad ID for
each of these names? Josh

From: Shantanu Singh 

Hi Steven and Josh, Thanks for getting back to me so quickly.  The
inconsistencies that I found are those where 1 core ID has multiple names
(as well as different MOAs).  I had initially assumed a different explanation
– that the multiple names are in fact synonyms of sorts, and that the
differences in the MOA annotations were just incomplete annotation. But it
looks like these are actually ambiguous mappings i.e. totally different drugs
that have the same core ID (a.k.a. pert_id), and that's what I'm hoping to
resolve. I have attached these cases (duplicate_annotations.csv, same as the
second table in the notebook). Within this set, I'm looking to resolve
annotations of 7 compounds that were tested in Cell Painting
(duplicate_pert_ids_and_broad_ids_cell_painting.csv; broad ids are included),
and the question here is: which compounds do these correspond to, given the
multiple names? In all, this is a pretty tiny problem: just 7 out of 1500+
compounds we have tested in Cell Painting have this issue. So it's really not a
big deal for us to drop those 7 in our analysis, but please keep me posted
when you fix it. Thanks again!

Shantanu

Notes for myself:

cell_painting_compounds %>%
  inner_join(duplicate_pert_ids) %>%
  select(-n) %>%
  inner_join(annotated_compound_list %>% select(pert_id, broad_id) %>% distinct()) %>%
  write_csv("duplicate_pert_ids_and_broad_ids_cell_painting.csv")

duplicate_annotations.csv.txt duplicate_pert_ids_and_broad_ids_cell_painting.csv.txt
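
The ambiguity described in this thread (one pert_id carrying several drug names) can be surfaced with a simple grouping pass. A minimal sketch in Python; the rows below are made-up placeholders rather than real annotations, and `find_ambiguous_pert_ids` is a hypothetical helper.

```python
from collections import defaultdict


def find_ambiguous_pert_ids(rows):
    """Map each pert_id with more than one distinct pert_iname to its sorted names."""
    names_by_id = defaultdict(set)
    for row in rows:
        names_by_id[row["pert_id"]].add(row["pert_iname"])
    return {
        pert_id: sorted(names)
        for pert_id, names in names_by_id.items()
        if len(names) > 1
    }


# Placeholder annotations, for illustration only.
annotations = [
    {"pert_id": "BRD-X00000001", "pert_iname": "drug-a", "moa": "moa-1"},
    {"pert_id": "BRD-X00000001", "pert_iname": "drug-b", "moa": "moa-2"},
    {"pert_id": "BRD-X00000002", "pert_iname": "drug-c", "moa": "moa-3"},
]

ambiguous = find_ambiguous_pert_ids(annotations)
print(ambiguous)
```

The pert_ids returned here are the ones that would either be dropped from an analysis, as suggested in the last email, or resolved by hand once corrected IDs are available.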

shntnu commented 4 years ago

> The first MOA listed has more literature support than the other MOAs. The other MOAs are "less supported".

Here's a list of compounds with multiple MOAs. It doesn't appear that this ^ is true because the MOAs are sorted lexicographically.

repurposing_annotations <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_drugs_20180907.txt", comment = "!")

repurposing_annotations %>% 
  mutate(n_moas = 1 + str_count(moa, "\\|")) %>% 
  filter(str_detect(moa, "\\|")) %>% 
  arrange(desc(n_moas)) %>% 
  select(pert_iname, moa, n_moas) %>% 
  mutate(moa = str_replace_all(moa, "\\|", ";")) %>% 
  knitr::kable() %>% 
  write_lines("~/Desktop/moa.txt")

I get the same ordering via the API e.g. https://api.clue.io/api/rep_drug_moas/?filter={%22where%22:{%22pert_iname%22:%22BMS-777607%22},%22fields%22:{%22name%22:true}}&user_key=dd910094ca286fc4ae3b174ae0ca70b1

[{"name":"AXL kinase inhibitor"},{"name":"c-Met inhibitor"},{"name":"FLT3 inhibitor"},{"name":"hepatocyte growth factor receptor inhibitor"},{"name":"macrophage migration inhibiting factor inhibitor"},{"name":"tyrosine kinase inhibitor"}]

| pert_iname | moa | n_moas |
| --- | --- | --- |
| BMS-777607 | AXL kinase inhibitor;c-Met inhibitor;FLT3 inhibitor;hepatocyte growth factor receptor inhibitor;macrophage migration inhibiting factor inhibitor;tyrosine kinase inhibitor | 6 |
| dasatinib | Bcr-Abl kinase inhibitor;ephrin inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;src inhibitor;tyrosine kinase inhibitor | 6 |
| regorafenib | FGFR inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;RAF inhibitor;RET tyrosine kinase inhibitor;VEGFR inhibitor | 6 |
| sorafenib | FLT3 inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;RAF inhibitor;RET tyrosine kinase inhibitor;VEGFR inhibitor | 6 |
| amuvatinib | FLT3 inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;RAD51 inhibitor;RET tyrosine kinase inhibitor | 5 |
| dovitinib | EGFR inhibitor;FGFR inhibitor;FLT3 inhibitor;PDGFR tyrosine kinase receptor inhibitor;VEGFR inhibitor | 5 |
| LY294002 | mTOR inhibitor;PI3K inhibitor;DNA dependent protein kinase inhibitor;phosphodiesterase inhibitor;PLK inhibitor | 5 |
| sunitinib | FLT3 inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;RET tyrosine kinase inhibitor;VEGFR inhibitor | 5 |
| amitriptyline | norepinephrine inhibitor;norepinephrine reuptake inhibitor;serotonin receptor antagonist;serotonin-norepinephrine reuptake inhibitor (SNRI) | 4 |
| caffeic-acid | HIV integrase inhibitor;lipoxygenase inhibitor;nitric oxide production inhibitor;tumor necrosis factor production inhibitor | 4 |
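
Whether a given MOA string is in lexicographic order can be checked directly rather than by eye. A minimal Python sketch, assuming a case-insensitive comparison (an assumption, since the thread does not say how case is treated); `moas_sorted` is a hypothetical helper, and the two strings are copied from the table above, where pipes were replaced by `;` for display.

```python
def moas_sorted(moa: str, sep: str = "|") -> bool:
    """True if the delimited MOAs are in case-insensitive lexicographic order."""
    parts = [part.strip() for part in moa.split(sep)]
    return parts == sorted(parts, key=str.lower)


examples = {
    "BMS-777607": (
        "AXL kinase inhibitor;c-Met inhibitor;FLT3 inhibitor;"
        "hepatocyte growth factor receptor inhibitor;"
        "macrophage migration inhibiting factor inhibitor;"
        "tyrosine kinase inhibitor"
    ),
    "LY294002": (
        "mTOR inhibitor;PI3K inhibitor;DNA dependent protein kinase inhibitor;"
        "phosphodiesterase inhibitor;PLK inhibitor"
    ),
}

for name, moa in examples.items():
    print(name, moas_sorted(moa, sep=";"))
```

Under this comparison the LY294002 row is not in sorted order, so running the check over the full file would show how consistently the ordering holds.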

shntnu commented 4 years ago

Clarifying note from Ted:

Yes, I think in the REP hub the MoAs are lexicographically ordered. When I spoke to Greg on Tuesday, he was also using a file of MoA annotations that he (or maybe you) had gotten from CMap a while back, and in that file we had ordered them according to how much literature support each had. If it helps, I think if there's any discrepancy between the MoAs for a given compound, the REP hub data should be considered the authority.

gwaybio commented 4 years ago

@shntnu - in #7 I begin processing the CLUE repurposing data. One additional question: it looks like the pert_id column is not provided. Is this something I should add? I imagine it would simply be truncating the broad_id column (e.g. BRD-K89787693-001-01-1 becomes BRD-K89787693)

I can then also create a third file (something like repurposing_info_simple.tsv) that only has unique columns pert_id, pert_iname, moa, and target. This file would be useful, because, presumably pert_id is the primary mapping between the CLUE resources and the Cell Painting profiles. cc @niranjchandrasekaran

shntnu commented 4 years ago

> @shntnu - in #7 I begin processing the CLUE repurposing data. One additional question: it looks like the pert_id column is not provided. Is this something I should add? I imagine it would simply be truncating the broad_id column (e.g. BRD-K89787693-001-01-1 becomes BRD-K89787693)

Yes, the first two components of the broad_id are the pert_id (defined here: "A unique identifier for a perturbagen that refers to the perturbagen in general, not to any particular batch or sample.")
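
The truncation described here amounts to keeping the first two hyphen-separated components. A minimal Python sketch; `broad_id_to_pert_id` is a hypothetical helper name, and the example ID is the one quoted above.

```python
def broad_id_to_pert_id(broad_id: str) -> str:
    """Keep the first two hyphen-separated components of a Broad ID."""
    parts = broad_id.strip().split("-")
    if len(parts) < 2:
        raise ValueError(f"unexpected Broad ID format: {broad_id!r}")
    return "-".join(parts[:2])


print(broad_id_to_pert_id("BRD-K89787693-001-01-1"))
```

Applying the function to a value that is already a pert_id returns it unchanged, so it is safe to run on mixed columns.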

> I can then also create a third file (something like repurposing_info_simple.tsv) that only has unique columns pert_id, pert_iname, moa, and target. This file would be useful, because, presumably pert_id is the primary mapping between the CLUE resources and the Cell Painting profiles. cc @niranjchandrasekaran

This seems reasonable. Note that with JUMP CP, we ended up with a proliferation of such derived files (but that was because we were dealing with many sources), and it can get confusing after a while (ideally, we'd design it like a relational database). But this will likely be the only derived metadata file, so that's ok.

gwaybio commented 4 years ago

The perturbation metadata file is added in #7.

Based on discussions with @shntnu, @niranjchandrasekaran, @jrsacher, and @tnat1031, we've settled on the Repurposing Hub metadata resources and decided to add 3 perturbation metadata files to the lincs-cell-painting repo.

Thanks everyone for the speedy contributions and comprehensive documentation!

Below is a description of the three resulting output files after processing. I will close this issue now that it is addressed in #7

Metadata Resources Added

| Filename | Description | Derivation | Dimensions |
| --- | --- | --- | --- |
| repurposing_info.tsv | The primary file storing every column in repurposing hub drugs and samples metadata file | Inner join on drugs and samples pert_iname column | 13,553 x 17 |
| repurposing_info_long.tsv | Stores every unique perturbation with singular values in target and moa columns | Split aforementioned columns by pipe (e.g. entry ACT1<pipe>ACT3 becomes two independent rows) | 39,471 x 20 |
| repurposing_simple.tsv | Core identifiers and biological metadata | Splits off aliquot and batch information from Broad ID to create a pert_id and drops duplicate entries in target, moa, and pert_iname columns | 6,806 x 4 |
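
The derivation of the first file (an inner join of the drugs and samples tables on pert_iname, after stripping the header lines) could look roughly like the sketch below. This is not the code from #7: it uses only the Python standard library, the helper names are hypothetical, the header lines are assumed to start with `!` (as the `comment = "!"` argument earlier in the thread suggests), and the rows are made-up placeholders.

```python
import csv


def read_clue_table(text):
    """Parse a tab-separated CLUE download, skipping '!' header lines."""
    data_lines = [
        line for line in text.splitlines()
        if line and not line.startswith("!")
    ]
    return list(csv.DictReader(data_lines, delimiter="\t"))


def inner_join(left, right, key):
    """Inner join two lists of dict rows on a shared key column."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right_by_key.get(left_row[key], [])
    ]


# Placeholder content standing in for the two downloaded files.
drugs_text = (
    "!header line to strip\n"
    "pert_iname\tmoa\ttarget\n"
    "drug-a\tmoa-1|moa-2\tGENE1\n"
    "drug-b\tmoa-3\tGENE2\n"
)
samples_text = (
    "!header line to strip\n"
    "broad_id\tpert_iname\n"
    "BRD-X00000001-001-01-1\tdrug-a\n"
    "BRD-X00000001-002-01-1\tdrug-a\n"
    "BRD-X00000003-001-01-1\tdrug-c\n"
)

joined = inner_join(read_clue_table(samples_text), read_clue_table(drugs_text), "pert_iname")
for row in joined:
    print(row["broad_id"], row["pert_iname"], row["moa"])
```

Because the join is an inner join, names present in only one of the two tables (drug-b and drug-c in the placeholder data) drop out of the result.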

TODO

Note, steps 5 and 6 outlined in https://github.com/broadinstitute/lincs-cell-painting/issues/5#issuecomment-600208689 still need to be addressed:

  1. Add in a jupyter notebook to explore the data - ask questions like (how many compounds per MOA, how many targets, basically any interesting metadata question)
  2. Settle on a processing pipeline to get the cell painting profiles online! (to be described in a separate issue)

I will document these steps in a separate issue and PR.