Closed gwaybio closed 4 years ago
We now need to decide how to report compound metadata information. In the past, I've used pert_info.txt (e.g. broadinstitute/cell-health#111). For completeness, my vote would be to combine pert_info.txt with information from the following columns: pert_iname_modified, source, url, support, and num_sources.
We can also add metadata for all compounds and add a boolean column (maybe cell_painted_A549) to denote which compounds we screened.
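A minimal sketch of the boolean-column idea, in base R. Only the column name cell_painted_A549 comes from the suggestion above; the data frame, the list of screened IDs, and the third compound are placeholders made up for illustration.

```r
# Hypothetical inputs: a table of all compounds and the pert_ids screened in A549.
# The third row is an invented placeholder, not a real compound.
all_compounds <- data.frame(
  pert_id = c("BRD-K25650355", "BRD-K26657438", "BRD-K00000000"),
  pert_iname = c("physostigmine", "imiquimod", "placeholder-compound"),
  stringsAsFactors = FALSE
)
screened_pert_ids <- c("BRD-K25650355", "BRD-K26657438")

# TRUE when the compound was part of the Cell Painting screen, FALSE otherwise.
all_compounds$cell_painted_A549 <- all_compounds$pert_id %in% screened_pert_ids

print(all_compounds)
```

Keeping every compound and flagging the screened ones means downstream users can filter on the column instead of needing a second file.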
Also definitely open to suggestions and discussion!
cc @shntnu
I'm looking up annotations on the Broad cluster.
The annotations are going to be under the batch 2016_04_01_a549_48hr_batch1, but there are multiple folders corresponding to this batch because of the different ways we processed the data.
Only two of those folders, 2016_04_01_a549_48hr_batch1 and 2016_04_01_a549_48hr_batch1_cmap_style, have annotation files. I compare the relevant files below.
metadata$ cd /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/metadata
metadata$
metadata$ diff \
> ./2016_04_01_a549_48hr_batch1/barcode_platemap.csv \
> ./2016_04_01_a549_48hr_batch1_cmap_style/barcode_platemap.csv
metadata$
metadata$ diff \
> ./2016_04_01_a549_48hr_batch1/cell_painting_dataset_moa.csv \
> ./2016_04_01_a549_48hr_batch1_cmap_style/cell_painting_dataset_moa.csv
metadata$
metadata$ diff \
> ./2016_04_01_a549_48hr_batch1_cmap_style/cell_painting_dataset_cmap_annotations_moa.csv \
> ./2016_04_01_a549_48hr_batch1/cell_painting_dataset_cmap_annotations_moa.csv
metadata$
The README has an explanation of what's what
2016_04_01_a549_48hr_batch1$ cd /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/metadata/2016_04_01_a549_48hr_batch1
2016_04_01_a549_48hr_batch1$ cat README
cell_painting_dataset_compound_list_anot.tsv Annotation provided by Steven Corsello <corsello@broadinstitute.org>
cell_painting_dataset_moa.csv Annotation obtained by querying Repurposing Hub (2016-12-08)
cpd_moa_and_gene_target_n90.txt Annotation provided by Lev Litichevskiy <lev@broadinstitute.org>
cell_painting_dataset_cmap_annotations_moa.csv Annotations copied from /cmap/projects/M1/annotation/pert_info_a2.txt
Note that these files were also shared with the PERISCOPE team here
Looking at the run log for this dataset here, it looks like we annotated the profiles using cell_painting_dataset_moa.csv, but then when running CMap's tools we decided to use the latter (cell_painting_dataset_cmap_annotations_moa.csv).
Side note: this is how cell_painting_dataset_cmap_annotations_moa.csv was created.
And this is how cell_painting_dataset_moa.csv was created, i.e. by querying api.clue.io.
So now the only thing to resolve is the difference between cell_painting_dataset_cmap_annotations_moa.csv and cell_painting_dataset_moa.csv. More on this next.
Here is how they differ.
First note that cell_painting_dataset_cmap_annotations_moa has only a subset of the annotations, so we can already conclude that between these two, cell_painting_dataset_moa is the more complete set.
na.omit(cell_painting_dataset_moa) %>% nrow()
#> [1] 1549
na.omit(cell_painting_dataset_cmap_annotations_moa) %>% nrow()
#> [1] 605
inner_join(na.omit(cell_painting_dataset_moa), na.omit(cell_painting_dataset_cmap_annotations_moa), by = c("pert_id", "pert_iname")) %>% nrow()
#> [1] 605
inner_join(na.omit(cell_painting_dataset_moa) %>% mutate(moa = tolower(moa)), na.omit(cell_painting_dataset_cmap_annotations_moa) %>% mutate(moa = tolower(moa)), by = c("pert_id", "pert_iname")) %>% filter(moa.x != moa.y) %>% nrow()
#> [1] 146
Created on 2020-03-16 by the reprex package (v0.3.0)
inner_join(na.omit(cell_painting_dataset_moa) %>% mutate(moa = tolower(moa)), na.omit(cell_painting_dataset_cmap_annotations_moa) %>% mutate(moa = tolower(moa)), by = c("pert_id", "pert_iname")) %>% filter(moa.x != moa.y) %>% knitr::kable()
The two files disagree on 146/605 compounds. Some of these differences are trivial (e.g. quinapril and ramipril). But most are of the category where only the first MOA listed was chosen.
I verified that for all compounds where the MOAs agree, there was only one MOA listed per compound:
> inner_join(na.omit(cell_painting_dataset_moa) %>% mutate(moa = tolower(moa)), na.omit(cell_painting_dataset_cmap_annotations_moa) %>% mutate(moa = tolower(moa)), by = c("pert_id", "pert_iname")) %>% filter(moa.x == moa.y) %>% pull(moa.x) %>% map_lgl(~str_detect(.x, "\\|")) %>% any()
[1] FALSE
For Broad's internal reference: we have a conversation about this in CMap's Slack channel here
pert_id | pert_iname | moa.x | moa.y |
---|---|---|---|
BRD-K25650355 | physostigmine | cholinesterase inhibitor | acetylcholinesterase inhibitor |
BRD-K26657438 | imiquimod | interferon inducer|tlr agonist | interferon inducer |
BRD-K28542495 | benzydamine | prostanoid receptor antagonist | membrane integrity inhibitor |
BRD-K29359156 | ebselen | cyclooxygenase inhibitor|glutathione peroxidase agonist|h+/k+-atpase inhibitor|nitric oxide synthase inhibitor | cyclooxygenase inhibitor |
BRD-K29582115 | ziprasidone | dopamine receptor antagonist|serotonin receptor antagonist | dopamine receptor antagonist |
BRD-K29653726 | topiramate | carbonic anhydrase inhibitor|glutamate receptor antagonist|kainate receptor antagonist | carbonic anhydrase inhibitor |
BRD-K30480208 | torasemide | electrolyte reabsorption inhibitor|thromboxane receptor antagonist | electrolyte reabsorption inhibitor |
BRD-K15916496 | clotrimazole | cytochrome p450 inhibitor|imidazoline ligand | cytochrome p450 inhibitor |
BRD-K16444452 | ibudilast | leukotriene receptor antagonist|phosphodiesterase inhibitor | leukotriene receptor antagonist |
BRD-A17883755 | lenalidomide | anticancer agent | antineoplastic |
BRD-K99964838 | bosutinib | abl kinase inhibitor|bcr-abl kinase inhibitor|src inhibitor | abl inhibitor |
BRD-K12184916 | NVP-BEZ235 | mtor inhibitor|pi3k inhibitor | mtor inhibitor |
BRD-K32744045 | disulfiram | aldehyde dehydrogenase inhibitor|dna methyltransferase inhibitor|trpv agonist | aldehyde dehydrogenase inhibitor |
BRD-K33211335 | dextromethorphan | glutamate receptor antagonist|sigma receptor agonist | glutamate receptor antagonist |
BRD-A33447119 | oxfendazole | anthelmintic agent | anthelmintic |
BRD-A19195498 | trimipramine | norepinephrine reputake inhibitor|tricyclic antidepressant | norepinephrine reuptake inhibitor |
BRD-K19687926 | lapatinib | egfr inhibitor|erbb2 inhibitor | egfr inhibitor |
BRD-K25433859 | maprotiline | norepinephrine reputake inhibitor|tricyclic antidepressant | norepinephrine reuptake inhibitor |
BRD-K07881437 | danusertib | aurora kinase inhibitor|growth factor receptor inhibitor | aurora kinase inhibitor |
BRD-K08206212 | entecavir | dna replication inhibitor|reverse transcriptase inhibitor | dna replication inhibitor |
BRD-K10843433 | phenylbutazone | cyclooxygenase inhibitor|prostanoid receptor antagonist | cyclooxygenase inhibitor |
BRD-K10670311 | sulfasalazine | antirheumatic drug|nfkb pathway inhibitor | antirheumatic |
BRD-K80738081 | resveratrol | cytochrome p450 inhibitor|sirt activator | cytochrome p450 inhibitor |
BRD-K18895904 | olanzapine | dopamine receptor antagonist|serotonin receptor antagonist | dopamine receptor antagonist |
BRD-A18917088 | estradiol | estrogen receptor agonist | contraceptive agent |
BRD-K19462402 | buflomedil | adrenergic receptor antagonist|calcium channel blocker | adrenergic receptor antagonist |
BRD-A21858158 | praziquantel | anthelmintic agent | anthelmintic |
BRD-A24191444 | ifenprodil | adrenergic receptor antagonist|glutamate receptor antagonist | adrenergic receptor antagonist |
BRD-K23984367 | sorafenib | flt3 inhibitor|kit inhibitor|pdgfr tyrosine kinase receptor inhibitor|raf inhibitor|ret tyrosine kinase inhibitor|vegfr inhibitor | flt3 inhibitor |
BRD-K24576554 | AT-9283 | aurora kinase inhibitor|jak inhibitor | abl inhibitor |
BRD-A42759514 | ornidazole | antiprotozoal agent | antiprotozoal |
BRD-K44442813 | pidotimod | interferon receptor agonist|interleukin receptor agonist | interferon receptor agonist |
BRD-A44448661 | pentobarbital | barbiturate antiepileptic|gaba receptor modulator | barbiturate antiepileptic |
BRD-K51350053 | toremifene | estrogen receptor antagonist|selective estrogen receptor modulator (serm) | estrogen receptor antagonist |
BRD-A51382177 | fosinopril | angiotensin converting enzyme inhibitor | ace inhibitor |
BRD-K52989797 | clomipramine | serotonin transporter (sert) inhibitor | serotonin transporter inhibitor (sert) |
BRD-K53737926 | amitriptyline | norepinephrine inhibitor|norepinephrine reuptake inhibitor|serotonin receptor antagonist|serotonin reuptake inhibitor | norepinephrine inhibitor |
BRD-K54759182 | dosulepin | norepinephrine reuptake inhibitor|serotonin reuptake inhibitor|tricyclic antidepressant | norepinephrine reuptake inhibitor |
BRD-A56359832 | zileuton | leukotriene inhibitor|lipoxygenase inhibitor | leukotriene inhibitor |
BRD-K67868012 | PI-103 | mtor inhibitor|pi3k inhibitor | mtor inhibitor |
BRD-K51033547 | tramadol | norepinephrine reputake inhibitor|opioid receptor agonist|serotonin reuptake inhibitor | norepinephrine reuptake inhibitor |
BRD-A68009927 | daunorubicin | rna synthesis inhibitor|topoisomerase inhibitor | rna synthesis inhibitor |
BRD-K35960502 | niclosamide | dna replication inhibitor|stat inhibitor | dna replication inhibitor |
BRD-K36927236 | glyburide | atp channel blocker|insulin secretagogue|sulfonylurea | sulfonylurea |
BRD-K92731339 | perindopril | angiotensin converting enzyme inhibitor | ace inhibitor |
BRD-A89585551 | mefloquine | adenosine receptor antagonist|hemoglobin antagonist | adenosine receptor antagonist |
BRD-K90789829 | nefazodone | adrenergic inhibitor|norepinephrine reuptake inhibitor|serotonin receptor antagonist|serotonin reuptake inhibitor | adrenergic inhibitor |
BRD-K31342827 | GF109203X | pkc inhibitor | cdk inhibitor |
BRD-K28428262 | brivanib | fgfr inhibitor|vegfr inhibitor | fgfr inhibitor |
BRD-K32107296 | temozolomide | dna alkylating drug | dna alkylating agent |
BRD-A34255068 | rolipram | phosphodiesterase inhibitor | interleukin receptor antagonist |
BRD-A13084692 | troglitazone | insulin sensitizer|ppar receptor agonist | insulin sensitizer |
BRD-K13646352 | midostaurin | flt3 inhibitor|kit inhibitor|pkc inhibitor | flt3 inhibitor |
BRD-K13819402 | desoxypeganine | acetylcholinesterase inhibitor|monoamine oxidase inhibitor | acetylcholinesterase inhibitor |
BRD-K01638814 | rilmenidine | adrenergic receptor agonist|imidazoline receptor agonist | adrenergic receptor agonist |
BRD-K06902185 | minoxidil | katp activator|kir6 channel (katp) activator|vasodilator | katp activator |
BRD-K01815685 | indole-3-carbinol | aryl hydrocarbon receptor agonist|indoleamine 2,3-dioxygenase inhibitor | aryl hydrocarbon receptor agonist |
BRD-K02265150 | amoxapine | norepinephrine reputake inhibitor | norepinephrine reuptake inhibitor |
BRD-A02367930 | ethinyl-estradiol | estrogen receptor agonist|estrogenic component in oral contraceptives | dna directed dna polymerase stimulant |
BRD-K02404261 | caffeine | adenosine receptor antagonist|phosphodiesterase inhibitor | adenosine receptor antagonist |
BRD-A03216249 | mepivacaine | potassium channel blocker|sodium channel blocker | potassium channel blocker |
BRD-K03384561 | roquinimex | angiogenesis inhibitor|tnf production inhibitor | angiogenesis inhibitor |
BRD-K37289225 | clozapine | dopamine receptor antagonist|serotonin receptor antagonist | dopamine receptor antagonist |
BRD-K37846922 | 3,3'-diindolylmethane | chk inhibitor|cytochrome p450 activator|indoleamine 2,3-dioxygenase inhibitor | chk inhibitor |
BRD-K38436528 | imipramine | norepinephrine reputake inhibitor|serotonin reuptake inhibitor | norepinephrine reuptake inhibitor |
BRD-K41260949 | divalproex-sodium | gaba receptor agonist | hdac inhibitor |
BRD-A43082555 | loxoprofen | cyclooxygenase inhibitor|prostanoid receptor antagonist | cyclooxygenase inhibitor |
BRD-A44090213 | indoprofen | cyclooxygenase inhibitor|prostanoid receptor antagonist | cyclooxygenase inhibitor |
BRD-A43882281 | pinacidil | atp channel activator|potassium channel activator | atp channel activator |
BRD-K55966568 | orantinib | fgfr inhibitor|pdgfr tyrosine kinase receptor inhibitor|vegfr inhibitor | fgfr inhibitor |
BRD-A58955223 | sulforaphane | anticancer agent|aryl hydrocarbon receptor antagonist | antineoplastic |
BRD-A39390670 | rabeprazole | atpase inhibitor|gastrin inhibitor | atpase inhibitor |
BRD-K41170226 | deoxycholic-acid | biliverdin reductase a activator|g protein-coupled receptor agonist | biliverdin reductase a activator |
BRD-A55913614 | primaquine | antimalarial agent|dna inhibitor | antimalarial |
BRD-K57545991 | enalapril | angiotensin converting enzyme inhibitor | ace inhibitor |
BRD-K59273480 | propentofylline | adenosine reuptake inhibitor|phosphodiesterase inhibitor | adenosine reuptake inhibitor |
BRD-K59369769 | tozasertib | aurora kinase inhibitor|bcr-abl kinase inhibitor|flt3 inhibitor|jak inhibitor | aurora kinase inhibitor |
BRD-K60237333 | niacin | nad precursor with lipid lowering effects|vitamin b | nad precursor with lipid lowering effects |
BRD-K51677086 | erythromycin-ethylsuccinate | cytochrome p450 inhibitor|protein synthesis inhibitor | cytochrome p450 inhibitor |
BRD-A51714012 | venlafaxine | adrenergic inhibitor|norepinephrine reuptake inhibitor|serotonin reuptake inhibitor | adrenergic inhibitor |
BRD-K53318339 | vinpocetine | phosphodiesterase inhibitor|sodium channel blocker | phosphodiesterase inhibitor |
BRD-K53857191 | risperidone | dopamine receptor antagonist|serotonin receptor antagonist | dopamine receptor antagonist |
BRD-K54416256 | methimazole | antithyroid agent | antithyroid |
BRD-K07572174 | curcumin | cyclooxygenase inhibitor|histone acetyltransferase inhibitor|lipoxygenase inhibitor|nfkb pathway inhibitor | cyclooxygenase inhibitor |
BRD-A48430263 | pioglitazone | insulin sensitizer|ppar receptor agonist | insulin sensitizer |
BRD-A50311610 | meclizine | constitutive androstane receptor (car) agonist | car agonist |
BRD-A64977602 | mirtazapine | adrenergic receptor antagonist|serotonin receptor antagonist | adrenergic receptor antagonist |
BRD-K63828191 | raloxifene | estrogen receptor antagonist|selective estrogen receptor modulator (serm) | estrogen receptor antagonist |
BRD-K67277431 | picotamide | thromboxane receptor antagonist|thromboxane synthase inhibitor | thromboxane receptor antagonist |
BRD-A48237631 | mitomycin-c | dna alkylating agent|dna synthesis inhibitor | dna alkylating agent |
BRD-K48300629 | zonisamide | sodium channel blocker|t-type calcium channel blocker | sodium channel blocker |
BRD-K49328571 | dasatinib | bcr-abl kinase inhibitor|ephrin inhibitor|kit inhibitor|pdgfr tyrosine kinase receptor inhibitor|src inhibitor|tyrosine kinase inhibitor | bcr-abl kinase inhibitor |
BRD-K49865102 | PD-0325901 | mek inhibitor | map kinase inhibitor |
BRD-K50422030 | clomethiazole | gaba receptor antagonist|gaba receptor modulator | gaba receptor antagonist |
BRD-A50675702 | fipronil | chloride channel blocker|gaba gated chloride channel blocker | chloride channel blocker |
BRD-K50398167 | meclofenamic-acid | cyclooxygenase inhibitor|prostanoid receptor antagonist | cyclooxygenase inhibitor |
BRD-A83081521 | finasteride | 5 alpha reductase inhibitor | 5-alpha reductase inhibitor |
BRD-K86930074 | cediranib | kit inhibitor|vegfr inhibitor | kit inhibitor |
BRD-A45889380 | mepacrine | cytokine production inhibitor|nfkb pathway inhibitor|tp53 activator | cytokine production inhibitor |
BRD-K47869605 | podophyllotoxin | microtubule inhibitor|tubulin inhibitor | microtubule inhibitor |
BRD-K81528515 | nilotinib | abl kinase inhibitor|bcr-abl kinase inhibitor | abl inhibitor |
BRD-K81473089 | tacrine | acetylcholinesterase inhibitor | acetylcholine release stimulant |
BRD-K76908866 | CP-724714 | egfr inhibitor|receptor tyrosine protein kinase inhibitor | egfr inhibitor |
BRD-K68103045 | CGS-20625 | benzodiazepine receptor agonist|gaba benzodiazepine site receptor partial agonist | benzodiazepine receptor agonist |
BRD-K70358946 | aripiprazole | serotonin receptor agonist|serotonin receptor antagonist | serotonin receptor agonist |
BRD-K70778732 | trazodone | adrenergic receptor antagonist|serotonin receptor antagonist|serotonin reuptake inhibitor | adrenergic receptor antagonist |
BRD-K72222507 | quinapril | angiotensin converting enzyme inhibitor | ace inhibitor |
BRD-K68488863 | ENMD-2076 | aurora kinase inhibitor|flt3 inhibitor|vegfr inhibitor | aurora kinase inhibitor |
BRD-K70557564 | zosuquidar | p glycoprotein inhibitor | p-glycoprotein inhibitor |
BRD-K71035033 | masitinib | kit inhibitor|pdgfr tyrosine kinase receptor inhibitor|src inhibitor | kit inhibitor |
BRD-A68281735 | REV-5901 | leukotriene receptor antagonist|lipoxygenase inhibitor | leukotriene receptor antagonist |
BRD-K68867920 | quetiapine | dopamine receptor antagonist|serotonin receptor antagonist | dopamine receptor antagonist |
BRD-K35189033 | levonorgestrel | estrogen receptor agonist|glucocorticoid receptor antagonist|progesterone receptor agonist|progesterone receptor antagonist | estrogen receptor agonist |
BRD-A70083328 | secnidazole | acetylcholinesterase inhibitor|microtubule inhibitor | acetylcholinesterase inhibitor |
BRD-K71103788 | duloxetine | norepinephrine reuptake inhibitor|serotonin reuptake inhibitor | norepinephrine reuptake inhibitor |
BRD-K74305673 | IKK-2-inhibitor-V | ikk inhibitor|nfkb pathway inhibitor | ikk inhibitor |
BRD-A75172220 | hydrocortisone | glucocorticoid receptor agonist | corticosteroid agonist |
BRD-K78692225 | leflunomide | dihydroorotate dehydrogenase inhibitor|pdgfr tyrosine kinase receptor inhibitor | dihydroorotate dehydrogenase inhibitor |
BRD-K78431006 | crizotinib | alk tyrosine kinase receptor inhibitor | alk inhibitor |
BRD-K91301684 | noscapine | bradykinin receptor antagonist|tubulin inhibitor | bradykinin receptor antagonist |
BRD-K74514084 | pazopanib | kit inhibitor|pdgfr tyrosine kinase receptor inhibitor|vegfr inhibitor | kit inhibitor |
BRD-A75479906 | rimantadine | antiviral|rna synthesis inhibitor | antiviral |
BRD-K75641298 | metoclopramide | dopamine receptor antagonist|serotonin receptor antagonist | dopamine receptor antagonist |
BRD-K78126613 | menadione | mitochondrial dna polymerase inhibitor|phosphatase inhibitor | mitochondrial dna polymerase inhibitor |
BRD-K79131256 | albendazole | tubulin inhibitor | anthelmintic |
BRD-K89348303 | ramipril | angiotensin converting enzyme inhibitor | ace inhibitor |
BRD-K89162000 | tandutinib | flt3 inhibitor|kit inhibitor|pdgfr tyrosine kinase receptor inhibitor | flt3 inhibitor |
BRD-K91315211 | betahistine | histamine receptor agonist|histamine receptor antagonist | histamine receptor agonist |
BRD-K91601245 | mercaptopurine | immunosuppressant|protein synthesis inhibitor|purine antagonist | immunosuppressant |
BRD-A91699651 | chloroquine | antimalarial agent | antimalarial |
BRD-K99749624 | linifanib | pdgfr tyrosine kinase receptor inhibitor|vegfr inhibitor | pdgfr receptor inhibitor |
BRD-K92723993 | imatinib | bcr-abl kinase inhibitor|kit inhibitor|pdgfr tyrosine kinase receptor inhibitor | bcr-abl kinase inhibitor |
BRD-K93880783 | stavudine | dna directed dna polymerase inhibitor|reverse transcriptase inhibitor | dna directed dna polymerase inhibitor |
BRD-K95763993 | trapidil | pdgfr tyrosine kinase receptor inhibitor | pdgfr receptor inhibitor |
BRD-K96319534 | phentermine | dopamine uptake inhibitor|serotonin reuptake inhibitor | dopamine uptake inhibitor |
BRD-K93034159 | cladribine | adenosine deaminase inhibitor|ribonucleoside reductase inhibitor | adenosine deaminase inhibitor |
BRD-K92428153 | mycophenolate-mofetil | dehydrogenase inhibitor|inositol monophosphatase inhibitor | dehydrogenase inhibitor |
BRD-A92537424 | danazol | estrogen receptor antagonist|progesterone receptor agonist | estrogen receptor antagonist |
BRD-K92984783 | melperone | dopamine receptor antagonist|serotonin receptor antagonist | dopamine receptor antagonist |
BRD-K93460210 | lamotrigine | serotonin receptor antagonist|sodium channel blocker | serotonin receptor antagonist |
BRD-A93424738 | dexamethasone-acetate | glucocorticoid receptor agonist | corticosteroid agonist |
BRD-K93754473 | tamoxifen | estrogen receptor antagonist|selective estrogen receptor modulator (serm) | estrogen receptor antagonist |
BRD-K94830329 | ataluren | cftr channel agonist|dystrophin stimulant | cftr channel agonist |
BRD-K63750851 | mycophenolic-acid | dehydrogenase inhibitor|inositol monophosphatase inhibitor | dehydrogenase inhibitor |
BRD-K97440753 | dihydroergocristine | adrenergic receptor antagonist|prolactin inhibitor | adrenergic receptor antagonist |
BRD-A97437073 | rosiglitazone | insulin sensitizer|ppar receptor agonist | insulin sensitizer |
Interesting, it looks like moa.y (column moa in cell_painting_dataset_cmap_annotations_moa.csv) is truncated by | and also remapped based on some dictionary somewhere.
Recode | Example |
---|---|
by pipe | dopamine receptor antagonist <pipe> serotonin receptor antagonist becomes dopamine receptor antagonist |
some internal dictionary | glucocorticoid receptor agonist becomes corticosteroid agonist |
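The "by pipe" recode can be sketched in a few lines of base R. The helper name first_moa is made up for illustration. The "internal dictionary" recode is not reproduced, because the thread does not say what that dictionary contains.

```r
# Keep only the text before the first pipe in each annotation string.
# Strings without a pipe are returned unchanged.
first_moa <- function(moa) {
  sub("\\|.*$", "", moa)
}

moa <- c(
  "dopamine receptor antagonist|serotonin receptor antagonist",
  "glucocorticoid receptor agonist"
)
truncated <- first_moa(moa)
print(truncated)
```

This reproduces truncation only. A row such as "glucocorticoid receptor agonist" becoming "corticosteroid agonist" would need the unknown mapping.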
Before we decide on a master reference list, it would be good to understand some decisions that went into compiling this list. A couple of questions below:
[...] /cmap/projects/M1/annotation/pert_info_a2.txt?
[...] "|" loses equivalent information as recoding by "some internal dictionary"?
[...] | meaningful? i.e. is it safe to assume that the first one selected is the most representative?
[...] | list meaningful?
[...] serotonin receptor antagonist entry in norepinephrine inhibitor|norepinephrine reuptake inhibitor|serotonin receptor antagonist|serotonin reuptake inhibitor different than the serotonin receptor antagonist entry in dopamine receptor antagonist|serotonin receptor antagonist?
My vote would be to include all information, rather than recoding. Or, possibly, do both. We can include a column moa_recode_simple or something like that, provide instructions on how the column was formed, and include both columns. An additional thing to consider depends on the meaning of the pipes (|). We could split on the pipes and duplicate compound rows to make the table longer.
e.g.
Compound | moa | moa_recode_simple |
---|---|---|
Magic Compound A | dopamine receptor antagonist <pipe> serotonin receptor antagonist | dopamine receptor antagonist |
becomes:
Compound | moa | moa_recode_simple | moa_expand |
---|---|---|---|
Magic Compound A | dopamine receptor antagonist <pipe> serotonin receptor antagonist | dopamine receptor antagonist | dopamine receptor antagonist |
Magic Compound A | dopamine receptor antagonist <pipe> serotonin receptor antagonist | dopamine receptor antagonist | serotonin receptor antagonist |
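The expansion in the table above can be sketched in base R as follows. The column names follow the table; the data frame itself is the made-up "Magic Compound A" example, and the code assumes the pipe is a plain delimiter.

```r
compounds <- data.frame(
  Compound = "Magic Compound A",
  moa = "dopamine receptor antagonist|serotonin receptor antagonist",
  stringsAsFactors = FALSE
)

# First entry only, as in the "by pipe" recode.
compounds$moa_recode_simple <- sub("\\|.*$", "", compounds$moa)

# Split each annotation on the literal pipe; one list element per compound.
moa_parts <- strsplit(compounds$moa, "|", fixed = TRUE)

# Repeat each row once per MOA, then attach the individual MOAs.
expanded <- compounds[rep(seq_len(nrow(compounds)), lengths(moa_parts)), ]
expanded$moa_expand <- unlist(moa_parts)
rownames(expanded) <- NULL

print(expanded)
```

Keeping the original moa column next to moa_expand preserves the full annotation on every duplicated row, so no information is lost by the split. In a tidyverse workflow, tidyr::separate_rows() performs a similar split.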
These things have probably already been thought deeply about. If a decision has already been made, then please disregard these musings! It will still be very good to zero in on a master list, so that many downstream projects can benefit.
Noting an important distinction here after zooming with @shntnu - the pipes (|) are just standard delimiters. They most likely just mean that the specific compound has all of the listed MOAs. It is also likely that the order between pipes is not relevant.
Although something else to possibly consider is strength of assignment - i.e. how confident are we that the specific compound is annotated with a specific moa? (maybe this is too complicated, and not really necessary though)
I am set to meet with the CLUE team over virtual office hours at 1pm tomorrow https://clue.io/office-hours
All this comes down to resolving a question (Greg does not have access) I had back in Oct 2016: how do we explain the differences between the annotations that Steven Corsello gave us (cell_painting_dataset_compound_list_anot.tsv) (*) and those that were generated by querying api.clue.io (cell_painting_dataset_moa.csv)?
I think the best path forward is to get the latest MOA annotations from CMap (and just double check that they have been "approved" by Steven Corsello) @gwaygenomics
(*) -- Steven's note when he sent me this file: Steven Corsello, Aug 31st, 2016 at 9:39 AM: I joined in the compound names and MoAs. This required some manual curation as a few of the perts were not actually in the REP library (some were added to the L1000 plates for other reasons). Looks like we have a few MoAs with multiple compounds. See you in a bit!
@gwaygenomics note that I have updated my comments above https://github.com/broadinstitute/lincs-cell-painting/issues/5#issuecomment-599520284 and https://github.com/broadinstitute/lincs-cell-painting/issues/5#issuecomment-599523129
Thanks @shntnu - this additional context is helpful.
I had a productive conversation with @tnat1031 at CLUE office hours this afternoon. I will summarize our conversation below (most of this is probably obvious 😄):
[...] repurposing_samples and repurposing_drugs. The samples file contains extended Broad IDs (containing batch and aliquot information) and the drugs file contains MOAs and targets (!) (👈 that is awesome!)
[...] | delimiter in the MOA file is important (it is not clear if this is also true for the target column)
I think the next steps are as follows:
[...] metadata/
[...] metadata/ folder README (pasted below)
DO I NEED TO REGISTER TO ACCESS INFORMATION FROM THE REPURPOSING HUB? No, the annotations provided in the hub are freely available for research use by any organization. The information in the Repurposing Hub may not be repackaged or redistributed for commercial purposes without permission.
@gwaygenomics This notebook is not related to the MOA question but will be pertinent https://rpubs.com/shantanu/repurp-annotations which is the same as this notebook
A relevant conversation thread:
Forwarded Conversation Subject: Repurposing annotations – duplicates?
From: Shantanu Singh
Hi Steven, I hope you are well. JT and I were using the repurposing collection
annotations for one of our projects, and I noticed that a small handful of
drugs have duplicate entries (same pert_ids, but different drug names, and
their MOA and target annotations don't always match). Here's a
summary: http://rpubs.com/shantanu/repurp-annotations
Do you have any advice on how to handle this? Is it reasonable to merge all
the annotations for a pert_id, as I have done in the notebook? Thanks, Shantanu
From: Steven Corsello
Hi Shantanu, Thanks for looking into this. Most of these issues are related to
incorrect structure/stereochemistry (or a mixture treated as a salt form) for
a registered compound. Some have been fixed (by Josh, cc'd) but we have not
generated a fresh export of the compounds IDs from CBIP for one year. The core
Broad ID changes when the structure is curated. We're planning a database
update for this fall that will also add ~800 new compounds to the library. For
now, I suggest trusting our assigned name and metadata over the structure/core
ID alone for these compounds. Josh, could you please check to see if all of
these structures are fixed? Best, Steven
From: Joshua Sacher
Hi Shantanu, Thanks for checking through the data set and providing the
relevant code. This is actually a surprisingly small list compared to what
needs to be done, so I'm happy to go through it and update any data you need
by hand. The more interesting/difficult list is when 1 name has more than 1
Core ID (see Actinomycin-d, for instance). There will be ~400
curations/changes in the next update to begin to address this. What
information would be helpful for your experiments? The correct Broad ID for
each of these names? Josh
From: Shantanu Singh
Hi Steven and Josh, Thanks for getting back to me so quickly. The
inconsistencies that I found are those where the 1 core ID has multiple names
(as well as different MOAs). I had initially assumed a different explanation
– that the multiple names are in fact synonyms of sorts, and that the
differences in the MOA annotations were just incomplete annotation. But it
looks like these are actually ambiguous mappings i.e. totally different drugs
that have the same core ID (a.k.a. pert_id), and that's what I'm hoping to
resolve. I have attached these cases (duplicate_annotations.csv; same as the
second table in the notebook). Within this set, I'm looking to resolve
annotations of 7 compounds that were tested in Cell Painting
(duplicate_pert_ids_and_broad_ids_cell_painting.csv; broad ids are included),
and the question here is: which compounds do these correspond to, given the
multiple names? In all, this is a pretty tiny problem: just 7 out of 1500+
compounds we have tested in Cell Painting have this issue. So it's really not a
big deal for us to drop those 7 in our analysis, but please keep me posted
when you fix it. Thanks again!
Shantanu
Notes for myself:
cell_painting_compounds %>%
  inner_join(duplicate_pert_ids) %>%
  select(-n) %>%
  inner_join(annotated_compound_list %>% select(pert_id, broad_id) %>% distinct()) %>%
  write_csv("duplicate_pert_ids_and_broad_ids_cell_painting.csv")
duplicate_annotations.csv.txt duplicate_pert_ids_and_broad_ids_cell_painting.csv.txt
The first MOA listed has more literature support than the other moas. The other MOAs are "less supported"
Here's a list of compounds with multiple MOAs. It doesn't appear that this ^ is true because the MOAs are sorted lexicographically.
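library(tidyverse)

# Repurposing Hub drug annotations; lines starting with "!" are skipped as comments
repurposing_annotations <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_drugs_20180907.txt", comment = "!")

# Compounds with more than one MOA, most MOAs first
repurposing_annotations %>%
  mutate(n_moas = 1 + str_count(moa, "\\|")) %>%
  filter(str_detect(moa, "\\|")) %>%
  arrange(desc(n_moas)) %>%
  select(pert_iname, moa, n_moas) %>%
  # Swap pipes for semicolons so they do not break the markdown table
  mutate(moa = str_replace_all(moa, "\\|", ";")) %>%
  knitr::kable() %>%
  write_lines("~/Desktop/moa.txt")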
I get the same order via the API, e.g. https://api.clue.io/api/rep_drug_moas/?filter={%22where%22:{%22pert_iname%22:%22BMS-777607%22},%22fields%22:{%22name%22:true}}&user_key=dd910094ca286fc4ae3b174ae0ca70b1
[{"name":"AXL kinase inhibitor"},{"name":"c-Met inhibitor"},{"name":"FLT3 inhibitor"},{"name":"hepatocyte growth factor receptor inhibitor"},{"name":"macrophage migration inhibiting factor inhibitor"},{"name":"tyrosine kinase inhibitor"}]
pert_iname | moa | n_moas |
---|---|---|
BMS-777607 | AXL kinase inhibitor;c-Met inhibitor;FLT3 inhibitor;hepatocyte growth factor receptor inhibitor;macrophage migration inhibiting factor inhibitor;tyrosine kinase inhibitor | 6 |
dasatinib | Bcr-Abl kinase inhibitor;ephrin inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;src inhibitor;tyrosine kinase inhibitor | 6 |
regorafenib | FGFR inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;RAF inhibitor;RET tyrosine kinase inhibitor;VEGFR inhibitor | 6 |
sorafenib | FLT3 inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;RAF inhibitor;RET tyrosine kinase inhibitor;VEGFR inhibitor | 6 |
amuvatinib | FLT3 inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;RAD51 inhibitor;RET tyrosine kinase inhibitor | 5 |
dovitinib | EGFR inhibitor;FGFR inhibitor;FLT3 inhibitor;PDGFR tyrosine kinase receptor inhibitor;VEGFR inhibitor | 5 |
LY294002 | mTOR inhibitor;PI3K inhibitor;DNA dependent protein kinase inhibitor;phosphodiesterase inhibitor;PLK inhibitor | 5 |
sunitinib | FLT3 inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;RET tyrosine kinase inhibitor;VEGFR inhibitor | 5 |
amitriptyline | norepinephrine inhibitor;norepinephrine reuptake inhibitor;serotonin receptor antagonist;serotonin–norepinephrine reuptake inhibitor (SNRI) | 4 |
caffeic-acid | HIV integrase inhibitor;lipoxygenase inhibitor;nitric oxide production inhibitor;tumor necrosis factor production inhibitor | 4 |
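The lexicographic-order observation can be checked per row with a small base R helper. The function name moas_are_sorted is made up for illustration. The separator defaults to ";" because the table above replaced the pipes with semicolons, and the comparison is case-insensitive, which is an assumption about how the hub orders entries.

```r
# TRUE when the delimited entries are already in case-insensitive alphabetical
# order. method = "radix" sorts in the C locale, so the result does not depend
# on the session's collation settings.
moas_are_sorted <- function(moa, sep = ";") {
  parts <- tolower(strsplit(moa, sep, fixed = TRUE)[[1]])
  identical(parts, sort(parts, method = "radix"))
}

# Two rows copied from the table above.
examples <- c(
  sorafenib = "FLT3 inhibitor;KIT inhibitor;PDGFR tyrosine kinase receptor inhibitor;RAF inhibitor;RET tyrosine kinase inhibitor;VEGFR inhibitor",
  LY294002 = "mTOR inhibitor;PI3K inhibitor;DNA dependent protein kinase inhibitor;phosphodiesterase inhibitor;PLK inhibitor"
)
sorted_flags <- vapply(examples, moas_are_sorted, logical(1))
print(sorted_flags)
```

The sorafenib row passes the check, while the LY294002 row does not, so a check like this is a quick way to find rows that deviate from the alphabetical pattern.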
Clarifying note from Ted:
Yes, I think in the REP hub the MoAs are lexicographically ordered. When I spoke to Greg on Tuesday, he was also using a file of MoA annotations that he (or maybe you) had gotten from CMap a while back, and in that file we had ordered them according to how much literature support each had. If it helps, I think if there's any discrepancy between the MoAs for a given compound, the REP hub data should be considered the authority.
@shntnu - in #7 I begin processing the CLUE repurposing data. One additional question: it looks like the pert_id column is not provided. Is this something I should add? I imagine it would simply be truncating the broad_id column (e.g. BRD-K89787693-001-01-1 becomes BRD-K89787693).
I can then also create a third file (something like repurposing_info_simple.tsv) that only has unique columns pert_id, pert_iname, moa, and target. This file would be useful because, presumably, pert_id is the primary mapping between the CLUE resources and the Cell Painting profiles. cc @niranjchandrasekaran
@shntnu - in #7 I begin processing the CLUE repurposing data. One additional question: it looks like the pert_id column is not provided. Is this something I should add? I imagine it would simply be truncating the broad_id column (e.g. BRD-K89787693-001-01-1 becomes BRD-K89787693).
Yes, the first two components of the broad_id are the pert_id (defined here: "A unique identifier for a perturbagen that refers to the perturbagen in general, not to any particular batch or sample.")
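A minimal sketch of the truncation discussed here, keeping only the first two hyphen-separated components of a Broad ID. The function name `broad_id_to_pert_id` is hypothetical, and the code assumes every ID follows the `BRD-XXXXXXXXX-...` layout of the example:

```python
def broad_id_to_pert_id(broad_id: str) -> str:
    """Drop batch/sample suffixes, keeping the first two components.

    Assumes hyphen-separated IDs such as BRD-K89787693-001-01-1.
    """
    parts = broad_id.split("-")
    if len(parts) < 2:
        raise ValueError(f"unexpected broad_id format: {broad_id!r}")
    return "-".join(parts[:2])


pert_id = broad_id_to_pert_id("BRD-K89787693-001-01-1")
print(pert_id)  # BRD-K89787693
```

An ID that is already truncated passes through unchanged, so the helper can be applied to a column that mixes both forms.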
I can then also create a third file (something like repurposing_info_simple.tsv) that only has unique columns pert_id, pert_iname, moa, and target. This file would be useful because, presumably, pert_id is the primary mapping between the CLUE resources and the Cell Painting profiles. cc @niranjchandrasekaran
This seems reasonable. Note that with JUMP CP, we ended up with a proliferation of such derived files (but that was because we were dealing with many sources), and it can get confusing after a while (ideally, we'd design it like a relational database). But this will likely be the only derived metadata file, so that's ok.
the perturbation metadata file is added in #7
Based on discussions with @shntnu, @niranjchandrasekaran, @jrsacher, and @tnat1031, we've settled on repurposing hub metadata resources and decided to add 3 perturbation metadata files to the lincs_cell_painting repo.
Thanks everyone for the speedy contributions and comprehensive documentation!
Below is a description of the three resulting output files after processing. I will close this issue now that it is addressed in #7
Filename | Description | Derivation | Dimensions |
---|---|---|---|
repurposing_info.tsv | The primary file storing every column in repurposing hub drugs and samples metadata file | Inner join on drugs and samples pert_iname column | 13,553 x 17 |
repurposing_info_long.tsv | Stores every unique perturbation with singular values in target and moa columns | Split aforementioned columns by pipe (e.g. entry ACT1<pipe>ACT3 becomes two independent rows) | 39,471 x 20 |
repurposing_simple.tsv | Core identifiers and biological metadata | Splits off aliquot and batch information from Broad ID to create a pert_id and drops duplicate entries in target, moa, and pert_iname columns | 6,806 x 4 |
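
A minimal sketch of the pipe-splitting step described for repurposing_info_long.tsv, using only the Python standard library. The helper name `split_column` and the toy record are hypothetical, and the actual processing in #7 may differ (for example, in how the target and moa columns are split together):

```python
def split_column(rows: list[dict], column: str, sep: str = "|") -> list[dict]:
    """Expand each row into one row per separator-delimited value in `column`.

    All other columns are copied unchanged into every resulting row.
    """
    long_rows = []
    for row in rows:
        for value in row[column].split(sep):
            long_rows.append({**row, column: value})
    return long_rows


# Toy record (hypothetical compound name) using the ACT1|ACT3 example above.
rows = [{"pert_iname": "example-compound", "target": "ACT1|ACT3"}]

long_rows = split_column(rows, "target")
for row in long_rows:
    print(row)
```

Applying the same helper to the moa column afterwards would yield one row per target and MoA combination, which is one possible reading of why the long file has many more rows than the primary file.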
Note, steps 5 and 6 outlined in https://github.com/broadinstitute/lincs-cell-painting/issues/5#issuecomment-600208689 still need to be addressed:
- Add in a jupyter notebook to explore the data - ask questions like (how many compounds per MOA, how many targets, basically any interesting metadata question)
- Settle on a processing pipeline to get the cell painting profiles online! (to be described in a separate issue)
I will document these steps in a different issue and PR
We have additional information for each compound assayed in the Drug Repurposing Hub Cell Painting Dataset.
There are at least four files on AWS that could each serve as a reference for compound metadata.
Below I summarize each of the files:
pert_info.txt
pert_iname_moa.txt
pert_id_to_iname.txt
pert_iname_moa_aggregated.txt
Confirmed pert_info.txt subset