broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

What to do with pert_ids with conflicting information? [RESOLVED] #8

Closed gwaybio closed 4 years ago

gwaybio commented 4 years ago

I am following up on #7 with an additional notebook that creates a simple mapping file with only a handful of columns. This includes creating a pert_id column, which is a 13-character subset of the full 22-character broad_id column. The additional 9 characters contain batch info about the compound. More details about this procedure are here: https://github.com/broadinstitute/lincs-cell-painting/issues/5#issuecomment-601917321
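
A minimal sketch of the truncation described above, assuming the pert_id is the leading 13 characters of the broad_id. The function name and the batch suffix in the example identifier are made up for illustration and are not taken from the notebook.

```python
PERT_ID_LENGTH = 13   # e.g. "BRD-K34801930"
BROAD_ID_LENGTH = 22  # pert_id plus 9 characters of batch information


def broad_id_to_pert_id(broad_id: str) -> str:
    """Return the 13-character pert_id prefix of a 22-character broad_id."""
    if len(broad_id) != BROAD_ID_LENGTH:
        raise ValueError(
            f"expected {BROAD_ID_LENGTH} characters, got {len(broad_id)}: {broad_id!r}"
        )
    return broad_id[:PERT_ID_LENGTH]


# Hypothetical broad_id: the "-001-01-1" batch suffix is invented for this example.
example_pert_id = broad_id_to_pert_id("BRD-K34801930-001-01-1")
```

Rejecting identifiers of unexpected length is a design choice in this sketch; it surfaces malformed rows early instead of silently truncating them.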

In generating this data, I noticed that 16 perturbations (by pert_id) contain conflicting information (by pert_iname, moa, or target). I paste all of the conflicting info below:

| | pert_id | pert_iname | moa | target |
|---|---|---|---|---|
| 0 | BRD-A03204438 | allopregnanolone | GABA receptor positive allosteric modulator | GABRA1 GABRA2 GABRA3 GABRA4 GABRA5 GABRA6 GABRB2 GABRG2 |
| 1 | BRD-A03204438 | pregnanolone | GABA receptor positive allosteric modulator | nan |
| 2 | BRD-K05674516 | sofosbuvir | RNA polymerase inhibitor | nan |
| 3 | BRD-K05674516 | PSI-7976 | HCV inhibitor | nan |
| 4 | BRD-K17498618 | betaxolol | adrenergic receptor antagonist | ADRB1 ADRB2 |
| 5 | BRD-K17498618 | cisatracurium | acetylcholine receptor antagonist | CHRNA2 |
| 6 | BRD-K20672254 | pyrantel-tartrate | acetylcholine receptor agonist | CHRNA1 |
| 7 | BRD-K20672254 | pyrantel-pamoate | neuromuscular blocker | nan |
| 8 | BRD-K25650355 | physostigmine-salicylate | acetylcholinesterase inhibitor | nan |
| 9 | BRD-K25650355 | physostigmine | cholinesterase inhibitor | ACHE BCHE |
| 10 | BRD-K29713308 | mebhydrolin | antihistamine | nan |
| 11 | BRD-K29713308 | mebhydroline-1,5-naphtalenedisulfonate | nan | nan |
| 12 | BRD-K35952844 | calcium-gluceptate | nan | nan |
| 13 | BRD-K35952844 | sodium-glucoheptonate | nan | nan |
| 14 | BRD-K41260949 | valproic-acid | HDAC inhibitor | ABAT ACADSB ALDH5A1 HDAC1 HDAC2 HDAC9 OGDH SCN10A SCN11A SCN1A SCN1B SCN2A SCN2B SCN3A SCN3B SCN4A SCN4B SCN5A SCN7A SCN8A SCN9A |
| 15 | BRD-K41260949 | divalproex-sodium | benzodiazepine receptor agonist | ALDH5A1 |
| 16 | BRD-K66035042 | mannitol-D | diuretic | nan |
| 17 | BRD-K66035042 | sorbitol | mucolytic agent | nan |
| 18 | BRD-K71013094 | neomycin-sulfate | bacterial 30S ribosomal subunit inhibitor | nan |
| 19 | BRD-K71013094 | neomycin | bacterial 30S ribosomal subunit inhibitor | CXCR4 |
| 20 | BRD-K79450420 | INCB-024360 | indoleamine 2,3-dioxygenase inhibitor | IDO1 |
| 21 | BRD-K79450420 | epacadostat | indoleamine 2,3-dioxygenase inhibitor | IDO1 |
| 22 | BRD-K87202646 | isoniazid | FABI inhibitor | CYP1A2 CYP2C19 CYP2C8 CYP2E1 CYP3A4 |
| 23 | BRD-K87202646 | pasiniazid | cyclooxygenase inhibitor | nan |
| 24 | BRD-K93632104 | salicylic-acid | cyclooxygenase inhibitor | AKR1C1 PTGS1 PTGS2 |
| 25 | BRD-K93632104 | sodium-salicylate | prostanoid receptor antagonist | ASIC3 PTGS1 PTGS2 |
| 26 | BRD-K97799481 | theophylline | adenosine receptor antagonist | nan |
| 27 | BRD-K97799481 | aminophylline | adenosine receptor antagonist | ADORA1 ADORA2A ADORA2B ADORA3 HDAC2 PDE3A PDE3B PDE4A PDE4B PDE4C PDE4D |
| 28 | BRD-K97799481 | oxtriphylline | adenosine receptor antagonist | ADORA1 ADORA2A ADORA2B ADORA3 HDAC2 PDE3A PDE4A PDE4B PDE5A |
| 29 | BRD-M55114534 | pyrvinium | androgen receptor antagonist | nan |
| 30 | BRD-M55114534 | pyrvinium-pamoate | androgen receptor antagonist | AR |

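
One way to surface rows like those listed above is to group by pert_id and report every annotation column that holds more than one distinct value. The sketch below uses plain Python, not whatever the notebook uses. Four rows are copied from the table above, and the `BRD-K00000000` row is a hypothetical non-conflicting compound added only for contrast. Note that it treats a missing value ("nan") versus a real value as a conflict.

```python
from collections import defaultdict

ANNOTATION_COLUMNS = ("pert_iname", "moa", "target")

rows = [
    {"pert_id": "BRD-K79450420", "pert_iname": "INCB-024360",
     "moa": "indoleamine 2,3-dioxygenase inhibitor", "target": "IDO1"},
    {"pert_id": "BRD-K79450420", "pert_iname": "epacadostat",
     "moa": "indoleamine 2,3-dioxygenase inhibitor", "target": "IDO1"},
    {"pert_id": "BRD-K71013094", "pert_iname": "neomycin-sulfate",
     "moa": "bacterial 30S ribosomal subunit inhibitor", "target": "nan"},
    {"pert_id": "BRD-K71013094", "pert_iname": "neomycin",
     "moa": "bacterial 30S ribosomal subunit inhibitor", "target": "CXCR4"},
    # Hypothetical row, not from the data: a compound with a single entry.
    {"pert_id": "BRD-K00000000", "pert_iname": "example-compound",
     "moa": "nan", "target": "nan"},
]


def find_conflicts(rows):
    """Map each pert_id to the list of columns in which its rows disagree."""
    values = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for column in ANNOTATION_COLUMNS:
            values[row["pert_id"]][column].add(row[column])

    conflicts = {}
    for pert_id, by_column in values.items():
        differing = [c for c in ANNOTATION_COLUMNS if len(by_column[c]) > 1]
        if differing:
            conflicts[pert_id] = differing
    return conflicts


conflicts = find_conflicts(rows)
```
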
gwaybio commented 4 years ago

It looks like these 16 compounds will need to be resolved manually. I will use the following strategy to reconcile this list.

  1. For pert_id with different pert_iname, look them up and select the most common pert_iname.
  2. For compounds with one missing pert_iname, use the pert_iname entry with information.
  3. Apply the same logic for compounds with different/missing targets.
  4. Apply the same logic for compounds with different/missing moas.
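
The mechanical part of this strategy (preferring the entry that has information over a missing one) can be sketched as below. This is only an illustration under stated assumptions: the function name and the `MISSING` marker set are hypothetical, and the sketch deliberately does not choose between two different non-missing values, since that step relies on looking the compounds up by hand.

```python
MISSING = {"nan", "NaN", ""}  # assumed markers for a missing annotation


def reconcile(rows, columns=("pert_iname", "moa", "target")):
    """Resolve the annotations of one pert_id where that is unambiguous.

    Returns (resolved, needs_review): `resolved` maps each column to its
    single non-missing value (or None), and `needs_review` lists columns
    with several different non-missing values that need a manual decision.
    """
    resolved = {}
    needs_review = []
    for column in columns:
        present = sorted({row[column] for row in rows if row[column] not in MISSING})
        if len(present) == 1:
            resolved[column] = present[0]  # only one entry carries information
        else:
            resolved[column] = None
            if present:  # more than one distinct value: leave for manual lookup
                needs_review.append(column)
    return resolved, needs_review


# The two BRD-K71013094 rows from the table of conflicting entries.
neomycin_rows = [
    {"pert_iname": "neomycin-sulfate",
     "moa": "bacterial 30S ribosomal subunit inhibitor", "target": "nan"},
    {"pert_iname": "neomycin",
     "moa": "bacterial 30S ribosomal subunit inhibitor", "target": "CXCR4"},
]
resolved, needs_review = reconcile(neomycin_rows)
```

For these two rows the target resolves to the one entry with information, while pert_iname is flagged for review because both names are non-missing.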

I will document my findings in a notebook to be added in #7.

cc @niranjchandrasekaran @shntnu

shntnu commented 4 years ago

> I will use the following strategy to reconcile this list.

This sounds good! Please run this plan by Josh Sacher because, IIUC, he maintains these lists.

jrsacher commented 4 years ago

I'm surprised there are only 16! This is one of the issues we're trying to address in the next update, which should be released this week (hopefully today). What file(s) are you working from? I can send the latest and see if that helps resolve these issues.

gwaybio commented 4 years ago

Thanks for the quick reply @jrsacher!

> What file(s) are you working from?

I'm using repurposing_drugs_20180907.txt and repurposing_samples_20180907.txt, both of which I downloaded from the repurposing homepage on CLUE.

> This is one of the issues we're trying to address in the next update, which should be released this week (hopefully today). I can send the latest and see if that helps resolve these issues.

That's amazing! Very glad @shntnu connected us.

My preference is to wait until the data is officially released on CLUE before ingesting into this repo, especially since the release date is so near. Retrieving the files from CLUE in this way makes data provenance easier than if data are handed off in harder-to-track channels.

Will the data be released on the same website, with the same filename_date.txt convention?

Thanks again!

jrsacher commented 4 years ago

It will all be on the same site and in the same format.

Since I generated the txt files that will be posted, here they are: repurposing_drugs_20200322.txt repurposing_samples_20200322.txt

Let me know if you run into any issues with them. It'll be a great test to have you use them before they're available to the entire world.

gwaybio commented 4 years ago

Thanks for sending these over, @jrsacher - I ran the data through my existing pipeline. A couple of observations:

The duplicate pert_id is BRD-K34801930.

| | pert_id | pert_iname | moa | target |
|---|---|---|---|---|
| 0 | BRD-K34801930 | AZD5069 | CC chemokine receptor antagonist | CXCR2 |
| 1 | BRD-K34801930 | SRT3190 | CC chemokine receptor antagonist | NaN |

jrsacher commented 4 years ago

Good catch! I really appreciate you digging through these files. It's a big help.

BRD-K34801930 is correct for SRT3190. AZD5069 is BRD-K00003297.

Here's the updated repurposing_samples file: repurposing_samples_20200322.txt

gwaybio commented 4 years ago

Awesome! Confirmed that, of 6,806 unique pert_ids, zero are duplicated.

gwaybio commented 4 years ago

thanks again for these resources @jrsacher - if possible, can you ping this thread once the updated resources are posted on https://clue.io/repurposing? I will continue to monitor the site.

I'd like to ultimately compare the files by md5 in addition to date. Thanks!
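
A minimal sketch of such an md5 comparison using Python's standard `hashlib`. The helper names are illustrative, and the file paths in the closing comment are hypothetical locations for the copy shared in this thread and a later download from CLUE.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical md5 digests."""
    return md5_of_file(path_a) == md5_of_file(path_b)


# Hypothetical usage, assuming both copies have been saved locally:
# same_content("from_thread/repurposing_samples_20200322.txt",
#              "from_clue/repurposing_samples_20200322.txt")
```

Reading in binary mode means line-ending or encoding differences between the two copies would also show up as a digest mismatch.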

jrsacher commented 4 years ago

The repurposing site and database have been updated, @gwaygenomics. All the latest should be available there and through the API.

gwaybio commented 4 years ago

> The repurposing site and database have been updated, @gwaygenomics. All the latest should be available there and through the API.

This is amazing - thanks for the speedy turnaround on this. Closing the issue, as it has now been resolved in #7.