What to do with pert_ids with conflicting information? [RESOLVED]

gwaybio commented 4 years ago

I am following up #7 with an additional notebook to create a simple, basic mapping file with only a handful of columns. This includes creating a pert_id column, which is a 13-character subset of the full 22-character broad_id column. The additional 9 characters contain batch info about the compound. More details about this procedure are here: https://github.com/broadinstitute/lincs-cell-painting/issues/5#issuecomment-601917321
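A minimal sketch of that truncation, assuming the pert_id is simply the first 13 characters of the broad_id. The helper name `broad_id_to_pert_id` and the batch suffix in the example are hypothetical, not taken from the notebook.

```python
PERT_ID_LENGTH = 13  # length of the pert_id prefix, per the description above


def broad_id_to_pert_id(broad_id: str) -> str:
    """Return the leading 13 characters of a broad_id (assumed to be the pert_id)."""
    if len(broad_id) < PERT_ID_LENGTH:
        raise ValueError(f"broad_id too short to contain a pert_id: {broad_id!r}")
    return broad_id[:PERT_ID_LENGTH]


# Hypothetical broad_id: the "-001-01-1" batch suffix is made up for illustration.
example_broad_id = "BRD-A03204438-001-01-1"
example_pert_id = broad_id_to_pert_id(example_broad_id)
print(example_pert_id)
```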

In generating this data, I noticed that 16 perturbations (by pert_id) contain conflicting information (by pert_iname, moa, or target). I paste all of the conflicting info below:

	pert_id	pert_iname	moa	target
0	BRD-A03204438	allopregnanolone	GABA receptor positive allosteric modulator	GABRA1	GABRA2	GABRA3	GABRA4	GABRA5	GABRA6	GABRB2	GABRG2
1	BRD-A03204438	pregnanolone	GABA receptor positive allosteric modulator	nan
2	BRD-K05674516	sofosbuvir	RNA polymerase inhibitor	nan
3	BRD-K05674516	PSI-7976	HCV inhibitor	nan
4	BRD-K17498618	betaxolol	adrenergic receptor antagonist	ADRB1	ADRB2
5	BRD-K17498618	cisatracurium	acetylcholine receptor antagonist	CHRNA2
6	BRD-K20672254	pyrantel-tartrate	acetylcholine receptor agonist	CHRNA1
7	BRD-K20672254	pyrantel-pamoate	neuromuscular blocker	nan
8	BRD-K25650355	physostigmine-salicylate	acetylcholinesterase inhibitor	nan
9	BRD-K25650355	physostigmine	cholinesterase inhibitor	ACHE	BCHE
10	BRD-K29713308	mebhydrolin	antihistamine	nan
11	BRD-K29713308	mebhydroline-1,5-naphtalenedisulfonate	nan	nan
12	BRD-K35952844	calcium-gluceptate	nan	nan
13	BRD-K35952844	sodium-glucoheptonate	nan	nan
14	BRD-K41260949	valproic-acid	HDAC inhibitor	ABAT	ACADSB	ALDH5A1	HDAC1	HDAC2	HDAC9	OGDH	SCN10A	SCN11A	SCN1A	SCN1B	SCN2A	SCN2B	SCN3A	SCN3B	SCN4A	SCN4B	SCN5A	SCN7A	SCN8A	SCN9A
15	BRD-K41260949	divalproex-sodium	benzodiazepine receptor agonist	ALDH5A1
16	BRD-K66035042	mannitol-D	diuretic	nan
17	BRD-K66035042	sorbitol	mucolytic agent	nan
18	BRD-K71013094	neomycin-sulfate	bacterial 30S ribosomal subunit inhibitor	nan
19	BRD-K71013094	neomycin	bacterial 30S ribosomal subunit inhibitor	CXCR4
20	BRD-K79450420	INCB-024360	indoleamine 2,3-dioxygenase inhibitor	IDO1
21	BRD-K79450420	epacadostat	indoleamine 2,3-dioxygenase inhibitor	IDO1
22	BRD-K87202646	isoniazid	FABI inhibitor	CYP1A2	CYP2C19	CYP2C8	CYP2E1	CYP3A4
23	BRD-K87202646	pasiniazid	cyclooxygenase inhibitor	nan
24	BRD-K93632104	salicylic-acid	cyclooxygenase inhibitor	AKR1C1	PTGS1	PTGS2
25	BRD-K93632104	sodium-salicylate	prostanoid receptor antagonist	ASIC3	PTGS1	PTGS2
26	BRD-K97799481	theophylline	adenosine receptor antagonist	nan
27	BRD-K97799481	aminophylline	adenosine receptor antagonist	ADORA1	ADORA2A	ADORA2B	ADORA3	HDAC2	PDE3A	PDE3B	PDE4A	PDE4B	PDE4C	PDE4D
28	BRD-K97799481	oxtriphylline	adenosine receptor antagonist	ADORA1	ADORA2A	ADORA2B	ADORA3	HDAC2	PDE3A	PDE4A	PDE4B	PDE5A
29	BRD-M55114534	pyrvinium	androgen receptor antagonist	nan
30	BRD-M55114534	pyrvinium-pamoate	androgen receptor antagonist	AR
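One way such conflicts could be detected is sketched below. It is an illustration, not the notebook's code: it uses only the standard library, the function name `find_conflicts` is hypothetical, and a missing value (None, standing in for nan) is treated as distinct from a present one. Four rows are copied from the table above, and one made-up record is added to show that consistent entries are not flagged.

```python
from collections import defaultdict

ANNOTATION_COLUMNS = ("pert_iname", "moa", "target")

# Rows copied from the table above; None stands in for "nan".
rows = [
    {"pert_id": "BRD-K05674516", "pert_iname": "sofosbuvir",
     "moa": "RNA polymerase inhibitor", "target": None},
    {"pert_id": "BRD-K05674516", "pert_iname": "PSI-7976",
     "moa": "HCV inhibitor", "target": None},
    {"pert_id": "BRD-K79450420", "pert_iname": "INCB-024360",
     "moa": "indoleamine 2,3-dioxygenase inhibitor", "target": "IDO1"},
    {"pert_id": "BRD-K79450420", "pert_iname": "epacadostat",
     "moa": "indoleamine 2,3-dioxygenase inhibitor", "target": "IDO1"},
    # Hypothetical, made-up record with a single consistent annotation.
    {"pert_id": "BRD-X00000000", "pert_iname": "example-compound",
     "moa": "example moa", "target": None},
]


def find_conflicts(records):
    """Map each pert_id to the annotation columns where its rows disagree."""
    seen = defaultdict(lambda: defaultdict(set))
    for record in records:
        for column in ANNOTATION_COLUMNS:
            seen[record["pert_id"]][column].add(record[column])

    conflicts = {}
    for pert_id, columns in seen.items():
        differing = [c for c in ANNOTATION_COLUMNS if len(columns[c]) > 1]
        if differing:
            conflicts[pert_id] = differing
    return conflicts


conflicts = find_conflicts(rows)
for pert_id, columns in conflicts.items():
    print(pert_id, columns)
```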

gwaybio commented 4 years ago

It looks like these 16 compounds will need to be resolved manually. I will use the following strategy to reconcile this list.

For pert_id with different pert_iname, look them up and select the most common pert_iname.
For compounds with one missing pert_iname, use the pert_iname entry with information.
Apply the same logic for compounds with different/missing targets.
Apply the same logic for compounds with different/missing moas.
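The per-column logic above could be sketched roughly as follows. This is an assumption-laden illustration: "most common" is approximated by counting the values present in the file rather than by looking the compound up, the function name `reconcile` is hypothetical, and ties are only flagged for manual review, not resolved.

```python
from collections import Counter


def reconcile(values):
    """Pick the most common non-missing value.

    Returns (value, needs_review). needs_review is True when two or more
    values tie for most common, so a manual lookup is still required.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None, False
    counts = Counter(present).most_common()
    top_value, top_count = counts[0]
    tie = len(counts) > 1 and counts[1][1] == top_count
    return top_value, tie


# Targets for BRD-M55114534 in the table above: one missing, one present.
print(reconcile([None, "AR"]))

# pert_inames for BRD-K97799481: three different names, so this is flagged.
print(reconcile(["theophylline", "aminophylline", "oxtriphylline"]))
```

A tie here only means the file alone cannot decide; the manual lookup described in the strategy still applies to those entries.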

I will document my findings in a notebook to be added in #7

cc @niranjchandrasekaran @shntnu

shntnu commented 4 years ago

> I will use the following strategy to reconcile this list.

This sounds good! Please run this plan by Josh Sacher because IIUC he maintains these lists

jrsacher commented 4 years ago

I'm surprised there are only 16! This is one of the issues we're trying to address in the next update, which should be released this week (hopefully today). What file(s) are you working from? I can send the latest and see if that helps resolve these issues.

gwaybio commented 4 years ago

Thanks for the quick reply @jrsacher!

> What file(s) are you working from?

I'm using repurposing_drugs_20180907.txt and repurposing_samples_20180907.txt, both of which I downloaded from the repurposing homepage on CLUE.

> This is one of the issues we're trying to address in the next update, which should be released this week (hopefully today). I can send the latest and see if that helps resolve these issues.

That's amazing! Very glad @shntnu connected us.

My preference is to wait until the data is officially released on CLUE before ingesting into this repo, especially since the release date is so near. Retrieving the files from CLUE in this way makes data provenance easier than if data are handed off in harder-to-track channels.

Will the data be released on the same website, with the same filename_date.txt convention?

Thanks again!

jrsacher commented 4 years ago

It will all be on the same site and in the same format.

Since I generated the txt files that will be posted, here they are: repurposing_drugs_20200322.txt repurposing_samples_20200322.txt

Let me know if you run into any issues with them. It'll be a great test to have you use them before they're available to the entire world.

gwaybio commented 4 years ago

Thanks for sending these over, @jrsacher. I ran the data through my existing pipeline. A couple of observations:

The pert_iname discrepancies in the 20180907 data that I describe in #6 are resolved.
Instead of 16 perturbations (by pert_id) containing conflicting information (in pert_iname, moa, or target), as described in this issue ☝️, there is only 1 conflict.

The duplicate pert_id is BRD-K34801930.

	pert_id	pert_iname	moa	target
0	BRD-K34801930	AZD5069	CC chemokine receptor antagonist	CXCR2
1	BRD-K34801930	SRT3190	CC chemokine receptor antagonist	NaN

jrsacher commented 4 years ago

Good catch! I really appreciate you digging through these files. It's a big help.

BRD-K34801930 is correct for SRT3190. AZD5069 is BRD-K00003297.

Here's the updated repurposing_samples file: repurposing_samples_20200322.txt

gwaybio commented 4 years ago

Awesome! Confirmed that, of the 6,806 unique pert_ids, zero are duplicated.
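A check along these lines could be written as below. It is a sketch only: the function name `duplicated_pert_ids` is hypothetical, and the two short lists stand in for the pert_id column before and after the fix discussed above.

```python
from collections import Counter


def duplicated_pert_ids(pert_ids):
    """Return the sorted pert_ids that appear more than once."""
    counts = Counter(pert_ids)
    return sorted(pert_id for pert_id, n in counts.items() if n > 1)


# Before the fix: AZD5069 and SRT3190 both carried BRD-K34801930.
before_fix = ["BRD-K34801930", "BRD-K34801930"]
# After the fix: AZD5069 is BRD-K00003297, SRT3190 keeps BRD-K34801930.
after_fix = ["BRD-K00003297", "BRD-K34801930"]

duplicates_before = duplicated_pert_ids(before_fix)
duplicates_after = duplicated_pert_ids(after_fix)
print(duplicates_before)
print(duplicates_after)
```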

gwaybio commented 4 years ago

thanks again for these resources @jrsacher - if possible, can you ping this thread once the updated resources are posted on https://clue.io/repurposing? I will continue to monitor the site.

I'd like to ultimately compare the files by md5 in addition to date. Thanks!
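For reference, an md5 comparison could look like the sketch below. The function name `md5sum` is hypothetical, and the two temporary files are placeholders written with identical made-up content; they stand in for the copy sent in this thread and the copy later posted on the site.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Placeholder files with made-up content, used only to exercise the comparison.
with tempfile.TemporaryDirectory() as tmp:
    local_copy = os.path.join(tmp, "local_copy.txt")
    posted_copy = os.path.join(tmp, "posted_copy.txt")
    for path in (local_copy, posted_copy):
        with open(path, "wb") as handle:
            handle.write(b"pert_id\tpert_iname\n")

    local_digest = md5sum(local_copy)
    files_match = local_digest == md5sum(posted_copy)

print(local_digest, files_match)
```

Matching digests would indicate the posted file is byte-for-byte identical to the one shared here, independent of the date in the filename.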

jrsacher commented 4 years ago

The repurposing site and database have been updated, @gwaygenomics. All the latest should be available there and through the API.

gwaybio commented 4 years ago

> The repurposing site and database have been updated, @gwaygenomics. All the latest should be available there and through the API.

this is amazing - thanks for the speedy turnaround on this. Closing the issue as it has now been resolved in #7

broadinstitute / lincs-cell-painting #8