Closed gwaybio closed 4 years ago
It looks like these 16 compounds will need to be resolved manually. I will use the following strategy to reconcile this list.
- For each `pert_id` with different `pert_iname`, look them up and select the most common `pert_iname`.
- For each `pert_id` with the same `pert_iname`, use the `pert_iname` entry with information.

I will document my findings in a notebook to be added in #7.
cc @niranjchandrasekaran @shntnu
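The strategy above could be sketched in Python roughly as follows. This is only an illustration, not the notebook code: the function name `reconcile`, the use of `None` for missing values, and the tie-breaking behavior are all assumptions of the sketch.

```python
from collections import Counter, defaultdict


def reconcile(rows):
    """Collapse rows sharing a pert_id into a single record per pert_id.

    rows: iterable of dicts with keys pert_id, pert_iname, moa, target.
    Missing annotations are assumed to be None (hypothetical convention).
    """
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["pert_id"]].append(row)

    resolved = {}
    for pert_id, group in by_id.items():
        # Rule 1: select the most common pert_iname for this pert_id.
        # Ties fall back to the first name encountered and would still
        # need a manual lookup.
        name = Counter(r["pert_iname"] for r in group).most_common(1)[0][0]
        candidates = [r for r in group if r["pert_iname"] == name]

        # Rule 2: among rows with that name, prefer the entry that
        # actually carries information (non-missing moa / target).
        def filled(r):
            return sum(r[key] is not None for key in ("moa", "target"))

        resolved[pert_id] = max(candidates, key=filled)
    return resolved
```

Anything the two rules cannot separate (for example, two names that occur equally often) is not truly resolved by this sketch and should be checked by hand.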
> I will use the following strategy to reconcile this list.

This sounds good! Please run this plan by Josh Sacher because IIUC he maintains these lists.
I'm surprised there are only 16! This is one of the issues we're trying to address in the next update, which should be released this week (hopefully today). What file(s) are you working from? I can send the latest and see if that helps resolve these issues.
Thanks for the quick reply @jrsacher!

> What file(s) are you working from?

I'm using `repurposing_drugs_20180907.txt` and `repurposing_samples_20180907.txt`, both of which I downloaded from the repurposing homepage on CLUE.
> This is one of the issues we're trying to address in the next update, which should be released this week (hopefully today). I can send the latest and see if that helps resolve these issues.

That's amazing! Very glad @shntnu connected us.
My preference is to wait until the data is officially released on CLUE before ingesting into this repo, especially since the release date is so near. Retrieving the files from CLUE in this way makes data provenance easier than if data are handed off in harder-to-track channels.
Will the data be released on the same website, with the same filename_date.txt convention?
Thanks again!
It will all be on the same site and in the same format.
Since I generated the `txt` files that will be posted, here they are:
repurposing_drugs_20200322.txt
repurposing_samples_20200322.txt
Let me know if you run into any issues with them. It'll be a great test to have you use them before they're available to the entire world.
thanks for sending over @jrsacher - I ran the data through my existing pipeline. A couple of observations:
- The `pert_iname` discrepancies in the 20180907 data that I describe in #6 are resolved.
- Where perturbations (`pert_id`) contain conflicting information (in `pert_iname`, `moa`, or `target`), as described in this issue ☝️, there is only 1 conflict.

The duplicate `pert_id` is `BRD-K34801930`.
| | pert_id | pert_iname | moa | target |
|---|---|---|---|---|
| 0 | BRD-K34801930 | AZD5069 | CC chemokine receptor antagonist | CXCR2 |
| 1 | BRD-K34801930 | SRT3190 | CC chemokine receptor antagonist | NaN |
Good catch! I really appreciate you digging through these files. It's a big help.
BRD-K34801930 is correct for SRT3190. AZD5069 is BRD-K00003297.
Here's the updated repurposing_samples file: repurposing_samples_20200322.txt
Awesome! Confirmed that of 6,806 unique pert_ids, zero are duplicated.
thanks again for these resources @jrsacher - if possible, can you ping this thread once the updated resources are posted on https://clue.io/repurposing? I will continue to monitor the site.
I'd like to ultimately compare the files by `md5` in addition to date. Thanks!
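A minimal sketch of what such an md5 comparison might look like, using only the Python standard library. The helper name `md5sum` and the example paths in the usage note are hypothetical; only the two file names come from this thread.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5sum("shared/repurposing_samples_20200322.txt") == md5sum("clue/repurposing_samples_20200322.txt")` would indicate whether the copy shared here and the copy later downloaded from CLUE are byte-for-byte identical (both directory names are placeholders).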
The repurposing site and database have been updated, @gwaygenomics. All the latest should be available there and through the API.
> The repurposing site and database have been updated, @gwaygenomics. All the latest should be available there and through the API.

this is amazing - thanks for the speedy turnaround on this. Closing the issue as it has now been resolved in #7
I am following up #7 with an additional notebook to create a simple, basic mapping file with only a handful of columns. This includes creating a `pert_id` column, which is a 13 character subset of the full 22 character `broad_id` column. The additional 9 characters contain batch info about the compound. More details about this procedure are here: https://github.com/broadinstitute/lincs-cell-painting/issues/5#issuecomment-601917321

In generating this data, I noticed that 16 perturbations (by `pert_id`) contain conflicting information (by `pert_iname`, `moa`, or `target`). I paste all of the conflicting info below:
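As a rough illustration of the two steps described in this comment (trimming `broad_id` down to `pert_id`, then flagging any `pert_id` whose annotations disagree), here is a minimal Python sketch. It is not the notebook's actual code: the function names and the row layout (a list of dicts) are assumptions.

```python
from collections import defaultdict

PERT_ID_LENGTH = 13  # e.g. "BRD-K34801930"; the rest of broad_id is batch info


def to_pert_id(broad_id):
    """Keep the first 13 characters of a 22 character broad_id."""
    return broad_id[:PERT_ID_LENGTH]


def find_conflicts(rows, fields=("pert_iname", "moa", "target")):
    """Return {pert_id: set of annotation tuples} for conflicting pert_ids.

    rows: iterable of dicts holding a broad_id plus the given fields.
    A pert_id is reported when it maps to more than one distinct
    combination of the annotation fields.
    """
    seen = defaultdict(set)
    for row in rows:
        annotation = tuple(row[field] for field in fields)
        seen[to_pert_id(row["broad_id"])].add(annotation)
    return {pid: vals for pid, vals in seen.items() if len(vals) > 1}
```

Note that a missing value and a filled-in value count as different here, so a `pert_id` with `target` present in one row and absent in another would also be flagged.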