Data4Democracy / drug-spending

Project to understand pharmaceutical spending, currently focused on US government programs.

Investigate + add data from drugbank.ca #61

Open jenniferthompson opened 7 years ago

jenniferthompson commented 7 years ago

@cduvallet made us aware of a site (drugbank.ca) that looks like it has very promising data! We need someone to

  1. Contact them to make sure it's OK if we download and process the data and include it in our data.world repo, of course giving proper citations (their TOS look promising)
  2. Determine exactly what data is available and what would be most helpful in our context
  3. Download, tidy and add that data to our data.world repo, along with a data dictionary

cduvallet commented 7 years ago

Thanks for making this issue @jenniferthompson!

cduvallet commented 7 years ago

They got back to me pretty quickly, and had some questions about data.world that I'm not sure I know the answer to:

> Looks like an interesting project, thanks for reaching out!
>
> I checked out your site and noticed a couple of issues:
>
> 1) Data.world looks like a commercial project that requires people have accounts to download data. It doesn't look like they have a good way to post the licenses for datasets? Maybe I am not understanding what data.world is.
>
> 2) I don't see a clear indication of the license for the datasets available through your website, or clear citations to the datasets there?
>
> Your use case looks like a non-commercial use case, so that should be fine but, when our data is shared it has to be shared both with a citation and the license we share our data under.
>
> We also have 2 datasets that are public domain and you can do whatever you want with them, on this page: https://www.drugbank.ca/releases/latest#open-data
>
> They include DrugBank identifiers, names, and synonyms to permit easy linking and integration into any type of project.

Is there any way we can include their license and citation on data.world? I'm pretty sure it will be more characters than are allowed in the "description" on data.world, and I'm not sure where else dataset metadata can be put on data.world (which is pretty surprising...)

Alternatively, should we just stick with the public domain data?

mattgawarecki commented 7 years ago

I would suggest reaching out to one of our contacts with data.world inside the D4D Slack group. I believe @gabriela might be a good first-line person for this. I'll ping her in the channel now, and we can continue the conversation there.


forzavitale commented 7 years ago

hi all-- first time jumping in here! at the NYC hackathon rn, seems like this issue is pretty recent and would like to start munging something.... guidance?

cduvallet commented 7 years ago

@forzavitale I pinged the DrugBank people again to ask if we could just include the license and citation info in the header of the file, since we can't assign it to the file directly via data.world. They haven't gotten back to me about that though. That said, in my opinion it should be fine so you can probably start working on the data. Let's just make sure to check back in with them before we post the data to data.world.

Alternatively, you can poke around the public domain data and see if that's enough to get us what we want!

jenniferthompson commented 7 years ago

I think that's a good plan @cduvallet - and at the speed the data.world folks move (read: blazing fast), it's entirely plausible that we might be able to assign a file-specific license by the time we're ready to post it.

cduvallet commented 7 years ago

Update, just heard back from the DrugBank people and they said that including the info in the header of the file is fine. Full speed ahead!
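For anyone picking this up, here is a minimal sketch of what "info in the header of the file" could look like for a CSV. The citation and license strings are placeholders (not DrugBank's actual wording), the function names are made up for illustration, and the `#` comment prefix is an assumption: anything reading the file has to skip those lines.

```python
import csv
import io

# Placeholder text only: substitute the real citation and license wording from DrugBank.
HEADER_LINES = [
    "Source: DrugBank (https://www.drugbank.ca)",
    "Citation: <citation text goes here>",
    "License: <license text goes here>",
]


def write_with_header(rows, fieldnames, out, header_lines=HEADER_LINES):
    """Write comment-prefixed header lines, then the rows as ordinary CSV."""
    for line in header_lines:
        out.write(f"# {line}\n")
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def read_skipping_header(inp):
    """Read the CSV back, ignoring any line that starts with '#'.

    Caveat: a data row whose first field starts with '#' would also be skipped.
    """
    data_lines = (line for line in inp if not line.startswith("#"))
    return list(csv.DictReader(data_lines))


# Round trip through an in-memory buffer with a made-up row.
buf = io.StringIO()
write_with_header([{"drugbank_id": "X1", "name": "example"}], ["drugbank_id", "name"], buf)
buf.seek(0)
rows = read_skipping_header(buf)
```

Keeping the header as comment lines means the license travels with the file wherever it is copied, at the cost of every consumer needing to know about the `#` convention.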

@forzavitale can you update us on your progress from the hackathon (if you ended up working on this)?

jenniferthompson commented 7 years ago

Fantastic! Thanks so much for following up, @cduvallet! 🎉

darwinyfu commented 6 years ago

Is this still a project that needs help? I see the label but comments are fairly old.

Been lurking on D4D for a while but interested in working on something.

darya-akimova commented 6 years ago

Hello! The project has been dormant for a while (hence the old comments), and I'm one of the people trying to get it going again. Any issue with the label status-under-review can be ignored for now: it either can't be tackled yet or may need to be trimmed/reformatted. This is one of the older issues that I thought would be good to try and get through, because the drugbank.ca materials seem to be very useful for our current goal of matching drugs to therapeutic uses.

darya-akimova commented 6 years ago

In PR #83 @proof-by-accident investigated how many of the Medicare drugs can be found in the drugbank.ca data. The results seem similar to matching attempts from other sources: a good number of drugs can be matched easily on the first pass, but about twice as many were not matched, and matching those properly will probably require a non-trivial amount of research.
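To make "first pass" concrete, here is a rough sketch of an exact match on normalized names. This is not the code from PR #83; the function names and the sample drug names are made up for illustration, and the normalization (lowercasing plus whitespace collapsing) is just one assumption about what counts as an easy match.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def first_pass_match(medicare_names, drugbank_names):
    """Split Medicare names into exact (normalized) matches and leftovers."""
    lookup = {normalize(n): n for n in drugbank_names}
    matched, unmatched = {}, []
    for name in medicare_names:
        hit = lookup.get(normalize(name))
        if hit is None:
            unmatched.append(name)  # needs manual research or a fuzzier second pass
        else:
            matched[name] = hit
    return matched, unmatched


# Toy inputs, not real rows from either dataset.
medicare = ["ATORVASTATIN  CALCIUM", "Some Brand XR"]
drugbank = ["Atorvastatin calcium", "Metformin"]
matched, unmatched = first_pass_match(medicare, drugbank)
```

The `unmatched` list is the part that would need the non-trivial research, for example brand names that only appear as synonyms.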

acutrell commented 6 years ago

I don't have the coding ability to do this, but I am knowledgeable about the domain as an informatics pharmacist and willing to offer some help from that aspect. Pretty sure the answer to this problem is the Structured Product Labeling (SPL). It is a document markup standard approved by Health Level Seven (HL7) and adopted by FDA as a mechanism for exchanging product and facility information.

Different datasets use different drug identifiers (brand name, generic name, NDA, NDC, etc.), so it is hard to find the same drug in different datasets. OpenFDA harmonizes drug identifiers, and fields for various pharmacological uses are part of the dataset. Take a look: https://open.fda.gov/drug/label/reference/
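For whoever does the coding, a rough sketch of how the harmonized fields might be pulled out of the openFDA drug label endpoint. The helper names are hypothetical, the sample response is hand-written rather than real API output, and the choice of `openfda` subfields is an assumption to check against the reference page linked above. No network call is made here.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.fda.gov/drug/label.json"


def build_label_query(brand_name, limit=1):
    """Build a search URL for drug labels by harmonized brand name."""
    params = {"search": f'openfda.brand_name:"{brand_name}"', "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"


def extract_identifiers(response):
    """Pull identifier fields from a parsed JSON response (a dict)."""
    out = []
    for result in response.get("results", []):
        openfda = result.get("openfda", {})
        out.append({
            "brand_name": openfda.get("brand_name", []),
            "generic_name": openfda.get("generic_name", []),
            "product_ndc": openfda.get("product_ndc", []),
            "pharm_class_epc": openfda.get("pharm_class_epc", []),
        })
    return out


url = build_label_query("Example Brand")

# Hand-written stand-in for a response, for illustration only.
sample = {"results": [{"openfda": {"brand_name": ["X"], "generic_name": ["y"]}}]}
identifiers = extract_identifiers(sample)
```

Missing fields come back as empty lists, which matters because not every label record carries every identifier.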

darya-akimova commented 6 years ago

Oh this seems great! The OpenFDA might be just what we need because you're right, we have been running into the issue where not everything is in one dataset and the names can be inconsistent between datasets. Thanks for this suggestion.

veena-v-g commented 6 years ago

Can I help?

TBusen commented 5 years ago

Is this still active? Can I start this or is this throw away work?