dhimmel / drugbank

User-friendly extensions of the DrugBank database

DrugBank 5.0: extract pubmed IDs for references #2

Open AlexanderHauser opened 7 years ago

AlexanderHauser commented 7 years ago
ref_text = protein.findtext("{ns}references[@format='textile']".format(ns=ns))

doesn't seem to match anything in the latest DrugBank 5 release.

Is there a fix for this?

dhimmel commented 7 years ago

Looks like DrugBank 5.0 uses a different schema for references. From https://www.drugbank.ca/releases/5-0-7/downloads/all-full-database, I see the following XML:

<references>
<articles>
  <article>
    <pubmed-id>10505536</pubmed-id>
    <citation>Turpie AG: Anticoagulants in acute coronary syndromes. Am J Cardiol. 1999 Sep 2;84(5A):2M-6M.</citation>
  </article>
  <article>
    <pubmed-id>10912644</pubmed-id>
    <citation>Warkentin TE: Venous thromboembolism in heparin-induced thrombocytopenia. Curr Opin Pulm Med. 2000 Jul;6(4):343-51.</citation>
  </article>
  <article>
    <pubmed-id>11055889</pubmed-id>
    <citation>Eriksson BI: New therapeutic options in deep vein thrombosis prophylaxis. Semin Hematol. 2000 Jul;37(3 Suppl 5):7-9.</citation>
  </article>
  <article>
    <pubmed-id>11467439</pubmed-id>
    <citation>Fabrizio MC: Use of ecarin clotting time (ECT) with lepirudin therapy in heparin-induced thrombocytopenia and cardiopulmonary bypass. J Extra Corpor Technol. 2001 May;33(2):117-25.</citation>
  </article>
  <article>
    <pubmed-id>11807012</pubmed-id>
    <citation>Szaba FM, Smiley ST: Roles for thrombin and fibrin(ogen) in cytokine/chemokine production and macrophage adhesion in vivo. Blood. 2002 Feb 1;99(3):1053-9.</citation>
  </article>
  <article>
    <pubmed-id>11752352</pubmed-id>
    <citation>Chen X, Ji ZL, Chen YZ: TTD: Therapeutic Target Database. Nucleic Acids Res. 2002 Jan 1;30(1):412-5.</citation>
  </article>
</articles>
<textbooks/>
<links/>
</references>

So you will have to modify parse.ipynb. You could write an XPath query that finds all pubmed-id subelements of references, something like (untested):

pubmed_ids = protein.findall("{ns}references//{ns}pubmed-id".format(ns=ns))
row['pubmed_ids'] = '|'.join(x.text for x in pubmed_ids)

Let us know whether this works. Pull requests to upgrade this repo to DrugBank 5.0 would also be welcome.
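For anyone who wants to try the query outside the notebook, here is a minimal sketch using only the standard library. The `<target>` wrapper element and the namespace string `{http://www.drugbank.ca}` are assumptions about the surrounding file, not something shown in this thread; the two articles are taken from the XML quoted above.

```python
import xml.etree.ElementTree as ET

# Assumption: DrugBank's default namespace, in ElementTree's "{uri}" form.
ns = "{http://www.drugbank.ca}"

# Hypothetical, trimmed element using the 5.0 reference layout quoted above.
xml_text = """
<target xmlns="http://www.drugbank.ca">
  <references>
    <articles>
      <article>
        <pubmed-id>10505536</pubmed-id>
        <citation>Turpie AG: Anticoagulants in acute coronary syndromes. Am J Cardiol. 1999 Sep 2;84(5A):2M-6M.</citation>
      </article>
      <article>
        <pubmed-id>10912644</pubmed-id>
        <citation>Warkentin TE: Venous thromboembolism in heparin-induced thrombocytopenia. Curr Opin Pulm Med. 2000 Jul;6(4):343-51.</citation>
      </article>
    </articles>
    <textbooks/>
    <links/>
  </references>
</target>
"""

protein = ET.fromstring(xml_text)

# Old lookup: <references> no longer carries format="textile", so nothing matches.
old_ref_text = protein.findtext("{ns}references[@format='textile']".format(ns=ns))

# New lookup: every <pubmed-id> descendant of <references>.
pubmed_ids = protein.findall("{ns}references//{ns}pubmed-id".format(ns=ns))
joined = "|".join(x.text for x in pubmed_ids)

print(old_ref_text)
print(joined)
```

On this sample the old lookup returns `None`, which matches the behaviour reported at the top of the issue, while the new query collects both IDs.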

AlexanderHauser commented 7 years ago

Thanks for your quick response!

Your suggested XPath query seems to work. Only 3 entries return None, which might be a database issue. I have no further upgrades to the repo for DrugBank 5.0 compatibility, so please go ahead with this (minor) change.

khughitt commented 5 years ago

In case it helps anyone else, the following changes (based on the suggestion above) fixed the issue for me:

pubmed_ids = protein.findall("{ns}references//{ns}pubmed-id".format(ns=ns))
row['pubmed_ids'] = '|'.join([x.text for x in pubmed_ids if x.text is not None])
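To illustrate why the `is not None` filter matters, here is a small self-contained sketch. It assumes that the problematic entries contain an empty `<pubmed-id/>` element (the thread does not confirm the exact cause), and the `<target>` wrapper and namespace string are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Assumption: DrugBank's default namespace, in ElementTree's "{uri}" form.
ns = "{http://www.drugbank.ca}"

# Hypothetical element in which one article has an empty <pubmed-id/>.
xml_text = """
<target xmlns="http://www.drugbank.ca">
  <references>
    <articles>
      <article><pubmed-id>10505536</pubmed-id></article>
      <article><pubmed-id/></article>
      <article><pubmed-id>11752352</pubmed-id></article>
    </articles>
  </references>
</target>
"""

protein = ET.fromstring(xml_text)
pubmed_ids = protein.findall("{ns}references//{ns}pubmed-id".format(ns=ns))

# Without the filter, str.join raises TypeError on the element whose text is None.
try:
    "|".join(x.text for x in pubmed_ids)
    unfiltered_ok = True
except TypeError:
    unfiltered_ok = False

# With the filter, empty elements are skipped.
filtered = "|".join(x.text for x in pubmed_ids if x.text is not None)

print(unfiltered_ok)
print(filtered)
```

If the None values come from somewhere other than empty elements, the filter still prevents the join from failing, but it may be worth inspecting those entries separately.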