dhimmel / drugbank

User-friendly extensions of the DrugBank database

Regarding timeout when using pubchempy.get_compounds #1

Closed anuragpassi closed 8 years ago

anuragpassi commented 8 years ago

Hi, I am trying to reproduce the result from your pubchem-parse.py and somehow I get a timeout error. How can I resolve it?

dhimmel commented 8 years ago

Okay so you're running into trouble with pubchempy.get_compounds in the pubchem-map.ipynb notebook.

Can you check whether any pubchempy queries work on your setup? For example, does the following command succeed?

import pubchempy
inchi = "InChI=1S/C6H8O4/c1-9-5(7)3-4-6(8)10-2/h3-4H,1-2H3/b4-3+"
pubchempy.get_compounds(inchi, namespace='inchi')

anuragpassi commented 8 years ago

Yes, they are working. However, even 1000 compounds give a timeout error.

dhimmel commented 8 years ago

Hmm, in pubchem-map.ipynb, I only request one compound at a time. Can you split your query into many smaller queries?

See the pubchempy docs about avoiding a timeout error: http://pubchempy.readthedocs.io/en/v1.0.3/guide/advanced.html#avoiding-timeouterror
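
One way to split a large query is to send one identifier per request and retry the ones that fail. The sketch below is only an illustration, not code from this repository: `fetch_individually` and its `retries` and `pause` parameters are hypothetical names, and `fetch` stands for any callable, for example `lambda inchi: pubchempy.get_compounds(inchi, namespace='inchi')`.

```python
import time


def fetch_individually(identifiers, fetch, retries=3, pause=1.0):
    """Call fetch(identifier) once per identifier, retrying on errors.

    Returns (results, failed): a dict of successful lookups and a list of
    identifiers that still failed after `retries` attempts.
    """
    results = {}
    failed = []
    for identifier in identifiers:
        for _ in range(retries):
            try:
                results[identifier] = fetch(identifier)
                break  # success, move on to the next identifier
            except Exception:
                # Assumption: any exception (such as a timeout) is worth a retry.
                time.sleep(pause)
        else:
            # The inner loop never hit `break`, so every attempt failed.
            failed.append(identifier)
    return results, failed
```

Keeping the failed identifiers in a separate list makes it possible to rerun only those later instead of repeating the whole batch.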

anuragpassi commented 8 years ago

Yes, that's what I am trying.

anuragpassi commented 8 years ago

Hi,

Well, I ran the DrugBank mapping and got the data. However, when I tried to map the updated DrugBank with the SIDER STITCH IDs, I only got some 340 drugs. Somehow the DrugBank, PubChem and STITCH IDs are not mapping, and I am missing a lot of entries.

What can I do in this case?

Please advise.

Regards, Anurag

dhimmel commented 8 years ago

when I tried to map the updated DrugBank with the SIDER STITCH IDs, I only got some 340 drugs

I recover more than 340 compounds when mapping to SIDER. Check out how I map the STITCH IDs to DrugBank in dhimmel/SIDER4.

anuragpassi commented 8 years ago

So is the drugbank to pubchem mapping is recent(pubchem.tsv)???

dhimmel commented 8 years ago

So is the drugbank to pubchem mapping is recent(pubchem.tsv)???

@anuragpassi I don't understand what you're asking. Try to be more clear and describe exactly what you mean.

dhimmel commented 8 years ago

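I'm guessing you're asking whether the pubchem.tsv file (https://github.com/dhimmel/drugbank/blob/3e87872db5fca5ac427ce27464ab945c0ceb4ec6/data/mapping/pubchem.tsv) is recent. I got confused because there was no space between "recent" and "(pubchem.tsv)". The commit date for dhimmel/drugbank@3e87872db5fca5ac427ce27464ab945c0ceb4ec6 is Apr 13, 2015. Note that we used the UniChem connectivity search (https://thinklab.com/discussion/unifying-drug-vocabularies/40#5) for the DrugBank mapping in dhimmel/SIDER4.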

anuragpassi commented 8 years ago

Oh. I thought that DrugBank was first mapped to get PubChem IDs, and then the PubChem IDs were mapped to STITCH IDs to get the DrugBank to side effect relation.

anuragpassi commented 8 years ago

Dear Daniel, I have tried your code for mapping DrugBank IDs to STITCH IDs and side effects. However, when I try to use the updated DrugBank data from the website, I do not get much data. I was wondering if you could run your program on latest drugbank data so that I can match your output with mine.

Please advise.

Regards, Anurag

dhimmel commented 8 years ago

I was wondering if you could run your program on latest drugbank data so that I can match your output with mine.

Sorry I won't have time for this in the near future. I may update my mapping in the future. The conversion from STITCH ID to pubchem_ids is trivial:

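def stitch_flat_to_pubchem(cid):
    # Flat STITCH IDs carry an offset of 100,000,000 on top of the PubChem CID.
    assert cid.startswith('CID')
    # Subtract an int rather than the float 1e8 so the result stays an integer.
    return int(cid[3:]) - 100000000

def stitch_stereo_to_pubchem(cid):
    # Stereo STITCH IDs hold the PubChem CID directly after the 'CID' prefix.
    assert cid.startswith('CID')
    return int(cid[3:])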

To go from PubChem to DrugBank, you could rerun pubchem-map.ipynb (which you may have already done), which maps by InChI. The results from when I ran it are at data/pubchem-mapping.tsv.

You could also use the mapping in data/mapping/pubchem.tsv, which is generated using UniChem's connectivity search. This mapping will be fuzzier than the first method (small chemical differences are ignored).

If you're having lots of trouble with redoing the mapping, I'd suggest proceeding with either of the existing mappings.
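
To see which entries drop out when chaining the two steps (STITCH ID to PubChem ID, then PubChem ID to DrugBank ID), a small stand-alone check can help. This is a sketch under stated assumptions, not the code used in dhimmel/SIDER4: the column names `drugbank_id` and `pubchem_id`, the helper names, and the toy identifiers are all made up for illustration.

```python
import csv
import io

FLAT_OFFSET = 100000000  # offset carried by flat STITCH IDs


def stitch_flat_to_pubchem(cid):
    assert cid.startswith('CID')
    return int(cid[3:]) - FLAT_OFFSET


def read_pubchem_to_drugbank(handle):
    """Read a TSV with drugbank_id and pubchem_id columns (assumed names)."""
    mapping = {}
    for row in csv.DictReader(handle, delimiter='\t'):
        # One PubChem ID may correspond to several DrugBank IDs.
        mapping.setdefault(int(row['pubchem_id']), []).append(row['drugbank_id'])
    return mapping


def map_stitch_ids(stitch_ids, pubchem_to_drugbank):
    """Split STITCH IDs into those with and without a DrugBank match."""
    mapped, unmapped = {}, []
    for stitch_id in stitch_ids:
        pubchem_id = stitch_flat_to_pubchem(stitch_id)
        if pubchem_id in pubchem_to_drugbank:
            mapped[stitch_id] = pubchem_to_drugbank[pubchem_id]
        else:
            unmapped.append(stitch_id)
    return mapped, unmapped


# Toy data with made-up identifiers, standing in for a real mapping file.
toy_tsv = io.StringIO(
    "drugbank_id\tpubchem_id\nDB_EXAMPLE_1\t85\nDB_EXAMPLE_2\t1234\n"
)
mapping = read_pubchem_to_drugbank(toy_tsv)
mapped, unmapped = map_stitch_ids(['CID100000085', 'CID100009999'], mapping)
print(mapped)
print(unmapped)
```

Listing the unmapped STITCH IDs explicitly shows whether entries are lost in the ID conversion or in the PubChem to DrugBank lookup.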

anuragpassi commented 8 years ago

I actually did use the existing mappings too, but many drugs are missing. I do not know why. I will give the mappings another try.

Thank you

anuragpassi commented 8 years ago

Also, can you guide me as to which script to use to parse drugbank data to get InChIs and map to PubChem?

dhimmel commented 8 years ago

can you guide me as to which script to use to parse drugbank data

This mapping is accomplished by running two Python notebooks in the following order.

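  1. parse.ipynb (https://github.com/dhimmel/drugbank/blob/55587651ee9417e4621707dac559d84c984cf5fa/parse.ipynb) converts the XML download to TSV with an inchi column.
  2. pubchem-map.ipynb (https://github.com/dhimmel/drugbank/blob/55587651ee9417e4621707dac559d84c984cf5fa/pubchem-map.ipynb) maps DrugBank to PubChem using inchi.
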
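
The two steps can be sketched with the standard library alone. This is a simplified illustration and the notebooks themselves may work differently: the XML namespace and element names are assumptions about the DrugBank download, the sample document uses made-up IDs, and `extract_rows`, `rows_to_tsv`, `map_by_inchi` and `lookup` are hypothetical names. In practice `lookup` could wrap `pubchempy.get_compounds(inchi, namespace='inchi')`.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumption: namespace and element names of the DrugBank XML download.
NS = '{http://www.drugbank.ca}'

# Toy document with made-up DrugBank IDs, not real DrugBank data.
SAMPLE_XML = """<drugbank xmlns="http://www.drugbank.ca">
  <drug>
    <drugbank-id primary="true">DB_EXAMPLE</drugbank-id>
    <name>methane</name>
    <calculated-properties>
      <property><kind>InChI</kind><value>InChI=1S/CH4/h1H4</value></property>
    </calculated-properties>
  </drug>
  <drug>
    <drugbank-id primary="true">DB_NO_INCHI</drugbank-id>
    <name>some biologic</name>
  </drug>
</drugbank>"""


def extract_rows(xml_text):
    """Step 1: pull the ID, name and InChI (if any) for each top-level drug."""
    root = ET.fromstring(xml_text)
    rows = []
    for drug in root.findall(NS + 'drug'):
        inchi = ''
        for prop in drug.findall(NS + 'calculated-properties/' + NS + 'property'):
            if prop.findtext(NS + 'kind') == 'InChI':
                inchi = prop.findtext(NS + 'value') or ''
                break
        rows.append({
            'drugbank_id': drug.findtext(NS + "drugbank-id[@primary='true']"),
            'name': drug.findtext(NS + 'name'),
            'inchi': inchi,
        })
    return rows


def rows_to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=['drugbank_id', 'name', 'inchi'],
        delimiter='\t', lineterminator='\n',
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def map_by_inchi(rows, lookup):
    """Step 2: pair each DrugBank ID that has an InChI with lookup(inchi)."""
    return {row['drugbank_id']: lookup(row['inchi']) for row in rows if row['inchi']}


rows = extract_rows(SAMPLE_XML)
print(rows_to_tsv(rows))
```

Drugs without an InChI are kept in the TSV but skipped in the second step, since there is nothing to query for them.
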
anuragpassi commented 8 years ago

Thank you. I'll try.

anuragpassi commented 8 years ago

Dear Daniel,

I have created these two files: drugbank.tsv and pubchem-mapping.tsv. I am now running the SIDER4.0 code, but I get an error:

MergeError: No common columns to perform merge on

I believe that the pubchem_id values in drugbank.tsv are not mapping to pubchem-mapping.tsv. What could be the problem here? I am trying to run this code:

columns = [
    'stitch_id_flat',
    'stitch_id_sterio',
    'umls_cui_from_label',
    'meddra_type',
    'umls_cui_from_meddra',
    'side_effect_name',
]
se_df = pandas.read_table('download/meddra_all_se.tsv.gz', names=columns)
se_df['pubchem_id'] = se_df.stitch_id_sterio.map(stitch_stereo_to_pubchem)
se_df = drugbank_map_df.merge(se_df)  ### THIS IS WHERE I AM GETTING ERROR
se_df.head(2)

I am attaching the two input files with this email.

Please advise.

Regards,

Anurag

dhimmel commented 8 years ago

@anuragpassi the attached files don't show up on the GitHub issue. I recommend replying via the GitHub issue interface, so you can see exactly how your message will get displayed.

Not sure why you are getting the error. I recommend viewing the head of each dataframe and making sure they have common columns to merge on.
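
When `DataFrame.merge` is called without `on`, pandas joins on the column names the two frames share and raises this MergeError when there are none. A quick way to check for shared names is sketched below using only the standard library. The helper names and the toy column names (including `pubchem_cid`) are assumptions for illustration, not the actual headers of the files in question.

```python
import csv
import io


def header_columns(handle, delimiter='\t'):
    """Return the column names from the first row of a delimited file."""
    return next(csv.reader(handle, delimiter=delimiter))


def common_columns(left_columns, right_columns):
    """Columns present in both tables, in the order of the left table."""
    right = set(right_columns)
    return [name for name in left_columns if name in right]


# Toy header: the mapping file names its ID column differently (hypothetical).
left = header_columns(io.StringIO("drugbank_id\tpubchem_cid\nDB_EXAMPLE\t85\n"))
right = ['stitch_id_flat', 'stitch_id_sterio', 'pubchem_id']

shared = common_columns(left, right)
if not shared:
    print('No shared columns between', left, 'and', right)
```

With the dataframes already loaded, the equivalent check is `set(left_df.columns) & set(right_df.columns)`. If the result is empty, renaming a column or passing `left_on` and `right_on` to `merge` gives pandas something to join on.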