CambridgeMolecularEngineering / chemdataextractor2

ChemDataExtractor Version 2.0

installation issues and doc.records.serialize() gives empty output #12

Closed ViktorWeissenborn closed 1 year ago

ViktorWeissenborn commented 2 years ago

Hey,

I had some trouble installing the CDE2 package via pip inside a clean conda environment (I tried Python versions from 3.7 to 3.9 without any difference). I also tried installing CDE2 from the Windows terminal, PowerShell, WSL and an Ubuntu 20.04 system terminal, again without any difference. The error logs were quite large, so I have attached them as files. In general, the installation always failed while installing the build dependencies for spaCy 2.1.9.

Eventually I tried some fixes from Stack Overflow...

https://stackoverflow.com/questions/34819221/why-is-python-setup-py-saying-invalid-command-bdist-wheel-on-travis-ci
https://stackoverflow.com/questions/26053982/setup-script-exited-with-error-command-x86-64-linux-gnu-gcc-failed-with-exit
https://github.com/explosion/spaCy#pip

...and used 'venv' instead of 'conda', which gave me an installation without errors. The -V flag shows version 2.1.1 and it is possible to import the package into Python (as seen below). However, doc.records.serialize() gives an empty output both for a normal sentence and for an organic chemistry paper HTML file, even though the AllenNLP model was loaded correctly (as seen below).

caffeine@caffeine-ThinkPad-E590:~$ source .env/bin/activate
(.env) caffeine@caffeine-ThinkPad-E590:~$ cde -V
cde, version 2.1.1
(.env) caffeine@caffeine-ThinkPad-E590:~$ python
Python 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from chemdataextractor import Document
>>> import chemdataextractor
>>> chemdataextractor.__version__
'2.1.1'
>>> from chemdataextractor import Document
>>> doc = Document('xy was extracted using ethanol.')
>>> doc.records.serialize()
Initialising AllenNLP model ✔
[]
>>> f = open('/home/caffeine/Desktop/test/01_acs-jnatprod-5b00099.html', 'rb')
>>> doc = Document.from_file(f)
>>> doc.records.serialize()
[]

So in general it would be very helpful if you could tell me exactly how you installed the chemdataextractor2 package, so I can see whether I forgot something or got something wrong during the installation process. If you have any idea how I could fix my current installation (the problem with "doc.records.serialize()"), I would also be very grateful. But a detailed description of how you installed CDE2 would help me most right now, because I guess all of this could be resolved by a correct, clean reinstallation.

Greetings,
Viktor

ErrorLog_CDE2_installation_botocore.txt ErrorLog_CDE2_installation.txt ErrorLog_CDE2_withUpdatedSetuptoolsInNonCondaEnv.txt

ti250 commented 2 years ago

Hi Viktor,

To get records you need to set the document to look for compounds, so in your case, adding the following lines should help:

from chemdataextractor.model import Compound
doc.models = [Compound]
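
For reference, here is a minimal sketch of how those two lines fit into the session from the original post. The helper name `extract_compound_records` is hypothetical and only for illustration; the `Document`, `Compound`, `doc.models` and `doc.records.serialize()` calls are the ones already used in this thread. What the function returns depends on the input text and the installed models, so no output is shown.

```python
def extract_compound_records(text):
    """Return serialized Compound records found in `text`.

    Hypothetical helper for illustration, not part of chemdataextractor.
    """
    # Imported inside the function because importing chemdataextractor
    # is slow and requires a working CDE2 installation.
    from chemdataextractor import Document
    from chemdataextractor.model import Compound

    doc = Document(text)
    # Tell the document which models to look for. Without this
    # assignment, doc.records.serialize() returned [] in the report above.
    doc.models = [Compound]
    return doc.records.serialize()
```

It could then be called as `extract_compound_records('xy was extracted using ethanol.')`. For a file, the same `doc.models = [Compound]` assignment would go right after `Document.from_file(f)`.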

With regard to the errors, it seems to me that pip couldn't find gcc or another suitable C compiler, which is needed to build one of the libraries (a patched version of DAWG) on Python 3.7 and above. I'm not sure why using venv fixed this, though.
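
One quick way to test the missing-compiler theory before reinstalling is to check whether a C compiler is visible on the PATH of the environment where pip runs. This is only a rough sketch using the standard library: the function name `find_c_compiler` and the list of candidate names are assumptions, and finding an executable on the PATH does not guarantee that pip's build step will actually use it.

```python
import shutil


def find_c_compiler(candidates=("gcc", "cc", "clang", "cl")):
    """Return (name, path) of the first candidate found on PATH, or None.

    The candidate names are an assumed, non-exhaustive list of common
    C compiler executables ("cl" is the MSVC compiler on Windows).
    """
    for name in candidates:
        path = shutil.which(name)  # None if the executable is not on PATH
        if path:
            return name, path
    return None


result = find_c_compiler()
if result is None:
    print("No C compiler found on PATH")
else:
    print("Found %s at %s" % result)
```

If this reports no compiler inside the conda environment but finds one inside the venv, that would be consistent with the build failures described above.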