CambridgeMolecularEngineering / chemdataextractor2

ChemDataExtractor Version 2.0

ModelNotFoundError for punkt_chem-1.0.pickle #33

Closed a11525 closed 2 months ago

a11525 commented 1 year ago

Hello,

I'm trying to use ChemDataExtractor2 for my project. I followed the instructions in the documentation, installed the package via pip (`pip install chemdataextractor2`), and tried to run the example code provided. However, I'm encountering a `ModelNotFoundError` stating "Could not load models/punkt_chem-1.0.pickle. Have you run cde data download?".

My code:

```python
from chemdataextractor.doc import Paragraph

strings = "Ring A is -C3-C12 cycloalkyl, 3- to 14-membered heterocyclyl having 1 to 4 heteroatoms selected from oxygen, nitrogen, and sulfur, -Ce-Cio aryl, or 5- to 14-membered heteroaryl having 1 to 4 heteroatoms selected from oxygen, nitrogen, and sulfur, wherein each cycloalkyl, heterocyclyl, aryl, or heteroaryl of Ring A is optionally substituted with one or more substituents selected from the group consisting of halogen, -C1-C6 alkyl, -OR, -OC(O)R’, -NR2, -NRC(O)R’, -NRS(O)2R’, -CN, -NO2, -SR, -C(O)R’, -C(O)OR, -C(O)NR2, -S(O)2R’, and -S(O)2NR2;"
para = Paragraph(strings)
para.sentences
```

I've attempted to uninstall and reinstall the package several times, but the error persists. Additionally, during the installation of ChemDataExtractor2, I noticed an error partway through, although the installation seemed to continue regardless. Here's the error message:

```
ERROR: Command errored out with exit status 1: /home/sanghoon/anaconda3/envs/rdkit_env/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = ... (truncated for readability)
```

Then followed by a warning:

```
WARNING: Discarding https://files.pythonhosted.org/packages/ce/5e/8f21b3f32ea3566764d1c90f4360703be7d1739ed7b51cbf89bed00fa331/spacy-2.1.0.tar.gz#sha256=e3dbde5b560fb9dd3706bd6838e66e28119b6aa17bcb0711d53e95c830bcf0a7 (from https://pypi.org/simple/spacy/) (requires-python:>=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*). Command errored out with exit status 1: ... (truncated for readability)
```

I am unsure how to resolve this issue. Could you provide some guidance or assistance on this matter? Any help would be greatly appreciated.

Thank you.

Kind regards, Sanghoon Lee

OBrink commented 1 year ago

Hey Sanghoon (@a11525),

I tried to reproduce your problem with the code you provided, but everything runs without issues, so something must be wrong with your setup. The path in your error message (/home/sanghoon/anaconda3/envs/rdkit_env/bin/python3.9) indicates that you are running this in a Python 3.9 environment. According to the setup instructions in the README, the current version of ChemDataExtractor is only compatible with Python 3.5-3.8, which might be the root of the problem. I have had problems with properly installing ChemDataExtractor since version 2.1, so I recommend running it with this Docker image.

Kind regards, Otto

OBrink commented 1 year ago

One thing just came to my mind:

Have you run `cde data download` in the command shell in your environment?

a11525 commented 1 year ago

Hello Otto (@OBrink),

My problem seems to be a version conflict during installation. When I created a new environment and used `pip install chemdataextractor2`, it installed fine. However, when importing `chemdataextractor.doc.Paragraph`, the `int` issue from issue #29 came up, which I resolved, and I also corrected the `np.float` issue that occurred after that. Now the nltk package gives me `AttributeError: partially initialized module 'nltk' has no attribute 'data' (most likely due to a circular import)`. Here is the line where the error occurs. How should I solve it?

```python
sent_tokenizer=nltk.data.LazyLoader("tokenizers/punkt/english.pickle"),
```

Kind regards, Sanghoon Lee
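For reference, this AttributeError is the generic symptom of a circular import: a module's attribute is looked up while the module is still mid-import, before the attribute has been defined. A minimal stdlib reproduction (using two hypothetical modules, `mod_a` and `mod_b`, written to a temporary directory — not part of the original thread):

```python
import os
import sys
import tempfile

# Two hypothetical modules that import each other: mod_b asks for
# mod_a.value while mod_a is still only partially initialized.
d = tempfile.mkdtemp()
with open(os.path.join(d, "mod_a.py"), "w") as f:
    f.write("import mod_b\nvalue = 1\n")
with open(os.path.join(d, "mod_b.py"), "w") as f:
    f.write("import mod_a\nx = mod_a.value\n")

sys.path.insert(0, d)
try:
    import mod_a  # triggers mod_b, which touches mod_a too early
except AttributeError as e:
    print(e)  # e.g. "partially initialized module 'mod_a' has no attribute 'value'"
```

Importing `nltk.data` explicitly before chemdataextractor lets nltk finish initializing first, which is presumably why that ordering sidesteps the error.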

a11525 commented 1 year ago

The problem was solved by changing all `np.float` to `float` and executing `import nltk.data` before importing chemdataextractor. Thanks. However, when I access `cems` or `tokens` on a `Document` or `Paragraph` for a specific sentence, a "Ran out of input" error occurs. What is the problem? The sentence I wanted did not work, so I followed the example from the documentation, but that did not work either.

```python
para_exam = Paragraph('1,4-Dibromoanthracene was prepared from 1,4-diaminoanthraquinone. 1H NMR spectra were recorded on a 300 MHz BRUKER DPX300 spectrometer.')
para_exam.tokens
```


```
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
Cell In[21], line 1
----> 1 para_exam.tokens

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/doc/text.py:288, in Text.tokens(self)
    286 @property
    287 def tokens(self):
--> 288     return [sent.tokens for sent in self.sentences]

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/doc/text.py:288, in <listcomp>(.0)
    286 @property
    287 def tokens(self):
--> 288     return [sent.tokens for sent in self.sentences]

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/utils.py:29, in memoized_property.<locals>.fget_memoized(self)
     26 @functools.wraps(fget)
     27 def fget_memoized(self):
     28     if not hasattr(self, attr_name):
---> 29         setattr(self, attr_name, fget(self))
     30     return getattr(self, attr_name)

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/doc/text.py:525, in Sentence.tokens(self)
    523 @memoized_property
    524 def tokens(self):
--> 525     tokens = self.word_tokenizer.get_word_tokens(self)
    526     for token in tokens:
    527         token.text = "".join(ch for ch in token.text if unicodedata.category(ch)[0] != "C")

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/nlp/tokenize.py:313, in WordTokenizer.get_word_tokens(self, sentence, additional_regex)
    311 if not additional_regex:
    312     additional_regex = self.get_additional_regex(sentence)
--> 313 return sentence._tokens_for_spans(self.span_tokenize(sentence.text, additional_regex))

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/doc/text.py:531, in Sentence._tokens_for_spans(self, spans)
    530 def _tokens_for_spans(self, spans):
--> 531     toks = [RichToken(
    532         text=self.text[span[0]:span[1]],
    533         start=span[0] + self.start,
    534         end=span[1] + self.start,
    535         lexicon=self.lexicon,
    536         sentence=self
    537     ) for span in spans]
    538     return toks

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/doc/text.py:531, in <listcomp>(.0)
    530 def _tokens_for_spans(self, spans):
--> 531     toks = [RichToken(
    532         text=self.text[span[0]:span[1]],
    533         start=span[0] + self.start,
    534         end=span[1] + self.start,
    535         lexicon=self.lexicon,
    536         sentence=self
    537     ) for span in spans]
    538     return toks

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/doc/text.py:1049, in RichToken.__init__(self, text, start, end, lexicon, sentence)
   1048 def __init__(self, text, start, end, lexicon, sentence):
-> 1049     super(RichToken, self).__init__(text, start, end, lexicon)
   1050     self.sentence = sentence
   1051     self._tags = {}

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/doc/text.py:1016, in Token.__init__(self, text, start, end, lexicon)
   1014 #: The lexicon for this token.
   1015 self.lexicon = lexicon
-> 1016 self.lexicon.add(text)

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/nlp/lexicon.py:125, in Lexicon.add(self, text)
    102 if text not in self.lexemes:
    103     normalized = self.normalized(text)
    104     self.lexemes[text] = Lexeme(
    105         text=text,
    106         normalized=normalized,
    107         lower=self.lower(normalized),
    108         first=self.first(normalized),
    109         suffix=self.suffix(normalized),
    110         shape=self.shape(normalized),
    111         length=self.length(normalized),
    112         upper_count=self.upper_count(normalized),
    113         lower_count=self.lower_count(normalized),
    114         digit_count=self.digit_count(normalized),
    115         is_alpha=self.is_alpha(normalized),
    116         is_ascii=self.is_ascii(normalized),
    117         is_digit=self.is_digit(normalized),
    118         is_lower=self.is_lower(normalized),
    119         is_upper=self.is_upper(normalized),
    120         is_title=self.is_title(normalized),
    121         is_punct=self.is_punct(normalized),
    122         is_hyphenated=self.is_hyphenated(normalized),
    123         like_url=self.like_url(normalized),
    124         like_number=self.like_number(normalized),
--> 125         cluster=self.cluster(normalized)
    126     )

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/nlp/lexicon.py:141, in Lexicon.cluster(self, text)
    139 """"""
    140 if not self._loaded_clusters and self.clusters_path:
--> 141     self.clusters = load_model(self.clusters_path)
    142     self._loaded_clusters = True
    143 return self.clusters.get(text, None)

File ~/anaconda3/envs/patent_py38/lib/python3.8/site-packages/chemdataextractor/data.py:162, in load_model(path)
    160 try:
    161     with io.open(abspath, 'rb') as f:
--> 162         model = six.moves.cPickle.load(f)
    163 except IOError:
    164     raise ModelNotFoundError('Could not load %s. Have you run cde data download?' % path)

EOFError: Ran out of input
```

Kind regards, Sanghoon Lee
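For context on the final exception: `EOFError: Ran out of input` is exactly what Python's `pickle` module raises when asked to load an empty (zero-byte) stream, which suggests the model file on disk exists but is empty or truncated, e.g. from an interrupted download. A minimal stdlib reproduction:

```python
import io
import pickle

# Unpickling an empty stream reproduces the error at the bottom of the
# traceback: pickle raises EOFError("Ran out of input") when there are
# no bytes to read.
try:
    pickle.load(io.BytesIO(b""))
except EOFError as e:
    print(e)  # Ran out of input
```

If that's the cause here, re-running the data download in the same environment so the file is fetched completely should clear it.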

OBrink commented 1 year ago

Hello Sanghoon,

If I understand this right, the error occurs when trying to load a model that ChemDataExtractor needs for tokenization. This seems odd. Have you tried running `cde data download` in your command shell, as suggested by the error message?

I have to say that I don't know how to fix this in your environment. I have given up on setting up an environment with the current version of ChemDataExtractor on my computer and am running it via Docker now. A Docker image that ships the package with all dependencies preinstalled is an encapsulated, pre-configured environment that works out of the box.

I have written some instructions on how to work with the ChemDataExtractor Docker image. After installing Docker, it should work as described there. I hope this helps.

Kind regards, Otto