CambridgeMolecularEngineering / chemdataextractor2

ChemDataExtractor Version 2.0

Check_records #55

Closed jghasemi44 closed 2 months ago

jghasemi44 commented 3 months ago

I used the first version of CDE, but after installing CDE 2 I get errors like check_records is not a module or submodule. Please let me know how I can tackle this issue.

Dingyun-Huang commented 3 months ago

Please post your code snippet and exception traceback so that we can reproduce your situation.

sparrowcolab commented 3 months ago

I'm having a similar problem. I was trying to run the example in extracting_a_custom_property.ipynb. First, let me explain how I installed CDE2 and how it runs on my machine: I use Spyder 5.0.0 with Python 3.7.12, which is the only combination of Python and Spyder where installing CDE2 doesn't cause dependency conflicts (in my case).

In the beginning there were a lot of difficulties installing because of urllib3 and SSL, but I finally managed to get it working and to import Document. I then tried to run the example:

d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (boiling point 240 °C)')
)

but there was a problem with the And element in chemdataextractor.doc.document.py, so I made a lazy import of the custom And as follows:

_custom_and = None

def get_custom_and():
    global _custom_and
    if _custom_and is None:
        from custom_and import CustomAnd
        _custom_and = CustomAnd
    return _custom_and

This worked around the problem for a while, and the error that 'And' doesn't contain a 'Flatten' module disappeared. But after running my code for a while, I started to get a circular-import error, so I commented out the lazy import, and then things worked very well.

However, one thing has never worked from the very beginning: when I type d, I expect the rendered document as in the example:

[In] d
[Out] Synthesis of 2,4,6-trinitrotoluene (3a)
The procedure was followed to yield a pale yellow solid (boiling point 240 °C)

but the output is only [Out] <Document: 2 elements>, so I have to use the following loop to extract the text:

for element in d.elements:
    print(element.text)

Synthesis of 2,4,6-trinitrotoluene (3a)
The procedure was followed to yield a pale yellow solid (b.p. 240 °C)

Then, every time I move on to the records step, I get an error related to pickle and serialize, as follows:

d.records.serialize()

Initialising AllenNLP model ...
Automatically activating GPU support
Traceback (most recent call last):
  File "C:\Users\Amgad Ahmed\AppData\Local\Temp\ipykernel_7368\1354135793.py", line 1, in <module>
    d.records.serialize()
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\document.py", line 276, in records
    element_definitions = el.definitions
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 360, in definitions
    return [definition for sent in self.sentences for definition in sent.definitions]
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 360, in <listcomp>
    return [definition for sent in self.sentences for definition in sent.definitions]
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\utils.py", line 29, in fget_memoized
    setattr(self, attr_name, fget(self))
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 771, in definitions
    for result in self.specifier_definition.scan(tokens):
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 117, in scan
    results, next_i = self.parse(tokens, i)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 146, in parse
    result, found_index = self._parse_tokens(tokens, i, actions)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 427, in _parse_tokens
    exprresults, i = e.parse(tokens, i)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 146, in parse
    result, found_index = self._parse_tokens(tokens, i, actions)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 656, in _parse_tokens
    results, i = self.expr.parse(tokens, i, actions)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 146, in parse
    result, found_index = self._parse_tokens(tokens, i, actions)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 551, in _parse_tokens
    result, result_i = e.parse(tokens, i, actions=True)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 146, in parse
    result, found_index = self._parse_tokens(tokens, i, actions)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 297, in _parse_tokens
    tag = token[self.tag_type]
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 1220, in __getitem__
    return self.legacy_pos_tag
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 1210, in legacy_pos_tag
    ner_tag = self[NER_TAG_TYPE]
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 1222, in __getitem__
    return self.__getattr__(key)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 1230, in __getattr__
    self.sentence._assign_tags(name)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 832, in _assign_tags
    self.document._batch_assign_tags(tagger, tag_type)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\document.py", line 718, in _batch_assign_tags
    tag_results = tagger.batch_tag_for_type(all_tokens, tag_type)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\nlp\tag.py", line 206, in batch_tag_for_type
    return tagger.batch_tag(sents)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\nlp\allennlpwrapper.py", line 194, in batch_tag
    batch_predictions = self.predictor.predict_batch_instance(instance)
  File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\nlp\allennlpwrapper.py", line 153, in predictor
    overrides=json.dumps(self.overrides))
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\models\archival.py", line 230, in load_archive
    cuda_device=cuda_device)
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\models\model.py", line 327, in load
    return cls.by_name(model_type)._load(config, serialization_dir, weights_file, cuda_device)
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\models\model.py", line 265, in _load
    model = Model.from_params(vocab=vocab, params=model_params)
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\common\from_params.py", line 365, in from_params
    return subclass.from_params(params=params, **extras)
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\common\from_params.py", line 386, in from_params
    kwargs = create_kwargs(cls, params, **extras)
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\common\from_params.py", line 133, in create_kwargs
    kwargs[name] = construct_arg(cls, name, annotation, param.default, params, **extras)
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\common\from_params.py", line 229, in construct_arg
    return annotation.from_params(params=subparams, **subextras)
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\common\from_params.py", line 365, in from_params
    return subclass.from_params(params=params, **extras)
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\modules\text_field_embedders\basic_text_field_embedder.py", line 160, in from_params
    for name, subparams in token_embedder_params.items()
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\modules\text_field_embedders\basic_text_field_embedder.py", line 160, in <dictcomp>
    for name, subparams in token_embedder_params.items()
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\common\from_params.py", line 365, in from_params
    return subclass.from_params(params=params, **extras)
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\common\from_params.py", line 388, in from_params
    return cls(**kwargs)  # type: ignore
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\modules\token_embedders\bert_token_embedder.py", line 270, in __init__
    model = PretrainedBertModel.load(pretrained_model)
  File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\modules\token_embedders\bert_token_embedder.py", line 38, in load
    model = BertModel.from_pretrained(model_name)
  File "C:\Amg\envs\ide_env\lib\site-packages\pytorch_pretrained_bert\modeling.py", line 590, in from_pretrained
    archive.extractall(tempdir)
  File "C:\Amg\envs\ide_env\lib\tarfile.py", line 2002, in extractall
    numeric_owner=numeric_owner)
  File "C:\Amg\envs\ide_env\lib\tarfile.py", line 2044, in extract
    numeric_owner=numeric_owner)
  File "C:\Amg\envs\ide_env\lib\tarfile.py", line 2114, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "C:\Amg\envs\ide_env\lib\tarfile.py", line 2163, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "C:\Amg\envs\ide_env\lib\tarfile.py", line 247, in copyfileobj
    buf = src.read(bufsize)
  File "C:\Amg\envs\ide_env\lib\gzip.py", line 287, in read
    return self._buffer.read(size)
  File "C:\Amg\envs\ide_env\lib\_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "C:\Amg\envs\ide_env\lib\gzip.py", line 493, in read
    raise EOFError("Compressed file ended before the "

EOFError: Compressed file ended before the end-of-stream marker was reached

The thing is, I used to execute all of this on CDE 1.3.0 very smoothly; the only problem was that I needed to use the GPU, which forced me to migrate to CDE 2.3.2.

So, could anyone help me with this error, please?

Dingyun-Huang commented 3 months ago

Hi there,

I am sorry that you are experiencing problems migrating from CDE1 to CDE2, but I could not reproduce your error on my laptop. I would suggest creating a fresh new environment and re-running the installation with pip:

conda create -n env_name python=3.7
conda activate env_name
pip install chemdataextractor2
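
After installing, a quick end-to-end sanity check along these lines should work (a minimal sketch; the sentence is just an example, and the first run will download the model data):

from chemdataextractor import Document

# A one-sentence document exercises tokenisation, the NER taggers,
# and record serialisation in one go.
doc = Document('The melting point of benzene is 5.5 °C.')
print(doc.records.serialize())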

If you still experience problems, please post again.

Best wishes, Dingyun

sparrowcolab commented 3 months ago

Dear Dingyun,

Thank you for taking the time to support,

Before I wrote my previous comment, I tried

conda create -n env_name python=3.7
conda activate env_name
pip install chemdataextractor2

several times, with python=3.6, 3.7 and 3.8. The problem with urllib3 and SSL always exists, and conflicting dependencies such as the six, click and requests versions appeared several times. By downgrading and cleaning the cache I managed to get CDE2 working. The main problem is now at d.records.serialize(): when it starts to initialise BERT in chemdataextractor.nlp.allennlpwrapper, it gives the following error:

File "C:\Amg\envs\ide_env\lib\gzip.py", line 493, in read raise EOFError("Compressed file ended before the "

EOFError: Compressed file ended before the end-of-stream marker was reached,

I'm new to Git, so if you tell me how I can help you reproduce the problem, I can give you more helpful data.

Finally, thank you for trying to help,

Best Regards, sparrow

Dingyun-Huang commented 2 months ago

Hi Sparrow,

Can you run cde data download and re-try your programme?
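
If the EOFError persists, it usually means one of the model archives was only partially downloaded, so deleting the partial file forces a clean re-download. A rough sketch (the data directory and archive name here are assumptions based on your log, so adjust them to your system):

import os

# Assumed default CDE data location on Windows; check your actual data
# directory if CDE reports a different one.
data_dir = os.path.join(os.environ['LOCALAPPDATA'], 'ChemDataExtractor', 'ChemDataExtractor')
archive = os.path.join(data_dir, 'models', 'scibert_cased_weights-1.0.tar.gz')
if os.path.exists(archive):
    os.remove(archive)  # then re-run: cde data download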

sparrowcolab commented 2 months ago

Dear Dingyun,

I tried to run cde data download in the conda CMD console, but got stuck at the following messages:

Couldn't find models/bert_finetuned_crf_model-1.0a, downloading...
Couldn't find models/scibert_cased_weights-1.0.tar.gz, downloading...

Then, after a period of time, I got the following message:

Successfully downloaded 0 new data packages (22 existing)

But when I then tried to run my CDE2 code, fortunately this time it got further when I ran:

for record in doc.records:
    res = record.serialize()

The AllenNLP model initialised and ran for a few more seconds until I got the following error:

Initialising AllenNLP model ...
Automatically activating GPU support
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✘ Couldn't link model to 'en_core_web_sm'
Creating a symlink in spacy/data failed. Make sure you have the required permissions and try re-running the command as admin, or use a virtualenv. You can still import the model as a module and call its load() method, or create the symlink manually.
C:\Amg\envs\ide_env\lib\site-packages\en_core_web_sm --> C:\Amg\envs\ide_env\lib\site-packages\spacy\data\en_core_web_sm
Traceback (most recent call last):

File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\common\util.py", line 289, in get_spacy_model spacy_model = spacy.load(spacy_model_name, disable=disable)

File "C:\Amg\envs\ide_env\lib\site-packages\spacy__init__.py", line 27, in load return util.load_model(name, **overrides)

File "C:\Amg\envs\ide_env\lib\site-packages\spacy\util.py", line 139, in load_model raise IOError(Errors.E050.format(name=name))

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "C:\Users\Amgad Ahmed\AppData\Local\Temp\ipykernel_8584\3059345307.py", line 1, in for record in doc.records:

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\document.py", line 276, in records element_definitions = el.definitions

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 360, in definitions return [definition for sent in self.sentences for definition in sent.definitions]

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 360, in return [definition for sent in self.sentences for definition in sent.definitions]

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\utils.py", line 29, in fget_memoized setattr(self, attr_name, fget(self))

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 771, in definitions for result in self.specifier_definition.scan(tokens):

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 117, in scan results, next_i = self.parse(tokens, i)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 146, in parse result, found_index = self._parse_tokens(tokens, i, actions)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 427, in _parse_tokens exprresults, i = e.parse(tokens, i)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 146, in parse result, found_index = self._parse_tokens(tokens, i, actions)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 656, in _parse_tokens results, i = self.expr.parse(tokens, i, actions)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 146, in parse result, found_index = self._parse_tokens(tokens, i, actions)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 551, in _parse_tokens result, result_i = e.parse(tokens, i, actions=True)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 146, in parse result, found_index = self._parse_tokens(tokens, i, actions)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\parse\elements.py", line 297, in _parse_tokens tag = token[self.tag_type]

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 1220, in getitem return self.legacy_pos_tag

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 1210, in legacy_pos_tag ner_tag = self[NER_TAG_TYPE]

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 1222, in getitem return self.getattr(key)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 1230, in getattr self.sentence._assign_tags(name)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\text.py", line 832, in _assign_tags self.document._batch_assign_tags(tagger, tag_type)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\document.py", line 718, in _batch_assign_tags tag_results = tagger.batch_tag_for_type(all_tokens, tag_type)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\nlp\tag.py", line 206, in batch_tag_for_type return tagger.batch_tag(sents)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\nlp\allennlpwrapper.py", line 194, in batch_tag batch_predictions = self.predictor.predict_batch_instance(instance)

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\nlp\allennlpwrapper.py", line 158, in predictor self._predictor = copy.deepcopy(SentenceTaggerPredictor(model=model, dataset_reader=None))

File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\predictors\sentence_tagger.py", line 26, in init self._tokenizer = SpacyWordSplitter(language=language, pos_tags=True)

File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\data\tokenizers\word_splitter.py", line 173, in init self.spacy = get_spacy_model(language, pos_tags, parse, ner)

File "C:\Amg\envs\ide_env\lib\site-packages\allennlp\common\util.py", line 302, in get_spacy_model link(spacy_model_name, spacy_model_name, model_path=package_path)

File "C:\Amg\envs\ide_env\lib\site-packages\spacy\cli\link.py", line 65, in link symlink_to(link_path, model_path)

File "C:\Amg\envs\ide_env\lib\site-packages\spacy\compat.py", line 96, in symlink_to ["mklink", "/d", path2str(orig), path2str(dest)], shell=True

File "C:\Amg\envs\ide_env\lib\subprocess.py", line 363, in check_call raise CalledProcessError(retcode, cmd)

CalledProcessError: Command '['mklink', '/d', 'C:\Amg\envs\ide_env\lib\site-packages\spacy\data\en_core_web_sm', 'C:\Amg\envs\ide_env\lib\site-packages\en_core_web_sm']' returned non-zero exit status 1.,

I believe some libraries are missing, but I don't know how to get them installed, and no matter how many times I tried to reinstall CDE2, even on different machines, it didn't get any further than today's results.

sparrowcolab commented 2 months ago

I then managed to download en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.
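
As the spaCy message above says, the model can also be used without the symlink by importing it as a module; a rough sketch, assuming the downloaded archive was installed with pip:

import en_core_web_sm  # installed beforehand, e.g. pip install en_core_web_sm-2.1.0.tar.gz

# Load the model directly from the package instead of the spacy/data shortcut link.
nlp = en_core_web_sm.load()
print([token.text for token in nlp(u'This is a test.')])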

Then, when it reached:

doc.models = [CurrentDensity]

the GPU was automatically activated, and CDE2 kept running for about 4 minutes with GPU utilization ranging between 40 and 100%, after which I got the following error message:

Initialising AllenNLP model ...
Automatically activating GPU support
Initialising AllenNLP model ✔
Traceback (most recent call last):

File "C:\Users\Amgad Ahmed\AppData\Local\Temp\ipykernel_8584\4033503847.py", line 8, in for record in doc.records:

File "C:\Amg\envs\ide_env\lib\site-packages\chemdataextractor\doc\document.py", line 393, in records for contextual_record in contextual_records:

NameError: name 'contextual_records' is not defined
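
(For context, CurrentDensity is my custom model, built following the extracting_a_custom_property tutorial. An illustrative sketch of that pattern, using the tutorial's boiling-point example rather than my exact code:)

from chemdataextractor.model import Compound, ModelType, StringType
from chemdataextractor.model.units.temperature import TemperatureModel
from chemdataextractor.parse import I
from chemdataextractor.parse.auto import AutoSentenceParser

# A custom property model: a specifier phrase plus an associated compound,
# with automatically generated parsers.
class BoilingPoint(TemperatureModel):
    specifier = StringType(parse_expression=I('boiling') + I('point'), required=True, contextual=True)
    compound = ModelType(Compound, required=True, contextual=True)
    parsers = [AutoSentenceParser()]

# Restrict extraction to this model, as with CurrentDensity above:
# doc.models = [BoilingPoint]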

Dingyun-Huang commented 2 months ago

Hi there,

There isn't an active line for contextual_record in contextual_records in document.py in the source code; you can check the source file here.

If you have edited the source code, please try reverting that change. If you haven't, run pip show chemdataextractor2 in your conda environment and note the CDE2 version. The newest version is 2.3.2.
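
You can also check the version from within Python; a quick sketch (this assumes the package exposes __version__, as CDE1 did):

import chemdataextractor

print(chemdataextractor.__version__)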

Best, Dingyun

sparrowcolab commented 2 months ago

Hi, I think the problem is partially solved: CDE2 is now running without triggering error messages, although the results are not the same as CDE1 in terms of extracted data. When you suggested running cde data download, I got the message that the files don't exist; when I checked the pages manually, I found a 404 page-not-found error. I then manually searched for the crf_bert and other missing files, installed them manually, and it worked.

Concerning the contextual records, you are correct. I reverted some modifications I had made to the source code, and everything is apparently running well.

Thank you so much for your help. I might seek your help again if any new issues pop up.

Best regards, Sparrow