Closed AlineBornschein closed 2 years ago
Unfortunately, I haven't managed to reproduce the error. Which versions of spacy
and skweak
are you using? Could it be that the train.spacy
data was generated by an older version that is no longer compatible?
Thanks for the quick reply! I'm using the latest versions: skweak==0.3.1, spacy==3.2.3
Could you let me know if the example on https://github.com/NorskRegnesentral/skweak/blob/main/examples/quick_start.ipynb works for you (including in particular the last part that runs the spacy training script)?
Hi, the example script works for me, apart from the last line of code, where the spaCy model is trained. That part never finishes.
As for my classification model, I assume there are conflicts between my labelling functions, as some of them may overlap and label the same spans. That may be why writing to a spaCy doc results in the error: "ValueError: [E1010] Unable to set entity information for token 10 which is included in more than one span in entities, blocked, missing or outside." I resolved this by filtering those conflicts out. The "unhashable list" error still persists, even when run on a different machine.
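For illustration, the filtering strategy described above (keep at most one span per token, preferring longer spans) can be sketched in plain Python. `filter_overlaps` is a hypothetical helper operating on `(start, end, label)` token offsets; spaCy ships an equivalent for `Span` objects as `spacy.util.filter_spans`:

```python
def filter_overlaps(spans):
    """Greedy overlap resolution: longer spans win, ties broken by start position.

    This mirrors the strategy of spacy.util.filter_spans, but on plain
    (start, end, label) token-offset tuples.
    """
    result = []
    claimed = set()
    # Sort by descending length, then ascending start
    for start, end, label in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if all(tok not in claimed for tok in range(start, end)):
            result.append((start, end, label))
            claimed.update(range(start, end))
    return sorted(result)

print(filter_overlaps([(0, 3, "ORG"), (2, 4, "PERSON"), (5, 6, "GPE")]))
# → [(0, 3, 'ORG'), (5, 6, 'GPE')]  — the PERSON span loses token 2 to the longer ORG span
```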
I don't really know what might cause this problem, unfortunately. Could you send me a minimal example I could test?
Hi, I don't know whether it's related, but I get the same error: TypeError: unhashable type: 'list'
Here is how I obtain the error:
Then in another script:
It seems that the error is caused by voting.MajorityVoter, since I do not have the error when removing the majority voter from my pipeline.
Here is the full trace:
Traceback (most recent call last):
File "fit_model.py", line 48, in <module>
docs = get_docs(db_path)
File "fit_model.py", line 29, in get_docs
docs = list(db.get_docs(nlp.vocab))
File "/home/leguilln/workspace/nlp/corpus_annotation/skweak-corpus-annot/src/skweak-env/lib/python3.8/site-packages/spacy/tokens/_serialize.py", line 152, in get_docs
doc.spans.from_bytes(self.span_groups[i])
File "/home/leguilln/workspace/nlp/corpus_annotation/skweak-corpus-annot/src/skweak-env/lib/python3.8/site-packages/spacy/tokens/_dict_proxies.py", line 96, in from_bytes
group = SpanGroup(doc).from_bytes(value_bytes)
File "spacy/tokens/span_group.pyx", line 223, in spacy.tokens.span_group.SpanGroup.from_bytes
File "/home/leguilln/workspace/nlp/corpus_annotation/skweak-corpus-annot/src/skweak-env/lib/python3.8/site-packages/srsly/_msgpack_api.py", line 27, in msgpack_loads
msg = msgpack.loads(data, raw=False, use_list=use_list)
File "/home/leguilln/workspace/nlp/corpus_annotation/skweak-corpus-annot/src/skweak-env/lib/python3.8/site-packages/srsly/msgpack/__init__.py", line 79, in unpackb
return _unpackb(packed, **kwargs)
File "srsly/msgpack/_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
TypeError: unhashable type: 'list'
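For context, this TypeError is not specific to skweak or spaCy: Python raises it whenever a mutable list is used where a hashable value is required, such as a dictionary key, which is what happens when msgpack decodes a map whose key deserialises to a list. A minimal illustration:

```python
entry = {}
try:
    # Lists are mutable and therefore unhashable, so they cannot be dict keys;
    # msgpack hits the same wall when a decoded map key turns out to be a list.
    entry[["start", "end"]] = "LABEL"
except TypeError as e:
    print(e)  # unhashable type: 'list'

# The immutable tuple equivalent is hashable and works fine:
entry[("start", "end")] = "LABEL"
print(entry)  # {('start', 'end'): 'LABEL'}
```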
Upon applying the config file in order to train the textcat model with the following command:
!spacy init config - --lang en --pipeline ner --optimize accuracy | \
  spacy train - --paths.train ./train.spacy --paths.dev ./train.spacy \
  --initialize.vectors en_core_web_md --output train
I receive the following error message:
[i] Saving to output directory: train
[i] Using CPU
=========================== Initializing pipeline ===========================
2022-03-27 15:49:59.778883: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-27 15:49:59.778913: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-03-27 15:49:59.798942: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-27 15:49:59.798976: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[2022-03-27 15:50:05,376] [INFO] Set up nlp object from config
[2022-03-27 15:50:05,395] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-03-27 15:50:05,395] [INFO] Created vocabulary
[2022-03-27 15:50:07,968] [INFO] Added vectors: en_core_web_md
[2022-03-27 15:50:08,292] [INFO] Finished initializing nlp object
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\Scripts\spacy.exe\__main__.py", line 7, in <module>
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\_util.py", line 71, in setup_cli
command(prog_name=COMMAND)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\typer\main.py", line 497, in wrapper
return callback(**use_params)  # type: ignore
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 45, in train_cli
train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 72, in train
nlp = init_nlp(config, use_gpu=use_gpu)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\initialize.py", line 84, in init_nlp
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\language.py", line 1308, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\pipeline\tok2vec.py", line 215, in initialize
validate_get_examples(get_examples, "Tok2Vec.initialize")
File "spacy\training\example.pyx", line 65, in spacy.training.example.validate_get_examples
File "spacy\training\example.pyx", line 44, in spacy.training.example.validate_examples
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 142, in __call__
for real_eg in examples:
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 164, in make_examples
for reference in reference_docs:
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 199, in read_docbin
for doc in docs:
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens\_serialize.py", line 150, in get_docs
doc.spans.from_bytes(self.span_groups[i])
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens\_dict_proxies.py", line 54, in from_bytes
group = SpanGroup(doc).from_bytes(value_bytes)
File "spacy\tokens\span_group.pyx", line 170, in spacy.tokens.span_group.SpanGroup.from_bytes
File "C:\ProgramData\Anaconda3\lib\site-packages\srsly\_msgpack_api.py", line 27, in msgpack_loads
msg = msgpack.loads(data, raw=False, use_list=use_list)
File "C:\ProgramData\Anaconda3\lib\site-packages\srsly\msgpack\__init__.py", line 79, in unpackb
return _unpackb(packed, **kwargs)
File "srsly\msgpack\_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
TypeError: unhashable type: 'list'
This seems like a dependency issue. What is causing it, and is there a way to fix it?
Also: is the following error message a problem? "[E1010] Unable to set entity information for token 10 which is included in more than one span in entities, blocked, missing or outside." Or can it be avoided by simply applying the following?
for document in train_data:
    try:
        document.ents = document.spans["hmm"]
        skweak.utils.docbin_writer(train_data, "train.spacy")
    except Exception as e:
        print(e)
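As an alternative to catching E1010, the overlaps can be resolved up front with spaCy's built-in `spacy.util.filter_spans`, which keeps the longest non-overlapping spans (the earlier span wins on ties). A self-contained sketch; the sentence, labels, and the `"hmm"` span-group key below are just illustrative:

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("Heidelberg University is in Heidelberg")

# Two overlapping candidate spans, as conflicting labelling functions can produce
doc.spans["hmm"] = [Span(doc, 0, 2, label="ORG"), Span(doc, 1, 3, label="MISC")]

# Assigning overlapping spans to doc.ents raises E1010; filter_spans resolves
# the conflict first, so the assignment succeeds
doc.ents = filter_spans(doc.spans["hmm"])
print([(ent.text, ent.label_) for ent in doc.ents])
# → [('Heidelberg University', 'ORG')]
```

Note that this silently drops the losing spans, so it is worth checking afterwards how many spans were discarded.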