NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

TypeError: unhashable type: 'list' #45

Closed AlineBornschein closed 2 years ago

AlineBornschein commented 2 years ago

When I apply a config file to train a textcat model with the following command:

!spacy init config - --lang en --pipeline ner --optimize accuracy | \
spacy train - --paths.train ./train.spacy --paths.dev ./train.spacy \
--initialize.vectors en_core_web_md --output train

I receive the following error message:

[i] Saving to output directory: train
[i] Using CPU

=========================== Initializing pipeline ===========================
2022-03-27 15:49:59.778883: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-27 15:49:59.778913: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-03-27 15:49:59.798942: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-27 15:49:59.798976: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[2022-03-27 15:50:05,376] [INFO] Set up nlp object from config
[2022-03-27 15:50:05,395] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-03-27 15:50:05,395] [INFO] Created vocabulary
[2022-03-27 15:50:07,968] [INFO] Added vectors: en_core_web_md
[2022-03-27 15:50:08,292] [INFO] Finished initializing nlp object
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\spacy.exe\__main__.py", line 7, in <module>
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\typer\main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 45, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 72, in train
    nlp = init_nlp(config, use_gpu=use_gpu)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\initialize.py", line 84, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\language.py", line 1308, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\pipeline\tok2vec.py", line 215, in initialize
    validate_get_examples(get_examples, "Tok2Vec.initialize")
  File "spacy\training\example.pyx", line 65, in spacy.training.example.validate_get_examples
  File "spacy\training\example.pyx", line 44, in spacy.training.example.validate_examples
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 142, in __call__
    for real_eg in examples:
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 164, in make_examples
    for reference in reference_docs:
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 199, in read_docbin
    for doc in docs:
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens\_serialize.py", line 150, in get_docs
    doc.spans.from_bytes(self.span_groups[i])
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens\_dict_proxies.py", line 54, in from_bytes
    group = SpanGroup(doc).from_bytes(value_bytes)
  File "spacy\tokens\span_group.pyx", line 170, in spacy.tokens.span_group.SpanGroup.from_bytes
  File "C:\ProgramData\Anaconda3\lib\site-packages\srsly\_msgpack_api.py", line 27, in msgpack_loads
    msg = msgpack.loads(data, raw=False, use_list=use_list)
  File "C:\ProgramData\Anaconda3\lib\site-packages\srsly\msgpack\__init__.py", line 79, in unpackb
    return _unpackb(packed, **kwargs)
  File "srsly\msgpack\_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
TypeError: unhashable type: 'list'

Seems like a dependency issue. What is the reason for it? And is there a way to fix it?

Also: is the following error message a problem? "[E1010] Unable to set entity information for token 10 which is included in more than one span in entities, blocked, missing or outside." Or can it be avoided by simply applying the following?

for document in train_data:
    try:
        document.ents = document.spans["hmm"]
        skweak.utils.docbin_writer(train_data, "train.spacy")
    except Exception as e:
        print(e)

plison commented 2 years ago

Unfortunately, I can't reproduce the error. Which versions of spacy and skweak are you using? Could the train.spacy data have been generated by an older version that is no longer compatible?

AlineBornschein commented 2 years ago

Thanks for the quick reply! I'm using the latest versions. skweak==0.3.1 spacy==3.2.3
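Since the first traceback loads spacy from the user site-packages directory and srsly from the Anaconda one, it may help to report where each package is installed as well as its version. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and including srsly in the list is an assumption based on the last frames of the trace.

```python
from importlib.metadata import PackageNotFoundError, distribution


def installed_versions(packages):
    """Map each package name to (version, install location), or None if missing."""
    found = {}
    for name in packages:
        try:
            dist = distribution(name)
        except PackageNotFoundError:
            found[name] = None
            continue
        # locate_file("") resolves to the directory that contains the package.
        found[name] = (dist.version, str(dist.locate_file("")))
    return found


if __name__ == "__main__":
    # srsly is listed because the failing frame is srsly's msgpack unpacker.
    for name, info in installed_versions(["spacy", "skweak", "srsly"]).items():
        if info is None:
            print(f"{name}: not installed")
        else:
            print(f"{name}=={info[0]}  ({info[1]})")
```

Run it with the same interpreter that runs `spacy train`, so the reported locations match the ones in the traceback.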

plison commented 2 years ago

Could you let me know if the example on https://github.com/NorskRegnesentral/skweak/blob/main/examples/quick_start.ipynb works for you (including in particular the last part that runs the spacy training script)?

AlineBornschein commented 2 years ago

Hi, the example script works for me, apart from the last line of code, where the spacy model is trained. That part never finishes.

As for my classification model, I assume there are conflicts between my labelling functions, as some of them might overlap and label the same spans. That could be why writing to the spacy doc results in the error: "ValueError: [E1010] Unable to set entity information for token 10 which is included in more than one span in entities, blocked, missing or outside." I resolved this by filtering these conflicts out. The unhashable-list error still persists, even when run on a different machine.
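The kind of conflict filtering described above can be sketched on plain `(start, end, label)` token spans, so that it runs without spaCy. This is an illustration under assumptions, not the filtering actually used in this thread: the function name `filter_overlapping` and the sample spans are hypothetical. spaCy ships `spacy.util.filter_spans`, which applies a similar longest-span-first rule to `Span` objects.

```python
def filter_overlapping(spans):
    """Keep a non-overlapping subset of (start, end, label) token spans.

    Longer spans win; on equal length, the span that starts first wins.
    `end` is exclusive, as in spaCy.
    """
    # Longest first, then earliest start.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end, label in ordered:
        tokens = range(start, end)
        if not any(i in taken for i in tokens):
            kept.append((start, end, label))
            taken.update(tokens)
    return sorted(kept)  # back in document order


# Hypothetical spans: token 10 is covered by two of them, as in E1010.
spans = [(8, 11, "ORG"), (10, 12, "PERSON"), (0, 2, "GPE")]
print(filter_overlapping(spans))  # [(0, 2, 'GPE'), (8, 11, 'ORG')]
```

Once no token belongs to more than one span, assigning the result to `doc.ents` should no longer trigger E1010.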

plison commented 2 years ago

I don't really know what might cause this problem, unfortunately. Could you send me a minimal example I could test?

nleguillarme commented 1 year ago

Hi, I don't know if it is related, but I get the same error: TypeError: unhashable type: 'list'

Here is how I obtain the error:

Then in another script:

It seems that the error is caused by voting.MajorityVoter, since I do not have the error when removing the majority voter from my pipeline.

Here is the full trace

Traceback (most recent call last):
  File "fit_model.py", line 48, in <module>
    docs = get_docs(db_path)
  File "fit_model.py", line 29, in get_docs
    docs = list(db.get_docs(nlp.vocab))
  File "/home/leguilln/workspace/nlp/corpus_annotation/skweak-corpus-annot/src/skweak-env/lib/python3.8/site-packages/spacy/tokens/_serialize.py", line 152, in get_docs
    doc.spans.from_bytes(self.span_groups[i])
  File "/home/leguilln/workspace/nlp/corpus_annotation/skweak-corpus-annot/src/skweak-env/lib/python3.8/site-packages/spacy/tokens/_dict_proxies.py", line 96, in from_bytes
    group = SpanGroup(doc).from_bytes(value_bytes)
  File "spacy/tokens/span_group.pyx", line 223, in spacy.tokens.span_group.SpanGroup.from_bytes
  File "/home/leguilln/workspace/nlp/corpus_annotation/skweak-corpus-annot/src/skweak-env/lib/python3.8/site-packages/srsly/_msgpack_api.py", line 27, in msgpack_loads
    msg = msgpack.loads(data, raw=False, use_list=use_list)
  File "/home/leguilln/workspace/nlp/corpus_annotation/skweak-corpus-annot/src/skweak-env/lib/python3.8/site-packages/srsly/msgpack/__init__.py", line 79, in unpackb
    return _unpackb(packed, **kwargs)
  File "srsly/msgpack/_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
TypeError: unhashable type: 'list'
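Both traces end in the msgpack unpacker called from `SpanGroup.from_bytes`. Neither trace shows the payload, so the following is only a generic illustration of how this `TypeError` arises in Python, not a diagnosis: building a dict fails when a key is a list. The assumption here is that the serialized span group contains a map with an array as a key; `build_mapping` is a hypothetical stand-in for the decoder.

```python
def build_mapping(pairs):
    """Build a dict from (key, value) pairs, as a decoder does for a serialized map."""
    mapping = {}
    for key, value in pairs:
        mapping[key] = value  # the key must be hashable
    return mapping


# A tuple key is hashable, so this works.
ok = build_mapping([((0, 3), "ORG")])

# A list key is not hashable, so the same data with a list as key fails.
try:
    build_mapping([([0, 3], "ORG")])
    message = None
except TypeError as err:
    message = str(err)

print(ok)
print(message)  # the message mentions: unhashable type: 'list'
```

If that assumption holds, the problem lies in what was written into the span groups before `docbin_writer` was called rather than in the training command itself.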