blmoistawinde / HarvestText

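Text mining and preprocessing toolkit (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods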
MIT License

[BUG] - Fix `entity_discover()` function parameter bug #37

Closed francis-du closed 3 years ago

francis-du commented 3 years ago

Code:

em_dict, et_dict, mention_count = ht.entity_discover(text="\n".join(comments), return_count=True, min_count=3,
                                                     method="NFL",
                                                     threshold=0.97)
all_mentions = {x for ments in em_dict.values() for x in ments}
print(f"Num entities: {len(em_dict)}, Num mentions: {len(all_mentions)}")

Error:

Doing NER
100%|██████████| 1303/1303 [00:03<00:00, 403.01it/s]
Training fasttext
Traceback (most recent call last):
  File "/Users/francisdu/Code/Python/pass_comments/passenger/entity_discover.py", line 18, in <module>
    threshold=0.97)
  File "/Users/francisdu/.pyenv/versions/3.6.9/lib/python3.6/site-packages/harvesttext/word_discover.py", line 232, in entity_discover
    min_count, pinyin_tolerance, self.pinyin_adjlist, **kwargs)
  File "/Users/francisdu/.pyenv/versions/3.6.9/lib/python3.6/site-packages/harvesttext/algorithms/entity_discoverer.py", line 134, in __init__
    min_n, max_n)
  File "/Users/francisdu/.pyenv/versions/3.6.9/lib/python3.6/site-packages/harvesttext/algorithms/entity_discoverer.py", line 150, in train_emb
    id2word = [wd for wd in id2word if wd in model.wv.vocab]
  File "/Users/francisdu/.pyenv/versions/3.6.9/lib/python3.6/site-packages/harvesttext/algorithms/entity_discoverer.py", line 150, in <listcomp>
    id2word = [wd for wd in id2word if wd in model.wv.vocab]
  File "/Users/francisdu/.pyenv/versions/3.6.9/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 646, in vocab
    "The vocab attribute was removed from KeyedVector in Gensim 4.0.0.\n"
AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

This error is caused by breaking API changes in Gensim 4.0.0, which removed the `vocab` attribute from `KeyedVectors`. See:

  1. Migrating-from-Gensim-3.x-to-4
francis-du commented 3 years ago

@blmoistawinde Hey bro, do you have time to review these changes?

blmoistawinde commented 3 years ago

Thanks for the PR! I also added a check for backward compatibility, so older Gensim versions keep working.
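
For readers hitting the same traceback, a backward-compatible vocabulary check could look roughly like the sketch below. This is not the code that was merged. The helper names `vocab_lookup` and `filter_known_words` are hypothetical, and the `SimpleNamespace` objects are stand-ins for a real `model.wv`, so the snippet runs without Gensim installed. It relies only on what the error message states: Gensim 4 exposes `key_to_index`, while Gensim 3.x exposes `vocab`.

```python
from types import SimpleNamespace


def vocab_lookup(wv):
    """Return an object that supports `word in ...` for either Gensim API.

    Hypothetical helper: Gensim >= 4.0 exposes `key_to_index`,
    while Gensim 3.x exposes `vocab`.
    """
    if hasattr(wv, "key_to_index"):
        return wv.key_to_index  # Gensim 4.x style
    return wv.vocab  # Gensim 3.x style


def filter_known_words(words, wv):
    """Keep only the words that the embedding model knows about."""
    known = vocab_lookup(wv)
    return [wd for wd in words if wd in known]


# Stand-ins for `model.wv` under each API (assumption: real models
# expose the same attribute names as these fakes).
wv_v4 = SimpleNamespace(key_to_index={"beijing": 0, "shanghai": 1})
wv_v3 = SimpleNamespace(vocab={"beijing": object(), "shanghai": object()})

candidates = ["beijing", "guangzhou", "shanghai"]
print(filter_known_words(candidates, wv_v4))
print(filter_known_words(candidates, wv_v3))
```

Checking for `key_to_index` first avoids touching `vocab` on Gensim 4, where accessing it raises the `AttributeError` shown in the traceback.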

francis-du commented 3 years ago

Good job