davidberenstein1957 / concise-concepts

This repository contains an easy and intuitive approach to few-shot NER using most similar expansion over spaCy embeddings. Now with entity scoring.
MIT License

Still unable to pass in a custom Gensim model #10

Closed akshaydevml closed 1 year ago

akshaydevml commented 2 years ago

I raised an issue earlier about the same problem; @davidberenstein1957 committed a fix and posted this code block as the solution:

import spacy
from spacy import displacy

import concise_concepts

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato", "garlic", "onion", "beans"],
    "meat": ["beef", "pork", "fish", "lamb", "bacon", "ham", "meatball"],
    "dairy": ["milk", "butter", "eggs", "cheese", "cheddar", "yoghurt", "egg"],
    "herbs": ["rosemary", "salt", "sage", "basil", "cilantro"],
    "carbs": ["bread", "rice", "toast", "tortilla", "noodles", "bagel", "croissant"],
}

text = """
    Heat the oil in a large pan and add the Onion, celery and carrots.
    Then, cook over a medium–low heat for 10 minutes, or until softened.
    Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
    Later, add some oranges and chickens. """

model_path = "word2vec.model"

nlp = spacy.load("en_core_web_md", disable=["ner"])
nlp.add_pipe(
    "concise_concepts",
    config={
        "data": data,
        "model_path": model_path,
        "ent_score": True,
    },
)
doc = nlp(text)

options = {
    "colors": {
        "fruit": "darkorange",
        "vegetable": "limegreen",
        "meat": "salmon",
        "dairy": "lightblue",
        "herbs": "darkgreen",
        "carbs": "lightbrown",
    },
    "ents": ["fruit", "vegetable", "meat", "dairy", "herbs", "carbs"],
}

# Append each entity's confidence score to its label for display.
ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

displacy.render(doc, style="ent", options=options)

However, I am still getting the 'Word2Vec object is not iterable' error.

Could you please look into it?

davidberenstein1957 commented 2 years ago

Hello,

I believe this is resolved by installing the dependencies required by the package: Gensim >= 4.

Regards, David


akshaydevml commented 2 years ago

I am using Gensim 4.2.0 and still getting the error. I tried several different environments and got the same error each time.

davidberenstein1957 commented 2 years ago

Could you send me some reproducible code and files you are using?


akshaydevml commented 2 years ago

Sure, here is the code snippet I used:

import pandas as pd

df = pd.read_csv('IMDB Dataset.csv')

from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec

sent = [row.split() for row in df['review']]
phrases = Phrases(sent, min_count=30, progress_per=10000)
bigram = Phraser(phrases)
sentences = bigram[sent]

from gensim.models import Word2Vec

w2v_model = Word2Vec(
    min_count=20,
    window=2,
    vector_size=200,
    sample=6e-5,
    alpha=0.03,
    min_alpha=0.0007,
    negative=20,
)
w2v_model.build_vocab(sentences, progress_per=10000)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=10, report_delay=1)
w2v_model.save("film.model")

import spacy
from spacy import displacy
import concise_concepts

nlp = spacy.load('en_core_web_md', disable=["ner"])
data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
    "meat": ["beef", "pork", "fish", "lamb"]
}

model_path = "film.model"

nlp.add_pipe("concise_concepts", config={"data": data, "model_path": model_path})

prakhar251998 commented 2 years ago

Hi David, I am facing the same error when passing my custom-trained word2vec model. I have tried every scenario you posted earlier, and I have also followed the word2vec model documentation to train my model as prescribed. I still get the error, even for this code snippet:

import spacy
from spacy import displacy
import concise_concepts

data = {
    "display":["pixel","resolution","touchscreen"],
    "performace":['multitask','processor','graphics','ram','hang'],
    "storage":["internal","memory","expandable"],
    "camera" :["focus","resolution","flash","photos"],
    "Battery":["capacity","quick","charging"],
    "connectivity":['gps','bluetooth','wifi','sim'],
    "sensors":["light","proximity","compass","gyroscope"]
}

text = '''believe me, it's the slowest mobile I saw. Don't go on screen and Battery, it is an extremely slow mobile phone and takes ages to open and navigate. Forget about heavy use, it can't handle normal regular use. I made a huge mistake but pls don't buy this mobile. It's only a few months and I am thinking to change it. Its dam SLOW SLOW SLOW. '''

from gensim.test.utils import common_texts
from gensim.models import Word2Vec

model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")

model_path = "Word2vec.model"
nlp = spacy.load("en_core_web_lg", disable=['ner'])

# ent_score for entity confidence scoring
nlp.add_pipe("concise_concepts", config={"data": data,"model_path": model_path})
doc = nlp(text)

Error:

~\anaconda3\lib\site-packages\concise_concepts\conceptualizer\Conceptualizer.py in verify_data(self, verbose)
    107         for key, value in self.data.items():
    108             verified_values = []
--> 109             if key.replace(" ", "_") not in self.kv:
    110                 if verbose:
    111                     logger.warning(f"key {key} not present in word2vec model")

TypeError: argument of type 'Word2Vec' is not iterable
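
Python raises this kind of `TypeError` when the right-hand operand of `in` defines none of `__contains__`, `__iter__`, or `__getitem__`. The sketch below reproduces that mechanism with two stand-in classes. `Vectors` and `ModelWrapper` are hypothetical names for illustration, not Gensim classes, and the assumption (consistent with the traceback) is that `self.kv` held the full model object rather than its word vectors.

```python
class Vectors:
    """Stand-in for a word-vector store that supports `word in vectors`."""

    def __init__(self, words):
        self.key_to_index = {word: i for i, word in enumerate(words)}

    def __contains__(self, word):
        return word in self.key_to_index


class ModelWrapper:
    """Stand-in for a full model that only exposes its vectors via `.wv`."""

    def __init__(self, words):
        self.wv = Vectors(words)


model = ModelWrapper(["apple", "pear"])

# Membership on the inner vectors object works.
found_in_vectors = "apple" in model.wv

# Membership on the wrapper fails: it defines no __contains__, __iter__ or __getitem__.
try:
    "apple" in model
    wrapper_error = None
except TypeError as exc:
    wrapper_error = exc

print(found_in_vectors, type(wrapper_error).__name__)
```

If the same shape holds for the real objects, the membership test would need to run against the vectors object rather than the model that wraps it.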

davidberenstein1957 commented 1 year ago

I'm taking a look this week.

GenVr commented 1 year ago

@prakhar251998 I also have this problem. Have you solved it somehow?

prakhar251998 commented 1 year ago

Not yet, @GenVr. Waiting for @davidberenstein1957 to push a fix for this part.

davidberenstein1957 commented 1 year ago

Hello,

I made some initial progress last week and should be able to wrap it up this coming week.

Regards, David

GenVr commented 1 year ago

@davidberenstein1957 Thanks.

First

I don't know if this helps: I have gensim==4.2.0, and after a quick look at Conceptualizer.py it seems that in several functions (verify_data(), expand_concepts(), etc.) the error comes from a membership test like:

if key.replace(" ", "_") not in self.kv

However, self.kv here is not the vocabulary keys (I don't know whether this code expects self.kv to hold the vocab keys).

I tried replacing this test with:

keys_list = list(self.kv.wv.key_to_index.keys())
...
if key.replace(" ", "_") not in keys_list:
   ...

This happens multiple times in the library.

There are also other errors, such as self.kv.most_similar

that need to be:

self.kv.wv.most_similar

and others like this.

Even after correcting these errors, everything runs, but the model matches my words incorrectly.
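
The workaround above can be sketched as a small helper that accepts either a full model or a bare vectors object. This is a sketch under assumptions, not the library's eventual fix: `unwrap_vectors` and `in_vocab` are hypothetical names, and the stand-in objects only mimic the `wv` and `key_to_index` attributes that Gensim 4 exposes. It tests membership directly against the `key_to_index` dict instead of copying the keys into a list first.

```python
from types import SimpleNamespace


def unwrap_vectors(model):
    """Return model.wv when present, otherwise assume model already is the vectors object."""
    return getattr(model, "wv", model)


def in_vocab(model, key):
    """Check a label or word against the vocabulary, with spaces mapped to underscores."""
    vectors = unwrap_vectors(model)
    # key_to_index is a dict, so this is a hash lookup; no list copy is needed.
    return key.replace(" ", "_") in vectors.key_to_index


# Hypothetical stand-ins for a vectors object and a full model wrapping it.
vectors = SimpleNamespace(key_to_index={"ice_cream": 0, "apple": 1})
full_model = SimpleNamespace(wv=vectors)

print(in_vocab(full_model, "ice cream"))
print(in_vocab(vectors, "apple"))
print(in_vocab(full_model, "pixel"))
```

The same unwrapping would apply to calls such as `most_similar`, which in Gensim 4 live on the vectors object.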

Second

I also have a question, if possible. I'm new to Gensim, and I noticed that the keys of the given dictionary must be in the Word2Vec vocab.

Example:

data = {
    "word A": ["house", "home", ...],
    "word B": ['display', 'smartphone', ...],
}

model = Word2Vec(sentences=common_texts, ...)

...

nlp = spacy.load("en_core_web_lg", disable=['ner'])
nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True, "model_path": model_path})

So word A and word B need to be in the model vocab; otherwise, I get a key-not-found error. I guess the initial training sentences need to contain these keys?

Thanks
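
One way to check this before training is to count tokens in the corpus. Gensim's `Word2Vec` ignores words whose total frequency is below `min_count`, so a word that appears in the sentences can still be absent from the vocabulary. The sketch below approximates that rule with the standard library only. `expected_vocab` and `missing_terms` are hypothetical helper names, the corpus is made up, and effects such as phrase (bigram) merging are not modelled.

```python
from collections import Counter


def expected_vocab(sentences, min_count=1):
    """Words that occur at least min_count times across all tokenised sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}


def missing_terms(data, vocab):
    """Map each label to the terms (label first, then its words) absent from vocab."""
    report = {}
    for label, words in data.items():
        absent = [term for term in [label, *words] if term.replace(" ", "_") not in vocab]
        if absent:
            report[label] = absent
    return report


# Toy corpus and concept dictionary, made up for illustration.
sentences = [
    ["the", "display", "has", "a", "high", "resolution"],
    ["the", "display", "is", "a", "touchscreen"],
    ["great", "camera", "resolution"],
]
data = {"display": ["resolution", "touchscreen"], "camera": ["flash"]}

vocab = expected_vocab(sentences, min_count=2)
print(sorted(vocab))
print(missing_terms(data, vocab))
```

With `min_count=2`, "touchscreen" and "camera" are dropped even though they occur in the corpus, which is the kind of gap that later surfaces as missing keys or words.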

davidberenstein1957 commented 1 year ago

I just resolved this. @GenVr @prakhar251998 @akshaydevml thank you for the input!

GenVr commented 1 year ago

@davidberenstein1957 thanks. I have tried this code (with your new changes) but still have the error reported at the end.

import spacy
from spacy import displacy
import concise_concepts
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

data = {
"display":["pixel","resolution","touchscreen"],
"performace":['multitask','processor','graphics','ram','hang'],
"storage":["internal","memory","expandable"],
"camera" :["focus","resolution","flash","photos"],
"Battery":["capacity","quick","charging"],
"connectivity":['gps','bluetooth','wifi','sim'],
"sensors":["light","proximity","compass","gyroscope"]
}

text = '''believe me, it's the slowest mobile I saw. Don't go on screen and Battery, it is an extremely slow mobile phone and takes ages to open and navigate. Forget about heavy use, it can't handle normal regular use. I made a huge mistake but pls don't buy this mobile. It's only a few months and I am thinking to change it. Its dam SLOW SLOW SLOW.
'''

model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
model_path = "word2vec.model"

nlp = spacy.load("en_core_web_lg", disable=['ner'])
nlp.add_pipe("concise_concepts", config={"data": data,"model_path": model_path})

Error:


WARNING:concise_concepts.conceptualizer.Conceptualizer:key display not present in word2vec model
WARNING:concise_concepts.conceptualizer.Conceptualizer:word pixel from key display not present in word2vec model
WARNING:concise_concepts.conceptualizer.Conceptualizer:word resolution from key display not present in word2vec model
WARNING:concise_concepts.conceptualizer.Conceptualizer:word touchscreen from key display not present in word2vec model

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-4778ce6d6aae> in <module>
      1 nlp = spacy.load("en_core_web_lg", disable=['ner'])
----> 2 nlp.add_pipe("concise_concepts", config={"data": data,"model_path": model_path})

/usr/local/lib/python3.7/dist-packages/concise_concepts/conceptualizer/Conceptualizer.py in verify_data(self, verbose)
    182                 verified_values
    183             ), f"None of the entries for key {key} are present in the word2vec model"
--> 184         self.data = deepcopy(verified_data)
    185         self.original_data = deepcopy(self.data)
    186 

AssertionError: None of the entries for key display are present in the word2vec model

davidberenstein1957 commented 1 year ago

Hello,

This is actually expected behaviour, since you are trying to match a label and words that are not present in the trained word2vec model.

You initially get warnings about the missing keys and words, but since none of the data is available in the model, it then raises an error.

It did lead me to find another small implementation error in the n-gram support, so keep the feedback coming!

Regards, David
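
The behaviour described above (warn about individual missing entries, fail only when a key has no usable entries left) can be sketched as follows. This is an illustration of the described behaviour, not the library's code: `check_concepts` is a hypothetical name, the vocabulary is a made-up set, and it raises `ValueError` where the traceback shows the library using an assertion.

```python
import logging

logger = logging.getLogger("concept_check")


def check_concepts(data, vocab):
    """Keep only words found in vocab; warn about the rest; fail if a key ends up empty."""
    verified = {}
    for key, words in data.items():
        if key not in vocab:
            logger.warning("key %s not present in word2vec model", key)
        kept = []
        for word in words:
            if word in vocab:
                kept.append(word)
            else:
                logger.warning("word %s from key %s not present in word2vec model", word, key)
        if not kept:
            raise ValueError(f"None of the entries for key {key} are present in the word2vec model")
        verified[key] = kept
    return verified


# Made-up vocabulary standing in for a small trained model.
vocab = {"computer", "system", "user", "graph"}

# "pixel" is missing, but "computer" is present, so this only warns.
partial = check_concepts({"hardware": ["computer", "pixel"]}, vocab)

# Nothing under "display" is in the vocabulary, so this raises.
try:
    check_concepts({"display": ["pixel", "resolution"]}, vocab)
    failure = None
except ValueError as exc:
    failure = str(exc)

print(partial, failure)
```

Under this reading, the error in the report comes from training on a tiny corpus that contains none of the phone-related terms, rather than from the model-loading problem earlier in the thread.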
