un-lock-me opened this issue 2 years ago
@bfelbo could you please shed some light on this? It would be really appreciated to know how we can use this feature. I was able to fine-tune my dataset; however, when I set `return_attention=True` in the `model_def.py` script, it raises this error:
```
Traceback (most recent call last):
  File "finetune_youtube_last.py", line 35, in <module>
    data['batch_size'], method='last')
  File "path/DeepMoji/deepmoji/finetuning.py", line 385, in finetune
    evaluate=metric, verbose=verbose)
  File "path/DeepMoji/deepmoji/finetuning.py", line 442, in tune_trainable
    callbacks=callbacks, verbose=(verbose >= 2))
  File "/path/.conda/envs/deepmoji/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/path/.conda/envs/deepmoji/lib/python2.7/site-packages/keras/engine/training.py", line 2013, in fit_generator
    val_x, val_y, val_sample_weight)
  File "/path/.conda/envs/deepmoji/lib/python2.7/site-packages/keras/engine/training.py", line 1413, in _standardize_user_data
    exception_prefix='target')
  File "/path/.conda/envs/deepmoji/lib/python2.7/site-packages/keras/engine/training.py", line 121, in _standardize_input_data
    'Found: array with shape ' + str(data.shape))
ValueError: The model expects 2 target arrays, but only received one array. Found: array with shape (778, 3)
```
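Reading the traceback, my guess (untested) is that with `return_attention=True` the model has two output tensors, the class predictions and the attention weights, so Keras expects a target array for each, while the fine-tuning code only supplies the labels. Maybe training with a single output and rebuilding the attention model afterwards would work? (`finetuned.h5` is just a placeholder name; `nb_classes`, `data`, and `nb_tokens` as in `finetune_youtube_last.py`.)

```python
# Untested sketch: fine-tune with a single output, then rebuild the same
# architecture with return_attention=True and reload the trained weights.
from deepmoji.model_def import deepmoji_transfer, deepmoji_architecture
from deepmoji.finetuning import finetune

model = deepmoji_transfer(nb_classes, data['maxlen'], PRETRAINED_PATH)
model, acc = finetune(model, data['texts'], data['labels'], nb_classes,
                      data['batch_size'], method='last')
model.save_weights('finetuned.h5')  # placeholder filename

attn_model = deepmoji_architecture(nb_classes=nb_classes, nb_tokens=nb_tokens,
                                   maxlen=data['maxlen'], return_attention=True)
attn_model.load_weights('finetuned.h5')  # same layers, one extra output
```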
I would really appreciate it if you could have a quick look and let me know your thoughts, @bfelbo.
By any chance, do you have any idea why this is happening, @ryanleary? Thanks~
If it helps, I used the YouTube dataset that is available in your source code and ran `finetune_youtube_last.py`, and I am getting the exact same error. What I mean is that even when running the code as provided, with the dataset from your source code, I cannot get the attention weights!
I did change `return_attention = True`. Before changing this, the code runs smoothly and gets 0.90 accuracy, but once I change this variable it raises the error.
I would really appreciate it if you could share anything that could help me troubleshoot this. @ryanleary @bfelbo
You can find an example of how to compute the emotional impact of words at https://github.com/somul18/DeepMoji/blob/master/examples/score_texts_emojis_aw.py. Hope this is helpful. @un-lock-me
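The key part is building the model with `return_attention=True`, so that `predict` also returns one attention weight per token. Roughly (paths and helpers as in the standard DeepMoji examples):

```python
# Rough sketch of the core of score_texts_emojis_aw.py.
import json
from deepmoji.sentence_tokenizer import SentenceTokenizer
from deepmoji.model_def import deepmoji_emojis
from deepmoji.global_variables import PRETRAINED_PATH, VOCAB_PATH

maxlen = 30
with open(VOCAB_PATH, 'r') as f:
    vocabulary = json.load(f)
st = SentenceTokenizer(vocabulary, maxlen)
tokenized, _, _ = st.tokenize_sentences([u'That is bad'])

model = deepmoji_emojis(maxlen, PRETRAINED_PATH, return_attention=True)
prob, attention = model.predict(tokenized)  # one attention weight per token
```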
@somul18 Thanks so much for getting back to me, I really appreciate it. When I try to run your script it needs `deepmoji_weights.hdf5`, but it's not available. Could you please let me know if I need to do anything before running this script?
@somul18 Please ignore my previous question (I had forgotten that I cloned the repository again and the weights I had downloaded were gone).
Just a quick question, and I would really appreciate it if you could confirm it. Let's say I pass an example and get this result:

```
[u' And they didn't like it . That is bad ', 0.62056778371334076, 55, 37, 32, 1, 43, 0.20648907, 0.17305556, 0.14249285, 0.049436379, 0.049093928, u'bad', u'that', u'is', u'they', u'like']
```
I can see that `that` received a high score here, and not only in this example: I passed many other examples and `that`, `this`, ... got high scores too. Do you have any idea why this is? Again, thanks so much for sharing your code here, it helped me a lot!
In my opinion, you should read it as a trigram ('that is frustrating') and a bigram ('they hung'). If you only want unigrams like 'frustrating' or 'hung', you should implement a filter for stop words before reporting the relevant words. Hope this helps. @un-lock-me
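For instance (my sketch, assuming NLTK and its stopwords corpus, which this repo does not ship):

```python
# Filter stop words out of the reported top-scoring words.
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

stop = set(stopwords.words('english'))

def strip_stopwords(words):
    # Drop function words so only content words are reported.
    return [w for w in words if w.lower() not in stop]

print(strip_stopwords([u'bad', u'that', u'is', u'they', u'like']))
# -> [u'bad', u'like']  ('like' is not in NLTK's English stop list)
```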
@somul18 Sorry, and thanks again for sharing your thoughts with me. I went ahead and tried to remove stop words before passing the data to the model, but for some weird reason I kept getting a `ValueError: All sentences should be Unicode-encoded!` error on this line: `tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)`.
Unfortunately the code is in Python 2, which is why I cannot install a newer version of pandas, and consequently I cannot use `encoding_errors=` while reading my CSV file.
When I pass the file without removing stop words it works perfectly, but when I add this code it raises the error mentioned above:

```python
df['text_without_stopwords'] = df['text'].apply(
    lambda x: ' '.join([word.encode('latin1', 'ignore').decode('latin1')
                        for word in x.split() if word not in stop]))
TEST_SENTENCES = df['text_without_stopwords']
```
I have tried reading the file with `encoding='utf-8'`, `latin-1`, and two more encodings, but nothing changed.
In case it helps: when I use a 10-row sample of my data and do the same thing as above, it works, so apparently there are some bad samples in the dataset that cause this error, and I cannot think of any other way to fix it. I would really appreciate it if you could share any thoughts you have. Thanks again.
@un-lock-me My suggestion is to use `astype`: `df['column'] = df['column'].astype('unicode')`
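Roughly, with the column from your snippet (the tokenizer insists that every sentence is a unicode object under Python 2):

```python
# Cast the pandas column so every entry is unicode before tokenizing.
df['text_without_stopwords'] = df['text_without_stopwords'].astype('unicode')
TEST_SENTENCES = df['text_without_stopwords'].tolist()
tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)
```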
Wow, I cannot believe this worked! Why is that different from the way I treated it? And why does this happen only when I do stop-word removal? Thanks a lot, I very much appreciate it.
@somul18 I am really sorry if I am spamming you. While experimenting on my dataset, something caught my attention. I passed my dataset (it has three labels: 1. positive, 2. negative, 3. neutral) to the `finetune_youtube_last.py` script and added `model.save('new_model.h5')` at the end of that script to save the model.
Then I updated your script `score_texts_emojis_aw.py` to load this new model. I expected the code to run smoothly; however, this error arises:

```
return getattr(obj, method)(*args, **kwds)
ValueError: kth(=-2) out of bounds (3)
```

on this line of the code: `ind = np.argpartition(array, -k)[-k:]`
Now I am confused: isn't the number of top words we select unrelated to the number of labels we have? If so, what could be the source of this error? Thanks a lot again, I really appreciate your time!
I think I am lost. I reviewed the code and it seems to me it should not depend on the number of labels (`top_elements` is for the top words per sentence, right?). By this logic it should be OK, but it still raises the error, even when I use the same data you provided with the script, as long as I load the weights I trained on my dataset with three labels.
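One thing I noticed while staring at numpy's message (just my speculation): `kth(=-2) out of bounds (3)` is what `np.argpartition(array, -5)` reports on a length-3 array, because numpy adds the length 3 to `kth=-5` before complaining. So maybe the script is asking for the top 5 scores from my 3 class probabilities? Would a guard like this be the right fix?

```python
import numpy as np

def top_elements(array, k):
    # argpartition raises "kth out of bounds" when k exceeds len(array);
    # a 3-class probability vector cannot yield a top 5, so clamp k first.
    k = min(k, len(array))
    ind = np.argpartition(array, -k)[-k:]
    return ind[np.argsort(array[ind])][::-1]
```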
I understand that I am taking up a lot of your time and I am really sorry, but would you please have a quick look and share your thoughts, @somul18? Thanks so much~
@un-lock-me Not sure if you are filtering the stop words before applying the model. My suggestion is to filter the stop words after applying the model, when reporting your results. Hope this helps. Maybe you can post the code so I can review it.
Thanks a lot for your response. That's what I did. These are the steps: I ran `finetune_youtube_last.py` (with stop words removed). I'm not sure which script I should share, because I used the same script from the repository for training; the only difference is the dataset with three labels.
If you try the IMDB dataset, which has three labels, you will face the same error! @somul18
These are the only lines I added to your script:

```python
PRETRAINED_PATH = 'model_new.h5'
model = deepmoji_emojis(maxlen, PRETRAINED_PATH, return_attention=True)
```
`model_new.h5` is the model trained on a dataset with three labels using the `finetune_youtube_last.py` script, and `model.save('model_new.h5')` is what I added to `finetune_youtube_last.py` to save my model. Also, I changed the code a little for reading the data, since my data was in CSV format, but I don't think that has anything to do with this error. If it helps, I can share that as well. @somul18
I also trained on my data without removing the stop words, and the same error arises. So it seems that whenever I reload new weights that I trained myself, it raises this error. @somul18
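One idea I want to try (untested): `deepmoji_emojis` builds the 64-emoji architecture, while my saved weights are for 3 classes, so maybe I should rebuild the 3-class model myself and ask for at most 3 top scores:

```python
# Untested idea: rebuild the fine-tuned 3-class architecture (instead of
# the 64-emoji one) before loading the saved weights; nb_tokens and
# maxlen as in the example scripts.
from deepmoji.model_def import deepmoji_architecture

model = deepmoji_architecture(nb_classes=3, nb_tokens=nb_tokens,
                              maxlen=maxlen, return_attention=True)
model.load_weights('model_new.h5')
```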
Hi @bfelbo, and first of all thanks for sharing your great work. I have a dataset whose domain is a little different from Twitter. I have a couple of questions and would really appreciate it if you could help me with them. To start, I fine-tuned on my dataset and got the accuracy. However, what is important to me is being able to find the impact of each word in a sentence (the same highlighting you have in the demo). For example, for "This disease is very dangerous", not only do I get the negative label, but I also get the weight associated with "dangerous".
I saw this PR (https://github.com/bfelbo/DeepMoji/pull/8); is that what I need? If so, could you please give me some information on what I need to do in order to get what I want? I changed the `attention_weight` param in the `attlayer` script to `True`, but nothing happened in the output.
Again thanks so much for the great work!