incorrect result while running on large dataset

un-lock-me commented 2 years ago

Hello,

I am trying your tool and have run into a weird bug. I would really appreciate it if you could share your thoughts on this issue. I have a dataset of, let's say, 1000 instances (some positive, some negative, and the rest neutral). When I run the tool on the CSV file, only a portion of each category is labeled correctly! For example, "Great place" is labeled positive but "GREAT!" is labeled neutral. And if I remove the "Great place" instance from the dataset, then "Great" is labeled positive!!!!

So, I have tried different scenarios to find the bug, and the only conclusion I can draw is that it stops working correctly as the number of samples increases. But I don't get why??

I tried another scenario as well. I ran the code on the CSV file and saved the results back to the CSV file. Then I passed just "GREAT!" to the model right after it finished labeling the CSV file. It was labeled neutral again!! (If I pass "GREAT!" before running the model on the CSV file, it is labeled positive.) That kind of confirms what I said earlier.

Could you please share what the reason could be? The code seems very straightforward, so I don't know why this is happening.

Thanks in advance @cjhutto

cjhutto commented 2 years ago

Hi @un-lock-me ... this does seem strange, indeed. 1000 instances should be extremely easy for VADER (I and others routinely use it for files with thousands and millions of records). Would you mind sharing a sample of the structure of the CSV file and your pipeline/code to show how you are parsing and processing the CSV file?
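To make a report like this reproducible, a small check along the following lines could help. It is only a sketch under assumptions: the column name `text` and both function names are hypothetical, and `score` stands for whatever scoring callable the pipeline uses. It scores every row once in file order, re-scores the same rows afterwards in reverse order, and lists the rows whose score changed. A stateless scorer should produce an empty list.

```python
import csv
from typing import Callable, Iterable, List, Tuple


def read_texts(csv_lines: Iterable[str], column: str = "text") -> List[str]:
    """Read one text column from CSV lines, skipping empty cells.

    The column name "text" is an assumption; adjust it to the real file.
    """
    reader = csv.DictReader(csv_lines)
    return [row[column] for row in reader if row.get(column)]


def order_dependent_rows(
    texts: List[str], score: Callable[[str], object]
) -> List[Tuple[int, str]]:
    """Return (index, text) for rows whose score differs between two passes.

    Pass 1 scores the rows in file order (the batch run).
    Pass 2 re-scores the same rows afterwards, in reverse order.
    """
    first = [score(t) for t in texts]
    # Score in reverse order, then flip back so indices line up with `first`.
    second = [score(t) for t in reversed(texts)][::-1]
    return [
        (i, t)
        for i, (a, b, t) in enumerate(zip(first, second, texts))
        if a != b
    ]
```

With VADER, `score` could be something like `lambda t: analyzer.polarity_scores(t)["compound"]`, where `analyzer` is a single shared `SentimentIntensityAnalyzer` instance. If the returned list is empty, the differences most likely come from how the CSV is parsed or how labels are assigned, not from the number of rows.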

Siddharth-Latthe-07 commented 2 months ago

@un-lock-me, VADER is lexicon- and rule-based: it looks up the words of a phrase in its sentiment lexicon and returns scores, with the compound score ranging between -1 and +1. You may get different sentiment outputs when words are sent individually ("Great", "place") than when they are sent together as a phrase ("Great place"). Apart from this, the difference in the output for the word "great" is probably related to how the model handles punctuation and capitalization: "Great!" and "Great" can receive different scores even though the word is the same, because VADER's rules adjust intensity for exclamation marks and for capitalization used as emphasis. Hope this helps. Thanks.
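Since the labels in question (positive, neutral, negative) are derived from the compound score, it may be worth checking how that mapping is done in the pipeline. A minimal sketch, assuming the typical cut-offs of +0.05 and -0.05 suggested in the VADER README; the function name `label_from_compound` is hypothetical and not part of the library:

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, +1] to a coarse sentiment label.

    The default threshold of 0.05 follows the typical values suggested in
    the VADER README; it is a convention, not something the library enforces.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

Printing the raw compound score next to the label for rows such as "GREAT!" would show whether the text is really scored near zero or whether the label is assigned from a different field or a different threshold.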

cjhutto / vaderSentiment #134