R1j1t / contextualSpellCheck

✔️Contextual word checker for better suggestions
MIT License

[BUG] #59

Closed BradenAnderson closed 3 years ago

BradenAnderson commented 3 years ago

Describe the bug

I apologize in advance if the issue I am about to describe turns out to be user error rather than an actual bug. This is my first time using spaCy and contextualSpellCheck; I believe I am using them correctly, but there is always the chance I am not.

That said, my application uses contextualSpellCheck to check the spelling in tweets and recommend fixes for misspelled words. In doing this, I have found that the spelling corrections are almost always incorrect, and oftentimes completely illogical.

For example, in the tweet:

"@user all #smiles when #media is silly joke flirt mischief excitement #pressconference in #antalya #turkey sunday #throwback love happy happy love happy love"

contextualSpellCheck flags the word "flirt" as misspelled (which it is not) and recommends the illogical spelling correction "#".

Please see the image below for a few more examples.

[image: illogical_spelling_corrections]

I have a dataset of over 20k tweets and have written a function that processes a given number of them with spaCy and contextualSpellCheck, storing every top spelling-correction option in a csv file. Using this function (link provided below) you can easily reproduce the issue and generate as many examples of these incorrect suggestions as needed.

To Reproduce

Steps to reproduce the behavior:

1. Download the colab notebook here: https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/01_2_Data_Cleaning_Spacy_and_Spellcheck.ipynb

2. Download the tweet data here: https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/Train_Test_Datasets/train_tweets_with_emojis_clean.csv

3. Download two more supporting data files from these two links:

https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/Supporting_Data_Files/contractions.csv

https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/Supporting_Data_Files/sms_speak.csv

4. Run the colab notebook. The test driver is called in cell 47 and will generate an output csv file showing all the recommendations contextualSpellCheck made. You can change the test "start" and "end" indexes in the test driver function call to run the test on a different set of tweets and create a new csv file.

5. Inspect the output csv file to determine whether the spelling suggestions are reasonable. The output csv file will have a name formatted as:

num1_to_num2_spellcheck_test_results.csv 

where num1 and num2 are the start and end tweet indexes you passed to the test driver function.

Expected behavior

I expected the spelling recommendations to be at least reasonable. Instead, in many cases the suggestions change a correctly spelled, readable word to a single punctuation mark such as "." or "#".


Additional information

Please be aware that the way I structured the code to capture all of this information about what contextualSpellCheck is doing requires a lot of RAM. I do not recommend running the test driver function on more than 50 tweets at a time.

As I mentioned at the beginning, I am inexperienced with both spaCy and contextualSpellCheck. I have shared my current use case and implementation, and I welcome any advice on how to use these tools better. Beyond getting spellcheck working, I want to find a way to process all of these tweets without generating so many Doc objects at once; I believe this is what is causing the high RAM usage, which has led to slow performance and crashes.
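For reference, here is a minimal sketch of the kind of streaming approach I mean, using spaCy's nlp.pipe so each Doc can be garbage-collected once its row is written. The pipe registration matches the name used later in this thread, and the function name and CSV columns are just illustrative, not the notebook's actual code:

import csv

import spacy
import contextualSpellCheck  # noqa: F401  (registers the "contextual spellchecker" factory)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("contextual spellchecker")

def write_suggestions(tweets, out_path):
    # nlp.pipe streams Docs in batches; once a row is written, nothing
    # keeps the Doc alive, so memory stays roughly constant instead of
    # growing with the number of tweets.
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["tweet", "suggestions"])
        for doc in nlp.pipe(tweets, batch_size=32):
            writer.writerow([doc.text, doc._.suggestions_spellCheck])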

Despite all of that, as far as I can tell contextualSpellCheck is not giving reasonable spelling recommendations for this tweet-processing application. I have dedicated a significant amount of time to troubleshooting this, and I finally decided to raise the issue with you. I would really like to use your tool if it can be used for this task. Please help me understand whether this is an actual bug or some kind of user error.

Thank you, Braden

R1j1t commented 3 years ago

Hi @BradenAnderson. I will be honest here: based on the cases I checked, I thought it was working okay, though not great.

Your statement:

That said, my application uses contextualSpellCheck to check the spelling in tweets and recommend fixes for misspelled words. In doing this, I have found that the spelling corrections are almost always incorrect, and oftentimes completely illogical.

I will not deny this claim. There is still a lot of work to be done, and that is why the version is 0.x. Possible investigation points for very bad corrections are the [MASK]-filling BERT model and the spell-error identification (pending, unfixed #44). At present it defaults to bert-base-cased (link). The current logic for spelling correction is as follows (a rough sketch in code follows the list):

  1. Provide a spaCy model: this breaks the sentence into tokens. As this model is trained on a particular language (tweet-specific models also exist), it knows the nuances.
  2. Check the token against the transformer model's vocab: if the token is not present, consider it a spelling error.
  3. Mask the OOV word and use the transformer model to predict words to replace the mask.
  4. Check the edit distance to see which candidate is closest syntactically.
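A minimal sketch of that logic, assuming the Hugging Face transformers fill-mask pipeline and a plain Levenshtein distance; the function name suggest_correction and the whole-word masking are illustrative simplifications, not the library's actual implementation:

from transformers import AutoTokenizer, pipeline

MODEL = "bert-base-cased"  # the library's current default, per above
tokenizer = AutoTokenizer.from_pretrained(MODEL)
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=tokenizer)

def edit_distance(a, b):
    # Levenshtein distance via a single-row dynamic program.
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (ca != cb))
    return row[-1]

def suggest_correction(sentence, word):
    # Assumes `word` occurs verbatim in `sentence`.
    # Step 2: a token missing from the model vocab is treated as misspelled.
    if word in tokenizer.get_vocab():
        return word
    # Step 3: mask the OOV word and let the model propose replacements.
    masked = sentence.replace(word, tokenizer.mask_token, 1)
    candidates = [p["token_str"].strip() for p in fill_mask(masked)]
    # Step 4: keep the candidate syntactically closest to the original.
    return min(candidates, key=lambda c: edit_distance(word, c))

Step 4 may also hint at why punctuation can win: if every predicted candidate is far from the original word, the minimum edit distance can still land on a short token such as "#".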

The default model (bert-base-cased) was likely not trained on tweets, whose syntax is very different from, for example, news text. Expecting the library to work out of the box on the entire set of use cases is a big ask; try to debug the code (it is not that complex), or if you want, create a fork and play around with the code to see what is causing the issue.

BradenAnderson commented 3 years ago

Hi Raajat,

Thanks for providing this response. I was honestly expecting that after I described the issue, whoever responded would point out some error I had made in my implementation that was leading to these results.

I have tried to debug with no luck, but I will look through the links and information you provided and try again. If I have some extra time, it would be fun to fork the repo and do some more in-depth debugging.

I hope my original post didn't come across as harsh. I really thought I had just been using the library wrong, so in my description I tried to give plenty of detail about what I had done and the results. Like you said, you are taking on a big challenge; building something like this is no easy task. If I can't get it working this time around, I'd be happy to try again sometime in the future.

I'll keep looking at it and if I come up with any insight I'll send over another message. Thanks again for the response.

-Braden

R1j1t commented 3 years ago

@BradenAnderson, I finally got the time to look at your question again. I think you should try passing one of these models to contextualSpellCheck. See the code below for reference:

import html

import spacy
import contextualSpellCheck  # noqa: F401  (registers the "contextual spellchecker" factory)

nlp = spacy.load("en_core_web_sm")
# max_edit_dist is a good parameter to play with and see the effect on the results
nlp.add_pipe("contextual spellchecker", config={"model_name": "vinai/bertweet-base", "max_edit_dist": 4})

sent = html.unescape(<SENTENCE>)  # <SENTENCE>: the raw tweet text
doc = nlp(sent)
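For context: vinai/bertweet-base is pretrained on English tweets, so hashtags, @-mentions, and tweet-style spellings are much closer to its training distribution than to bert-base-cased's. A larger max_edit_dist also lets the checker accept candidates that are further from the original token.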

I programmatically checked 15 sentences; the results are in the table below:

Output Table

| ID in Dataset | original sentence | misspell | suggested |
|---------------|-------------------|----------|-----------|
| 18223 | the places i'll go. #photography #smileyface #smile #eggs #toilet #flush #food… | | |
| 9544 | #siilyfaces #family #cousins #love #lasvegas #fremontstreet @ freemont st expeirence | | |
| 10450 | 2of2 needs to know about you being the problem and not the solution for our black community, @user like mateen. @user | | |
| 11018 | about to study for my next few #speeches and work on product development. | | |
| 6658 | astounded with #catfish. no one knows who they are really talking to over the net. #weird #naive #lonely #sick | | |
| 4562 | a sign of the evil to come... #epic #anger #shout #man #manga awork by anna riley | awork | art |
| 4029 | yayy!!! that show i definitely on my list of things!!! | | |
| 2519 | @user @user what a wonderful photo, i like the spontaneity of expressions | | |
| 21900 | i miss my boyfriend already #relationshipgoals #missing | | |
| 5333 | about to sta @user book thinking differently. families at my school are reading, too! #education | | |
| 2893 | @user #micommunity - launching on 20th june @user | | |
| 9225 | there becomes a time in life where you just got to stop caring and just go with what's meant to happen and be at peace with it. # truth | | |
| 17287 | can #lighttherapy help with or #depression? #altwaystoheal #healthy is #happy !! | | |
| 24312 | i imagine it would be a lot like this. #imaginary #conversations #life #style #lifestyle… | | |
| 1558 | missed having cute costa dates;) | | |

For the example you mentioned earlier:

doc = nlp("@user all #smiles when #media is silly joke flirt mischief excitement #pressconference in #antalya #turkey sunday #throwback love happy happy love happy love")
print(doc._.suggestions_spellCheck)    #{antalya: 'italy'}
print(doc._.outcome_spellCheck)        #@user all #smiles when #media is silly joke flirt mischief excitement #pressconference in #italy #turkey sunday #throwback love happy happy love happy love

I hope this works for your use case. Try out different models and different edit distances as well, and do let me know your observations.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.