andrewtavis / kwx

BERT, LDA, and TFIDF based keyword extraction in Python
BSD 3-Clause "New" or "Revised" License
TEXT Cleaning #49

Open mgg-new opened 1 year ago

mgg-new commented 1 year ago

module 'emoji' has no attribute 'get_emoji_regexp'

andrewtavis commented 1 year ago

Hey @mgg-new! Thanks for letting me know :) This is likely something to do with a new version of emoji. kwx is set up with version 1.2.0, and they're now on 2.2.0. Would you have interest in helping with this? It should actually be an easy fix where we just figure out what the new name for get_emoji_regexp is and do the update :)

Thanks again!

mgg-new commented 1 year ago

https://carpedm20.github.io/emoji/docs/

The function get_emoji_regexp() was removed in 2.0.0. Internally the module no longer uses a regular expression when scanning for emoji in a string (e.g. in demojize()). The regular expression was slow in Python 3 and it failed to correctly find certain combinations of long emoji (emoji consisting of multiple Unicode codepoints). If you used the regular expression to remove emoji from strings, you can use replace_emoji() as shown in the examples above. If you want to extract emoji from strings, you can use emoji_list() as a replacement. If you want to keep using a regular expression despite its problems, you can create the expression yourself.
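
For illustration, here is a minimal sketch of how emoji removal could be written so that it works on both sides of the API change. It is not the code used in kwx: the helper name `remove_emoji` and the fallback pattern `_FALLBACK_EMOJI_RE` are hypothetical, and the fallback only covers a few common emoji ranges for the case where the emoji package is not installed at all.

```python
import re

try:
    import emoji  # third-party package; optional for this sketch
except ImportError:
    emoji = None

# Hypothetical fallback: a rough pattern covering a few common emoji blocks.
# It is NOT exhaustive and is only used when the emoji package is missing.
_FALLBACK_EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def remove_emoji(text: str) -> str:
    """Strip emoji from text, whichever emoji API is available."""
    # Newer emoji releases: replace_emoji() is the documented replacement.
    if emoji is not None and hasattr(emoji, "replace_emoji"):
        return emoji.replace_emoji(text, replace="")
    # Older releases (before 2.0.0) still expose get_emoji_regexp().
    if emoji is not None and hasattr(emoji, "get_emoji_regexp"):
        return emoji.get_emoji_regexp().sub("", text)
    # No emoji package installed: fall back to the rough pattern above.
    return _FALLBACK_EMOJI_RE.sub("", text)


if __name__ == "__main__":
    print(repr(remove_emoji("Thanks \U0001F60A")))
```

Checking for the attribute rather than the package version keeps the sketch short. Pinning a minimum emoji version in the dependencies and calling replace_emoji() directly would be the simpler option if older releases don't need to be supported.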

andrewtavis commented 1 year ago

Thanks, @mgg-new! Appreciate you taking the time to detail it all. I think I'll have a bit more bandwidth in about a week or so to look into all this :) I'll be in touch! 😊

andrewtavis commented 1 year ago

@mgg-new, I updated emoji in the dependencies and changed the spot where get_emoji_regexp was used to the new method. Thanks for the research you put into this :) I just released v1.0.2 to account for this shift. There was an error in the tests for the PR, but the local ones passed, so as of now I'm not really going to worry about it.

andrewtavis commented 1 year ago

Will leave this issue open in case that failed test or anything related to it causes problems later 🙂