berenslab / chatgpt-excess-words

Delving into ChatGPT usage in academic writing through excess vocabulary
MIT License

To lemmatize or not to lemmatize? #1

Open atsyplenkov opened 4 days ago

atsyplenkov commented 4 days ago

Hi, guys. Thank you for your research. It is extremely interesting and valuable for the community, and I mean it! I am curious why you didn't use lemmatization or stemming of the words prior to the analysis. Is it only because of the extra computational cost, or is there another reason I am missing?

From my perspective, your current approach may underestimate the frequency ratios of some words. For example, Figure 2 lists both "delves" and "delved" as separate entries, so the frequency of "delve" itself should be higher than reported if all forms were pooled.

I am asking because I am planning to conduct similar research on Earth Science manuscripts, looking for excess words specific to my domain.

dkobak commented 4 days ago

To be honest, the main reason was simplicity, but a secondary reason was that we thought it might actually be interesting to look at all forms separately -- e.g. "delves", "delved" and "delve" may each increase in usage by a different amount (because ChatGPT may prefer one specific form particularly often).

In retrospect, I think it would actually be more sensible to lemmatize everything. We may change the analysis in future revisions, or possibly add a supplementary analysis with/without lemmatization. It also depends on how the peer review process goes.

It should be relatively straightforward -- something like the code given at https://scikit-learn.org/stable/modules/feature_extraction.html#tips-and-tricks:

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer:
    """Tokenizer that lemmatizes every token with WordNet."""
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

# Plug the lemmatizing tokenizer into the count vectorizer
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())
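
One caveat if you try this: WordNetLemmatizer lemmatizes as a noun by default, so inflected verb forms may pass through unchanged unless you supply a POS tag (or add POS tagging to the tokenizer). A quick sanity check, assuming the NLTK wordnet data has already been downloaded:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# The default POS is "n" (noun), so verb forms are not reduced:
print(wnl.lemmatize("delved"))           # stays "delved"
# With an explicit verb POS, the forms collapse as intended:
print(wnl.lemmatize("delved", pos="v"))  # -> "delve"
print(wnl.lemmatize("delves", pos="v"))  # -> "delve"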

Let me know if you try it out!

atsyplenkov commented 4 days ago

Thanks for that, I will let you know! Good luck with the peer review.