cligs / tmw

Topic Modeling Workflow in Python

le, mí, era, eras in lemmata lists #13

Closed morethanbooks closed 9 years ago

morethanbooks commented 9 years ago

Many of the wordles contained words that are clearly not nouns in Spanish, such as: le, mi, eras, fue, creo, sé ...

At first I thought this was a TreeTagger mistake, but now I think the problem is in the make_lemmatext function. There, in the branch for mode == "esN", the code says that if the lemma column contains an alternative (|), the word form should be taken. Looking at the tagged documents confirms that these non-noun words do have alternatives in their lemma (token, POS, lemma):

    le      PPC     él|le
    eras    VLfin   erar|ser
    creo    VLfin   crear|creer|creer
    sé      VLfin   saber|ser
    fue     VLfin   ir|ser
    mi      PPO     mi|mío

As we can see, in all these examples it is ambiguous which lemma the word belongs to, but the POS is not ambiguous, and it is clear that they are not nouns. I don't know what TreeTagger produces for French, but in general I would say that, for Spanish, an ambiguous lemma alone should not be enough for a word to be included in the topic modelling.
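To make the diagnosis concrete, here is a minimal sketch that parses the rows quoted above and counts how many have an ambiguous lemma versus how many are also tagged as common nouns. It assumes the tagger output is tab-separated in token / POS / lemma order; `parse_tagged` is a hypothetical helper for illustration, not a function from tmw.

```python
# Rows copied from the examples in this issue, assumed to be tab-separated
# in token / POS / lemma order.
TAGGED = """\
le\tPPC\tél|le
eras\tVLfin\terar|ser
creo\tVLfin\tcrear|creer|creer
sé\tVLfin\tsaber|ser
fue\tVLfin\tir|ser
mi\tPPO\tmi|mío
"""


def parse_tagged(text):
    """Split tagger output into (token, pos, lemma) tuples (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # skip malformed or empty lines
            rows.append(tuple(parts))
    return rows


rows = parse_tagged(TAGGED)

# Rows whose lemma column offers more than one candidate.
ambiguous = [row for row in rows if "|" in row[2]]

# Of those, the rows that are actually tagged as common nouns (NC).
ambiguous_nouns = [row for row in ambiguous if "NC" in row[1]]

print(len(ambiguous), len(ambiguous_nouns))
```

All six rows have an ambiguous lemma and none of them is tagged NC, which is why a check on "|" alone lets them through.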

christofs commented 9 years ago

The intended logic is: if the lemma is ambiguous and the POS is a noun, then take the word form. The second condition is indeed missing in the code. I suggest using the following:

elif mode == "esN":
    if "|" in lemma and "NC" in pos:
        lemmata.append(token.lower())
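For context, the condition could sit in a loop like the following sketch. This is not the actual make_lemmatext code: `select_lemmata` is a hypothetical name, the handling of unambiguous nouns (taking the lemma) is an assumption, and the two noun rows are invented for illustration rather than real tagger output.

```python
def select_lemmata(rows, mode="esN"):
    """Collect noun entries from (token, pos, lemma) rows (hypothetical sketch)."""
    lemmata = []
    for token, pos, lemma in rows:
        if mode == "esN":
            if "NC" not in pos:
                # Not a common noun: skip, even if the lemma is ambiguous.
                continue
            if "|" in lemma:
                # Ambiguous lemma and noun POS: fall back to the word form.
                lemmata.append(token.lower())
            else:
                # Assumption: unambiguous nouns contribute their lemma.
                lemmata.append(lemma.lower())
    return lemmata


rows = [
    ("fue", "VLfin", "ir|ser"),          # from the issue: verb, must be dropped
    ("mi", "PPO", "mi|mío"),             # from the issue: possessive, must be dropped
    ("casas", "NC", "casa"),             # invented row: unambiguous noun
    ("Cuentas", "NC", "cuenta|cuento"),  # invented row: ambiguous noun
]
result = select_lemmata(rows)
print(result)
```

With the POS check in place, the verb and the possessive are dropped, and only the two rows tagged NC are kept.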

christofs commented 9 years ago

Please test the latest version!

morethanbooks commented 9 years ago

Now it works :). But we are now using only common nouns and not proper nouns. Do we want that? I thought topic modelling normally works with proper nouns as well...

christofs commented 9 years ago

It may work, but it makes a lot of sense not to use the character names (included among the proper nouns), because otherwise you get a lot of character name topics. So this is really project-dependent.
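Since the choice is project-dependent, one option would be to make it a switch. The sketch below is only an illustration: `is_wanted_noun` and `include_proper` are hypothetical names that do not exist in tmw, and it assumes that the Spanish TreeTagger tagset marks proper nouns as NP alongside NC for common nouns.

```python
def is_wanted_noun(pos, include_proper=False):
    """Return True if the POS tag is a noun type the project wants to keep.

    Assumption: "NC" marks common nouns and "NP" marks proper nouns.
    Substring matching mirrors the '"NC" in pos' check suggested above.
    """
    wanted = ("NC", "NP") if include_proper else ("NC",)
    return any(tag in pos for tag in wanted)


# Default: common nouns only, so character names stay out of the topics.
print(is_wanted_noun("NC"), is_wanted_noun("NP"))

# Opt in to proper nouns for projects where they are useful.
print(is_wanted_noun("NP", include_proper=True))
```

Keeping the default at common nouns only preserves the behaviour described above, while projects that want proper nouns can opt in explicitly.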