CentreForDigitalHumanities / tscan

T-scan: an analysis tool for Dutch texts to assess the complexity of the text, based on original work by Rogier Kraf
GNU Affero General Public License v3.0

improve lemmatisation #63

Open lukavdplas opened 1 year ago

lukavdplas commented 1 year ago

There are some noticeable inaccuracies in the output from the Frog lemmatiser (such as *heden not being lemmatised to *heid); perhaps we can improve the lemmatisation.

One option is to add a different lemmatisation service that can be used instead of Frog. We should investigate whether there is a lemmatiser for Dutch with significantly better results.

Another option is to use some combination of the Frog and Alpino output for the final lemmatisation. Suggestion from @oktaal:

Interestingly enough, the Alpino parse does seem to find "gedweeheid" as the lemma in this case, but that information is not used in T-Scan.

What I could do is use the lemma attribute of the Alpino parse when (1) Frog's lemma is identical to the word (so no lemmatisation took place) and (2) Alpino's lemma does differ. If both have a lemma that differs from the word, Frog takes precedence. If Frog correctly sees that the lemma is identical to the word and Alpino nevertheless turned it into something else, this does introduce an error.

I wonder to what extent this could create new problems; ideally you would want to be able to evaluate this. Perhaps this should become an option (lemma information: Frog only (the current behaviour), Alpino only, Frog with Alpino fallback, Alpino with Frog fallback).
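The fallback rule and the four proposed options can be sketched as follows. This is only an illustration of the logic described above, not T-Scan code: the names `LemmaSource` and `select_lemma` are hypothetical, and the word form "gedweeheden" in the usage note is an assumed example.

```python
from enum import Enum


class LemmaSource(Enum):
    """Hypothetical option values for the four strategies proposed above."""
    FROG_ONLY = "frog"
    ALPINO_ONLY = "alpino"
    FROG_WITH_ALPINO_FALLBACK = "frog+alpino"
    ALPINO_WITH_FROG_FALLBACK = "alpino+frog"


def _with_fallback(word: str, primary: str, fallback: str) -> str:
    # Use the fallback only when the primary lemma equals the word
    # (no lemmatisation happened) and the fallback lemma does differ.
    if primary == word and fallback != word:
        return fallback
    return primary


def select_lemma(word: str, frog_lemma: str, alpino_lemma: str,
                 source: LemmaSource = LemmaSource.FROG_ONLY) -> str:
    """Pick the final lemma for one word according to the chosen strategy."""
    if source is LemmaSource.FROG_ONLY:
        return frog_lemma
    if source is LemmaSource.ALPINO_ONLY:
        return alpino_lemma
    if source is LemmaSource.FROG_WITH_ALPINO_FALLBACK:
        return _with_fallback(word, frog_lemma, alpino_lemma)
    return _with_fallback(word, alpino_lemma, frog_lemma)
```

With Frog leaving "gedweeheden" unchanged and Alpino returning "gedweeheid", `select_lemma("gedweeheden", "gedweeheden", "gedweeheid", LemmaSource.FROG_WITH_ALPINO_FALLBACK)` returns the Alpino lemma. When both tools change the word, the primary tool wins.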

kosloot commented 1 year ago

Some remarks: We all know Frog isn't perfect, but it already knows about 880 different heden to heid lemmas, which isn't that bad. The Frog lemmatizer does miss some (not all) forms outside those 880.

It isn't that hard to train extra lemmas into Frog's data files. All you need is a list of word <tab> lemma <tab> POStag cases. That way you could improve things rather quickly, imho.

Background: the lemmatizer is trained on sentences from CGN, with additions from an extra list of known lemmas, which can easily be expanded with other missing lemmas too. Training can be done using froggen. Not a very difficult task once you are used to it :P

If you have a list in the right format, I am willing to do this task, IFF we may use this data to add to the Frog project as a whole (see also the froggen manual).
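A minimal sketch of producing one line in the word <tab> lemma <tab> POStag format described above. The helper name `format_entry` is hypothetical, and the rejection of empty fields and embedded tabs is an assumption about what a clean list should look like, not a documented froggen requirement.

```python
def format_entry(word: str, lemma: str, pos_tag: str) -> str:
    """Return one list line: word, lemma and POS tag separated by tabs."""
    fields = (word, lemma, pos_tag)
    for field in fields:
        # Assumption: an empty field or an embedded tab/newline would
        # corrupt the three-column layout, so refuse it up front.
        if not field or "\t" in field or "\n" in field:
            raise ValueError(f"invalid field: {field!r}")
    return "\t".join(fields)


line = format_entry("elektro-encefalografen", "elektro-encefalograaf",
                    "N(soort,mv,basis)")
```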

lukavdplas commented 1 year ago

Thanks! I'm not very familiar with Frog myself, so I did not know that it could be retrained. This may be the best way forward for us. What do you think, @oktaal?

oktaal commented 1 year ago

Definitely. Thank you! We are going to collect a new list for Frog. You will see it some time in the future @kosloot

kosloot commented 1 year ago

Looking forward to it. In the meantime I improved froggen a bit, to make everybody's life more comfortable.

NB: the POS tags must be from the CGN set.

oktaal commented 7 months ago

@kosloot hi! It's been a bit over a year, but here we finally have a list: https://github.com/CentreForDigitalHumanities/dutch-plurals/blob/main/output.tsv (I also have some time to look into it again now; it's been sitting there for a while). I'm not quite sure whether I also need to add rows containing just singulars; e.g. it currently only has rows like this:

EK's    EK  N(eigen,meervoud,basis)

I could easily add rows such as:

EK  EK  N(eigen,enkelvoud,basis)

if needed.

Words which cannot be pluralized, such as "adrenaline", "marmer", "toedoen", etc., only have a row such as:

adrenaline  adrenaline  N(basis,enkelvoud,basis)

I'm not sure if these need to be marked in some other way.

kosloot commented 7 months ago

Thanks. I hope to be able to look into this "rsn". Some remarks after skimming the data:

werkenelektro-encefalograaf N(basis,meervoud,basis)

Frog is already trained with:

elektro-encefalografen elektro-encefalograaf N(soort,mv,basis)

so your entry

  1. will lose the 'soort' tag,
  2. uses 'basis' twice, which might make the software barf (not tested),
  3. uses 'meervoud' where CGN uses 'mv' (based on the CGN tags as defined by Van Eynde, 2004).

So maybe it is wise to reconsider this list a bit.

kosloot commented 7 months ago

Additional remark: one of the lines reads:

Puerto Rico Puerto Rico N(eigen,enkelvoud,basis)

But multiword entries are NOT supported, so this entry will be skipped
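The problems flagged so far (multiword entries, 'meervoud' instead of 'mv', a feature used twice) could be caught before submitting the list with a small checker. This is a sketch under assumptions: the function name `check_entry` is hypothetical, the file is assumed to be tab-separated, and only the long/short pairs that appear in this thread are mapped. It does not validate against the full CGN tag set.

```python
import re

# A tag is assumed to look like HEAD(feature,feature,...), e.g. N(soort,mv,basis).
TAG_PATTERN = re.compile(r"^([A-Z]+)\(([^()]*)\)$")

# Long forms seen in this thread and the CGN abbreviations used instead.
NON_CGN_FEATURES = {"meervoud": "mv", "enkelvoud": "ev"}


def check_entry(line: str) -> list[str]:
    """Return a list of problems found in one word<tab>lemma<tab>tag line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return [f"expected 3 tab-separated fields, found {len(fields)}"]
    word, lemma, tag = fields
    problems = []
    if " " in word or " " in lemma:
        problems.append("multiword entry")
    match = TAG_PATTERN.match(tag)
    if match is None:
        problems.append("tag is not of the form HEAD(feature,...)")
        return problems
    features = match.group(2).split(",")
    for feature in features:
        if feature in NON_CGN_FEATURES:
            problems.append(f"'{feature}' should be '{NON_CGN_FEATURES[feature]}'")
    # Report each repeated feature once, in a stable order.
    for feature in sorted({f for f in features if features.count(f) > 1}):
        problems.append(f"feature '{feature}' occurs more than once")
    return problems
```

Running it on the Puerto Rico line reports both the multiword entry and the 'enkelvoud' spelling, while the entry Frog is already trained with passes without problems.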

oktaal commented 5 months ago

It's been a while again, but I've updated the list to no longer have multiword entries (there were just two) and modified the tags to be compliant with Van Eynde's format. I wonder if I also need to add word gender?

kosloot commented 5 months ago

Ok, we are getting close. But there are still a few problems:

  1. a lot of words are tagged as N(soort,ev,basis), but Van Eynde uses a more fine-grained set:

    • [T101] N(soort,ev,basis,zijd,stan) die stoel, deze muziek, de filter
    • [T102] N(soort,ev,basis,onz,stan) het kind, ons huis, het filter
    • [T104] N(soort,ev,basis,gen) 's avonds, de heer des huizes
    • [T106] N(soort,ev,basis,dat) ter plaatse, heden ten dage
    • [U117] N(soort,ev,basis,genus,stan) een riool, geen filter

    I could probably modify Frog to do 'fuzzy matching', where N(soort,ev,basis) matches N(soort,ev,basis,onz,stan), but that is a lot of work and might have an unknown impact.
  2. a lot of proper names are tagged as N(eigen,ev,basis); here too, Van Eynde is more specific:

    • [T109] N(eigen,ev,basis,zijd,stan) de Noordzee, de Kemmelberg, Karel
    • [T110] N(eigen,ev,basis,onz,stan) het Hageland, het Nederlands
    • [T112] N(eigen,ev,basis,gen) des Heren, Hagelands trots
    • [T114] N(eigen,ev,basis,dat) wat den Here toekomt
    • [U118] N(eigen,ev,basis,genus,stan) Linux, Esselte

    BUT, the Frog tagger is trained on data where all proper names are tagged as SPEC(deeleigen), so this is a nice shortcut I would advise. To be clear: whenever the tagger tags a word as SPEC(deeleigen), the lemmatizer will take a shortcut and use the word as the lemma, effectively ignoring the lemma assigned in the lemmata data.

  3. one entry is ietsje ietsje N(soort,ev,dim); this tag is also not known, but probably N(soort,ev,dim,onz,stan) will do?
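To make the 'fuzzy matching' idea from point 1 concrete: a coarse tag could be treated as matching a fine-grained tag when it has the same head and its features are a prefix of the fine-grained features. This is a hypothetical sketch of that idea only. It is not how Frog works today, and `parse_tag` and `tag_matches` are made-up names.

```python
import re

# A tag is assumed to look like HEAD(feature,feature,...).
TAG_PATTERN = re.compile(r"^([A-Z]+)\(([^()]*)\)$")


def parse_tag(tag: str) -> tuple[str, tuple[str, ...]]:
    """Split a tag such as N(soort,ev,basis) into its head and features."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a valid tag: {tag!r}")
    head, body = match.groups()
    features = tuple(body.split(",")) if body else ()
    return head, features


def tag_matches(coarse: str, fine: str) -> bool:
    """True if `coarse` has the same head as `fine` and its features
    form a prefix of the features of `fine`."""
    coarse_head, coarse_features = parse_tag(coarse)
    fine_head, fine_features = parse_tag(fine)
    return (coarse_head == fine_head
            and fine_features[:len(coarse_features)] == coarse_features)
```

Under this rule N(soort,ev,basis) matches N(soort,ev,basis,onz,stan) but not N(soort,ev,dim,onz,stan). A prefix rule is the simplest possible choice. It matches every gender and case variant at once, which is exactly the kind of unknown impact mentioned above.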

That's all folks

kosloot commented 5 months ago

Ok, it is even more complex than I thought. I took a better look at the second case, the N(eigen,mv,basis) words, and not ALL of them should be (exclusively) tagged as SPEC(deeleigen). There is also a range of N(soort,mv,basis) tags among them: words like Zuid-Molukkers and Zwitsers.

And others are to be seen as an ADJ, not as an N. These are cases like:

Zuid-Nederlandse Zuid-Nederlands ADJ(prenom,basis,met-e,stan)
Zuid-Amerikaans Zuid-Amerikaans ADJ(prenom,basis,zonder)
Zuid-Amerikaanse Zuid-Amerikaans ADJ(prenom,basis,met-e,stan)

To cite Van Eynde:

Nominally (or independently) used adjectives are not treated as nouns, but as adjectives.

(page 20 of my copy)

BUT: some of these cases are ambiguous, so we will also need: Zuid-Amerikaanse Zuid-Amerikaanse SPEC(deeleigen) or Zuid-Chinese Zuid-Chinese SPEC(deeleigen)

This is quite clumsy, but that's the historic way
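Why the same word form may need several entries can be illustrated with a lookup keyed on both word and tag, combined with the SPEC(deeleigen) shortcut described earlier in the thread. This is a simplified illustration, not Frog's actual implementation: `lemma_for` and the in-memory `LEMMATA` table are hypothetical, and falling back to the word itself for unknown pairs is an assumption.

```python
# Entries taken from the examples in this thread, keyed on (word, POS tag).
LEMMATA = {
    ("Zuid-Amerikaanse", "ADJ(prenom,basis,met-e,stan)"): "Zuid-Amerikaans",
    ("Zuid-Nederlandse", "ADJ(prenom,basis,met-e,stan)"): "Zuid-Nederlands",
}


def lemma_for(word: str, tag: str, lemmata: dict[tuple[str, str], str]) -> str:
    """Look up the lemma for a tagged word."""
    # Shortcut described above: a word tagged SPEC(deeleigen) is its own
    # lemma, whatever the lemma data says.
    if tag == "SPEC(deeleigen)":
        return word
    # Assumption: an unknown (word, tag) pair falls back to the word itself.
    return lemmata.get((word, tag), word)
```

Here "Zuid-Amerikaanse" yields "Zuid-Amerikaans" when tagged as an adjective and stays "Zuid-Amerikaanse" when tagged SPEC(deeleigen), which is why the tag assigned by the tagger decides which lemma is used.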