dracor-org / fredracor

French Drama Corpus

unresolved pair characters with 'et' in many French plays #20

Open DanilSko opened 1 year ago

DanilSko commented 1 year ago

FreDraCor networks seem to be plagued with glued-together characters like "Urgande et Alquif" or "Corisande et Florestan". In some 17th-century plays such characters easily make up 30-50% of the network nodes, which basically invalidates the whole network. See the attached example network for Amadis by Philippe Quinault.
[Attached image: amadins_quinault]

@lucagiovannini7 is going to check some of the plays relevant to his PhD research, but we need a more systematic solution for the whole corpus, especially since this looks automatable (split by ' et '). Of course, there are also many harder cases, such as 'first african', 'second african', and 'both africans' making 3 nodes instead of 2, which also affects network metrics. But even resolving all "A et B" cases would be a giant leap for FreDraCor.

lehkost commented 1 year ago

Well spotted, @DanilSko! We could probably apply these changes when converting our correction branch for upstream "Théâtre Classique" at https://github.com/dracor-org/theatre-classique to DraCor format – what do you think, @cmil?

In the cited play it looks like this:

<sp stage="decor/location" who="URGANDE ET ALQUIF">
  <speaker>URGANDE ET ALQUIF sous un riche pavillon.</speaker>
  <l id="1">…</l>
  [etc.]
</sp>

If we choose to automatically split the original speaker string in two at all occurrences of " ET ", we should make extra sure we're not overdoing it. There might be other reasons for " ET " in that string. If we have that under control, let's do it as proposed.
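
One way to keep the split conservative is sketched below in Python. This is only an illustration, not the conversion code actually used: the function name `split_joint_speaker`, the optional `known_names` set (e.g. names taken from the play's cast list) and the "LE ROI ET SA SUITE" example are assumptions made for this sketch.

```python
import re


def split_joint_speaker(who, known_names=None):
    """Split 'A ET B' into ['A', 'B'], leaving the string alone when unsure.

    known_names is an optional set of single-character names (hypothetical
    input, e.g. from the cast list). If given, a split is accepted only when
    every resulting part is a known name.
    """
    original = who.strip()
    # Split only on a standalone " ET " so that names containing "ET"
    # (e.g. "JULIETTE") are never touched.
    parts = [p.strip() for p in re.split(r"\s+ET\s+", original)]
    if len(parts) < 2 or any(not p for p in parts):
        return [original]
    if known_names is not None and not all(p in known_names for p in parts):
        # " ET " is present for some other reason: do not split.
        return [original]
    return parts
```

With `known_names={"URGANDE", "ALQUIF"}`, the value "URGANDE ET ALQUIF" is split in two, while a string such as "LE ROI ET SA SUITE" would be left untouched and could be logged for manual review.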

lucagiovannini7 commented 1 year ago

In the meantime, if I need to correct some character-related errors in the plays I'm using, should I just change the who attributes in the TEI files in the correction branch of the Theatre Classique repo (since the particDescs are apparently created later in the workflow)?

lucagiovannini7 commented 3 months ago

Update: through this notebook, I tried to compute the actual number of "glued-together" characters.

  1. Character pairs linked with -et-: 101 (0.65% of items in FreDraCor). I checked all of them manually and they all need to be split ("#melpomene-et-euterpe" --> "#melpomene #euterpe").

  2. Character pairs linked with an underscore: 124 (0.79% of items in FreDraCor). I then filtered out the rare cases in which the underscore is used as whitespace (e.g. "le_marquis"). All other items need to be split ("#claudine_blaise" --> "#claudine #blaise").
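
The two rewrite rules above could be sketched roughly as follows. This is an assumption-laden illustration: the function name `split_who` and the `KEEP_UNDERSCORE` whitelist are made up for this sketch, and the whitelist would have to be filled from the manual check.

```python
# Hypothetical whitelist: ids in which "_" stands for a space, not a pair.
KEEP_UNDERSCORE = {"le_marquis"}


def split_who(who, keep=KEEP_UNDERSCORE):
    """Rewrite a who value so that glued pairs become separate pointers."""
    out = []
    for ref in who.split():
        ident = ref.lstrip("#")
        if "-et-" in ident:
            parts = ident.split("-et-")
        elif "_" in ident and ident not in keep:
            parts = ident.split("_")
        else:
            parts = [ident]
        # Re-add the "#" prefix to every resulting id.
        out.extend("#" + p for p in parts if p)
    return " ".join(out)
```

An id that mixes both uses of the underscore (a whitespace underscore inside one half of a pair) is not handled by this sketch and would still need a manual fix.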

These errors seem easy to fix, but many other, subtler errors remain (e.g. "Les Furies" vs. "Les trois Furies" in Quinault's Proserpine). For this task, one could maybe try computing the Levenshtein distance within each cast list, or something along these lines. Anyway, fixing these pairs would already be some progress. I put all the instances I found (with the DraCor slug of the play they come from) in pairs.txt. @cmil do you think you can implement this?
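
A rough sketch of the Levenshtein idea, in plain Python without external libraries. The function names, the `max_ratio` threshold of 0.4 and the normalisation by the longer name are assumptions chosen for illustration; they have not been tuned on the corpus.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def near_duplicates(names, max_ratio=0.4):
    """Return (a, b, distance) for name pairs that look suspiciously alike.

    A pair is flagged when distance / length of the longer name is at most
    max_ratio (an arbitrary threshold for this sketch).
    """
    flagged = []
    for a, b in combinations(names, 2):
        dist = levenshtein(a.casefold(), b.casefold())
        if dist / max(len(a), len(b), 1) <= max_ratio:
            flagged.append((a, b, dist))
    return flagged
```

Run on a cast list containing "Les Furies" and "Les trois Furies", this flags the pair with distance 6. The output would only be a list of candidates for manual review, not something to merge automatically.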

cmil commented 22 hours ago

@DanilSko, @lucagiovannini7, @lehkost The above pull request should now fix most of these issues. It's a combination of programmatic changes and corrections to the sources like this one: https://github.com/dracor-org/theatre-classique/commit/5e249f636621746c85677370ab7984b90adc2ae7.