dh-network / quadriga

DH network's repo for quadriga-related stuff

Minor didactic suggestion: Clarify the difference between different characters that represent the same letter #5

Open Philipp-Schneider opened 5 months ago

Philipp-Schneider commented 5 months ago

From a didactic perspective, I would suggest briefly explaining (or linking to an explanation of) how character sets work. More concretely, a lack of this knowledge might cause confusion in at least two places:

1) In FS_1_MVP_Data_Input_Homogenisation.ipynb, section 2.1, the OCR recognizes the long s ("ſ") as a different character than "s". When creating the ground truth for the OCR, it could be made clear (maybe in section 2.1.1) that the choice between the two characters is an important modeling decision with consequences for all further processing steps as well as for the analysis.

2) Every time a user enters a string to analyse the texts (e.g. in the word frequencies diagram in FS_1_MVP_Analysis_Prototype_101.ipynb), case sensitivity matters. It might not be clear to all users that "grippe" and "Grippe" are two different strings and therefore yield different results.
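
To illustrate both points, here is a minimal sketch that is not taken from the notebooks. The token list is made up for demonstration and `describe` is a hypothetical helper; only the Python standard library is used. It shows that "ſ" and "s" are separate Unicode code points and that exact string matching keeps spelling and case variants apart:

```python
import unicodedata
from collections import Counter

long_s = "ſ"   # the long s as it appears in historical prints
round_s = "s"  # the modern round s


def describe(ch):
    """Hypothetical helper: return the code point and Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(long_s))   # U+017F LATIN SMALL LETTER LONG S
print(describe(round_s))  # U+0073 LATIN SMALL LETTER S
print(long_s == round_s)  # False: two different characters for the computer

# Made-up tokens (an assumption for illustration, not data from the corpus).
tokens = ["Grippe", "grippe", "Grippe", "Waſſer", "Wasser"]
counts = Counter(tokens)

# Exact matching treats every variant as its own string.
print(counts["Grippe"], counts["grippe"])  # 2 1
print(counts["Waſſer"], counts["Wasser"])  # 1 1
```

A query for "grippe" would therefore miss the two occurrences of "Grippe", and a query for "Wasser" would miss "Waſſer".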

hsluytergaethje commented 1 week ago

Thank you. To 1): At the moment, the historical orthography of the long s is discussed briefly in the section on rule-based post-correction. To 2): We're still working on getting the text input to work in the Jupyter Book as well. But yes, we should add a note on case sensitivity to the analysis notebook as well as to the NLP notebook.
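
One way such a note could be made concrete is to show normalisation as an explicit, switchable step. The following is only a sketch under stated assumptions: `normalise` and its parameters `fold_case` and `map_long_s` are hypothetical names and not part of the existing notebooks, and the tokens are invented for illustration.

```python
from collections import Counter


def normalise(token, fold_case=True, map_long_s=True):
    """Hypothetical helper: optionally map the long s to s and lowercase a token."""
    if map_long_s:
        token = token.replace("ſ", "s")
    if fold_case:
        # str.lower() leaves "ſ" and "ß" unchanged, so the two options stay
        # independent; str.casefold() would also turn "ſ" into "s" and "ß" into "ss".
        token = token.lower()
    return token


# Invented example tokens (an assumption, not corpus data).
tokens = ["Grippe", "grippe", "Waſſer", "Wasser"]

merged = Counter(normalise(t) for t in tokens)
print(merged["grippe"], merged["wasser"])  # 2 2

# Keeping the long s distinct while still ignoring case:
print(normalise("Waſſer", map_long_s=False))  # waſſer
```

Exposing both switches keeps the modeling decision visible to the user rather than silently making it for them.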