Open Philipp-Schneider opened 5 months ago
Thank you. To 1): At the moment, the historical orthography of the long s is discussed briefly in the section on rule-based post-correction. To 2): We're still working on getting the text input to work in the Jupyter Book as well. But yes, we should add a note on case sensitivity to the analysis notebook as well as to the NLP notebook.
From a didactic perspective, I would suggest briefly explaining (or linking to an explanation of) how character sets work. More concretely, a lack of this knowledge might cause confusion in at least two places:
1) In FS_1_MVP_Data_Input_Homogenisation.ipynb, section 2.1, the OCR recognizes the long s ("ſ") as a different character than "s". When creating the ground truth for the OCR, it could be made clear (maybe in section 2.1.1) that the choice between the two characters is an important modeling decision that has consequences for all further processing steps as well as for the analysis.
2) Every time a user enters a string to analyse the texts (e.g. in the word frequencies diagram in FS_1_MVP_Analysis_Prototype_101.ipynb), case sensitivity is important. It might not be clear to all users that "grippe" and "Grippe" are two different strings and therefore yield different results.
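
To make both points concrete, something like the following minimal sketch could be shown or linked. It is not taken from the notebooks: the helper `normalize_token` and the sample tokens are hypothetical and only illustrate that "ſ" and "s", like "grippe" and "Grippe", are different code point sequences until they are explicitly normalized.

```python
import unicodedata
from collections import Counter

LONG_S = "\u017f"  # the long s, a separate Unicode code point from "s"

# The two characters have different code points and different names.
print(hex(ord(LONG_S)), unicodedata.name(LONG_S))
print(hex(ord("s")), unicodedata.name("s"))


def normalize_token(token, fold_case=True, map_long_s=True):
    """Hypothetical helper: optionally map long s to s and lowercase the token."""
    if map_long_s:
        token = token.replace(LONG_S, "s")
    if fold_case:
        # str.lower() is used instead of str.casefold() so that "ß" is kept
        # and the long s is only changed by the explicit replacement above.
        token = token.lower()
    return token


# Made-up sample tokens, standing in for OCR output.
tokens = ["Grippe", "grippe", "Grippe", "Wa\u017f\u017fer", "Wasser"]

# Without normalization every spelling variant is counted separately.
raw_counts = Counter(tokens)

# With normalization the variants are merged into one entry each.
normalized_counts = Counter(normalize_token(t) for t in tokens)

print(raw_counts)
print(normalized_counts)
```

Whether to merge the variants or keep them apart is exactly the modeling decision mentioned in 1); the sketch only shows that the choice changes the resulting frequencies.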