dwhieb closed 3 years ago
Given my previous experience with deep learning, I really fear that the model is going to overfit on lines starting with "Â"/"A". Maybe the tesseract folks have already accounted for this, but unshuffled data is a training nightmare for any kind of supervised learning system that uses stochastic, iterative weight optimization (i.e., stochastic gradient descent and its variants).
BASICALLY: could you grab pages with more diverse lines? If a majority of lines start with "ÂKW", then most naïve machine learning algorithms will assume most lines start with "ÂKW", which would... not be great when you start getting into the "K" section!
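To make that concrete, here is a rough sketch of how one might check how lopsided a set of transcribed lines is, and shuffle it before training. This is only an illustration under assumptions: the function names are hypothetical, the sample strings are placeholders rather than real dictionary entries, and I don't know whether tesseract's training tools already shuffle on their own.

```python
import random
import unicodedata
from collections import Counter


def prefix_counts(lines, width=1):
    """Count how many non-empty lines start with each prefix of `width` characters."""
    counts = Counter()
    for line in lines:
        # Normalize to NFC so "Â" is one code point rather than "A" + combining accent.
        text = unicodedata.normalize("NFC", line).strip()
        if text:
            counts[text[:width]] += 1
    return counts


def dominant_prefix_share(lines, width=1):
    """Return (most common prefix, fraction of non-empty lines that start with it)."""
    counts = prefix_counts(lines, width)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    prefix, count = counts.most_common(1)[0]
    return prefix, count / total


def shuffled(lines, seed=0):
    """Return a shuffled copy of the lines; the input list is left untouched."""
    result = list(lines)
    random.Random(seed).shuffle(result)
    return result


# Placeholder strings only, NOT real entries from the dictionary.
sample = ["ÂKW one", "ÂKW two", "ÂKW three", "K four", "M five"]

prefix, share = dominant_prefix_share(sample)
training_order = shuffled(sample, seed=42)
```

If `share` comes out close to 1.0, nearly every line starts the same way. Shuffling only fixes the ordering problem, so a set like that would still need lines pulled from other sections of the dictionary.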
@eddieantonio Oh man, great insight - thanks! We're going to try using the regular French OCR on the Cree words first and see what kind of results that gets us (since all of the Cree letters are contained within the French alphabet). If that turns out to be really accurate, great - we'll just go with that. If not, I'll work with Daniel to create a more diverse set of training data.
This PR adds manual transcriptions and notes on the Lacombe dictionary from Daniel Dacanay.