UAlbertaALTLab / crk-db

Managing the Plains Cree dictionary database
https://itwewina.altlab.app/
GNU General Public License v3.0

add information on Lacombe dictionary #39

Closed (dwhieb closed 3 years ago)

dwhieb commented 3 years ago

This PR adds manual transcriptions and notes on the Lacombe dictionary from Daniel Dacanay.

eddieantonio commented 3 years ago

Given my previous experience with deep learning, I really fear that the model is going to overfit on lines starting with "Â"/"A". Maybe the Tesseract folks have already accounted for this, but unshuffled data is a training nightmare for any supervised learning system that uses stochastic, iterative weight optimization (i.e., stochastic gradient descent and its variants).

BASICALLY: could you grab pages with more diverse lines? If a majority of lines start with "ÂKW", then most naïve machine learning algorithms will assume most lines start with "ÂKW", which would... not be great when you start getting into the "K" section!
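A minimal sketch of how one might check for this skew and shuffle the lines before training, assuming the transcriptions are available as a plain list of strings. The function names and sample strings are hypothetical placeholders, not real dictionary entries and not part of this repo or of Tesseract's tooling:

```python
import random
import unicodedata
from collections import Counter


def first_letter_share(lines):
    """Return the share of non-empty lines that start with each character."""
    # Normalize to NFC so "Â" is one code point rather than "A" + combining accent.
    normalized = [unicodedata.normalize("NFC", line) for line in lines if line]
    counts = Counter(line[0] for line in normalized)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


def shuffled(lines, seed=0):
    """Return a shuffled copy so consecutive samples are not in alphabetical order."""
    rng = random.Random(seed)  # fixed seed keeps the order reproducible
    copy = list(lines)
    rng.shuffle(copy)
    return copy


# Placeholder strings standing in for transcribed lines from a single page.
sample = ["ÂKW a", "ÂKW b", "ÂKW c", "K d"]
share = first_letter_share(sample)
print(share)             # share of lines per initial letter
print(shuffled(sample))  # same lines, randomized order
```

Shuffling only fixes the ordering. If nearly every line comes from the same section, the remedy is still to sample pages from other parts of the dictionary.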

dwhieb commented 3 years ago

@eddieantonio Oh man, great insight - thanks! We're going to try using the regular French OCR on the Cree words first and see what kind of results that gets us (since all of the Cree letters are contained within the French alphabet). If that turns out to be really accurate, great - we'll just go with that. If not, I'll work with Daniel to create a more diverse set of training data.
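One way to judge whether the French OCR output is "really accurate" is to compare it against the manual transcriptions using character error rate. The following is only a sketch under that assumption; the function names and sample strings are hypothetical, and nothing here is taken from the repo:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance divided by the length of the manual transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Placeholder strings: the OCR output dropped a circumflex.
print(character_error_rate("akw", "âkw"))
```

A low rate across a handful of pages from different sections would suggest the stock French model is good enough; a high rate on accented vowels in particular would point toward custom training data.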