Closed PonteIneptique closed 5 years ago
Hi, sorry I just realized about this now!
My thoughts on the tokenization are these. Ideally, tokenization is already done. I'd assume people know what tokenization the training data was in, and that they have more test data with the same kind of tokenization. The only reason I had a tokenizer in the tagger was a little project where I needed to apply some quick tokenization. I think if we start adding different types of tokenization to the tagger, that would confuse future users. That said, it's probably not too bad to let Tagger take a custom tokenizer if needed. But any other preprocessing I'd keep out of pie.
I am saying this without properly understanding what you propose, so maybe I am wrong about your suggestion. Assuming there was no punctuation in the training data, why would repeating the last word at the beginning of the next sentence help? Couldn't you just strip all punctuation from the test data?
I definitely see why it should not be here :) I am gonna have a look at it and if my feeling is right, I'll tell you.
The general idea is the following: some historical languages lack punctuation (Old French, Classical Latin). Punctuation is mostly an editorial decision. It could be, as with the LASLA dataset, that you find yourself without punctuation in the training data. In that case, it would be nice to be able to provide a sentence size and have repeated edge tokens.
I.e., splitting a raw token stream at a fixed sentence size would result in windows that repeat their edge token.
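To make the idea concrete, here is a minimal sketch (not pie's actual API) of windowing a punctuation-less token stream into fixed-size "sentences", repeating the last token of each window at the start of the next so the tagger keeps some left context; the function name and window size are illustrative:

```python
def windows_with_overlap(tokens, size):
    """Split `tokens` into chunks of `size`, repeating each chunk's
    last token as the first token of the following chunk.
    Assumes size >= 2."""
    out = []
    start = 0
    while start < len(tokens):
        out.append(tokens[start:start + size])
        # step by size - 1 so the edge token is repeated
        start += size - 1
    return out

tokens = "arma virumque cano troiae qui primus ab oris".split()
print(windows_with_overlap(tokens, 4))
# [['arma', 'virumque', 'cano', 'troiae'],
#  ['troiae', 'qui', 'primus', 'ab'],
#  ['ab', 'oris']]
```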
And the tagger could take the tokenizer as a parameter, to avoid repeating the lemmatized token in the output, such as in https://github.com/chartes/deucalion-model-lasla/blob/master/flaskapp.py#L61-L85