emanjavacas / pie

A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages.
MIT License

Propose a continuous sentence tokenizer #21

Closed PonteIneptique closed 5 years ago

PonteIneptique commented 5 years ago

The general idea is the following: some historical languages lack punctuation (Old French, Classical Latin); punctuation marks are mostly editors' decisions. It could be, as with the LASLA dataset, that you find yourself without punctuation in the training data. In this case, it would be useful to be able to provide a sentence size and have repeated edges.

I.e.:

sentence = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

continuous_tokenizer(sentence, words=5, repeating_edge=1)

would result in

[
  ('Lorem', 'ipsum', 'dolor', 'sit', 'amet,'),
  ('amet,', 'consectetur', 'adipiscing', 'elit,', 'sed'),
  ('sed', 'do', 'eiusmod', 'tempor', 'incididunt'),
  # etc.
]
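
Roughly, something like the following (just a sketch of the idea, not actual pie code; only the name continuous_tokenizer and its parameters come from the example above):

def continuous_tokenizer(sentence, words=5, repeating_edge=1):
    # Naive whitespace tokenization; punctuation stays attached to the tokens.
    tokens = sentence.split()
    # Each window advances by (words - repeating_edge) tokens, so the last
    # repeating_edge tokens of a window reappear at the start of the next one.
    step = words - repeating_edge
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tuple(tokens[start:start + words]))
        if start + words >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows

Calling continuous_tokenizer(sentence, words=5, repeating_edge=1) on the sentence above would then produce the list of windows shown.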

And the tagger could take the tokenizer as a parameter to avoid repeating the lemmatized token, as in https://github.com/chartes/deucalion-model-lasla/blob/master/flaskapp.py#L61-L85 (see the sketch below).
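
Something along these lines (again just a sketch; tag_window is a hypothetical placeholder for whatever call actually tags one window in pie):

def tag_continuous(sentence, tag_window, words=5, repeating_edge=1):
    output = []
    for i, window in enumerate(continuous_tokenizer(sentence, words, repeating_edge)):
        tagged = tag_window(window)            # e.g. a list of (token, lemma) pairs
        if i > 0:
            tagged = tagged[repeating_edge:]   # skip tokens already emitted by the previous window
        output.extend(tagged)
    return output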

emanjavacas commented 5 years ago

Hi, sorry, I only just noticed this now!

My thoughts on tokenization are these: ideally, tokenization is already done. I'd assume people know what tokenization the training data used, and that their test data follows the same kind of tokenization. The only reason I put a tokenizer in the tagger was a small project of mine where I needed to apply some quick tokenization. I think that if we start adding different types of tokenization to the tagger, that would confuse future users. That said, it's probably not too bad to let Tagger take a custom tokenizer if needed. But any other preprocessing I'd keep out of pie.

I am saying this without properly understanding what you propose, so maybe I am wrong about your suggestion. Assuming there was no punctuation in the training data, why would repeating the last word at the beginning of the next sentence help? Couldn't you just strip all punctuation from the test data?

PonteIneptique commented 5 years ago

I definitely see why it should not be here :) I am going to have a look at it, and if my feeling is right, I'll tell you.