ColinFay / proustr

Tools for Natural Language Processing in French and texts from Marcel Proust's collection "A La Recherche Du Temps Perdu"
http://proustr.colinfay.me/

tokenization in French #11

Open lvaudor opened 2 years ago

lvaudor commented 2 years ago

Hi Colin,

I'm using tidytext for tokenization, but I'm running into problems with French texts. For instance, elided forms such as "L'achat" or "j'ai" are kept as single tokens instead of being split at the apostrophe as they should be. In a tidytext issue, you mentioned that you were working on a tokenizer that would handle French well, and I got the impression that it was intended for the proustr package. Could you tell me more about it?
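
To make the expected behaviour concrete, here is a rough sketch of the kind of splitting I have in mind. It is written in plain Python only to illustrate the rule; it is not tidytext or proustr code. The `tokenize_fr` name and the `CLITICS` list are assumptions made up for this example, and the list is certainly not exhaustive.

```python
import re

# Hypothetical, non-exhaustive list of words that elide before a vowel.
CLITICS = {"l", "j", "d", "m", "t", "s", "n", "c", "qu",
           "jusqu", "lorsqu", "puisqu"}


def tokenize_fr(text):
    """Lowercase, split into words, and detach elided clitics (l', j', ...)."""
    # Treat the typographic apostrophe like the straight one.
    text = text.lower().replace("\u2019", "'")
    tokens = []
    # A word is a run of word characters, optionally joined by apostrophes.
    for word in re.findall(r"\w+(?:'\w+)*", text):
        head, sep, tail = word.partition("'")
        if sep and head in CLITICS:
            # Keep the apostrophe on the clitic: "l'achat" -> "l'", "achat".
            tokens.extend([head + "'", tail])
        else:
            # No elision (or a lexical apostrophe, as in "aujourd'hui").
            tokens.append(word)
    return tokens


print(tokenize_fr("L'achat que j'ai fait aujourd'hui"))
# ["l'", 'achat', 'que', "j'", 'ai', 'fait', "aujourd'hui"]
```

The point of checking against a clitic list, rather than splitting on every apostrophe, is that words like "aujourd'hui" should stay in one piece.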

Cheers, Lise