Open kmelve opened 9 years ago
Hi @kmelve,
Sorry for the slow response, I was traveling and away from email. You're hitting this problem because I was using a really simple, sledgehammer approach to tokenization - just matching [a-z]+ patterns in the source text, which works well enough for English, but not for non-ASCII characters.
I just pushed a fix for this on the feature/unicode branch, which will now consider any series of non-digit, non-punctuation characters to be a word - this should work with the Norwegian characters.
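To illustrate the difference, here is a minimal sketch of the two approaches. It is not the actual textplot code: the names ASCII_WORD, UNICODE_WORD, and tokenize are made up for this example, and the Unicode pattern is just one way to express "letters only" with Python's re module.

```python
import re

# Hypothetical names, for illustration only (not taken from textplot).
ASCII_WORD = re.compile(r"[a-z]+")        # old approach: ASCII letters only
UNICODE_WORD = re.compile(r"[^\W\d_]+")   # word characters minus digits and underscore

def tokenize(text, pattern):
    """Lowercase the text and return every match of the given pattern."""
    return pattern.findall(text.lower())

sample = "Blåbærsyltetøy på brød"

# The ASCII pattern splits words wherever æ, ø, or å appears.
print(tokenize(sample, ASCII_WORD))

# The Unicode-aware pattern keeps each word intact.
print(tokenize(sample, UNICODE_WORD))
```

With the ASCII-only pattern, a word is broken at every æ, ø, or å and those characters are dropped, which matches the symptom in the .gml output.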
To try it, just install the project from source:
git clone https://github.com/davidmcclure/textplot.git
cd textplot
pyvenv env
. env/bin/activate
And then check out the branch:
git checkout -b feature/unicode origin/feature/unicode
pip install -r requirements.txt
python setup.py develop
And give it a spin. If it works, I'll merge this into master and cut a new release. Thanks for bringing this to my attention!
Love this app. That being said, I have some challenges with the CLI tool. Norwegian characters like æ, ø, and å seem to disappear in the .gml file. Python 3 should support UTF-8 out of the box, and I'm not sure how to go about troubleshooting this. Any ideas?