davidmcclure / textplot

(Mental) maps of texts with kernel density estimation and force-directed networks.
MIT License

Issue with Scandinavian characters #2

Open kmelve opened 9 years ago

kmelve commented 9 years ago

Love this app. That said, I'm having some trouble with the CLI tool: Norwegian characters like æ ø å seem to disappear in the .gml file. Python 3 should support UTF-8 out of the box, and I'm not sure how to go about troubleshooting this. Any ideas?

davidmcclure commented 9 years ago

Hi @kmelve,

Sorry for the slow response, I was traveling and away from email. You're hitting this problem because I was using a really simple, sledgehammer approach to tokenization: just matching [a-z]+ patterns in the source text, which works well enough for English but drops any non-ASCII characters.
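To make the failure concrete, here is a minimal sketch of that kind of tokenizer. It is an illustration, not textplot's actual code: the function name `tokenize_ascii` and the sample sentence are made up for this example.

```python
import re

# Hypothetical sample sentence, not taken from the original corpus.
text = "Blåbærsyltetøy er godt på brød"


def tokenize_ascii(text):
    """Lowercase the text and keep only runs of the ASCII letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


tokens = tokenize_ascii(text)
print(tokens)
```

Every æ, ø, or å acts as a word boundary here, so a word containing them is split into ASCII-only fragments and the characters themselves never reach the output.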

I just pushed a fix for this on the feature/unicode branch, which now treats any run of non-digit, non-punctuation characters as a word, so it should work with the Norwegian characters.
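One way to express "non-digit, non-punctuation" in Python is sketched below. The pattern `[^\W\d_]+` and the NFC normalization step are assumptions chosen for illustration, and the branch may use a different expression; `tokenize_unicode` is a hypothetical name.

```python
import re
import unicodedata


def tokenize_unicode(text):
    """Return lowercase runs of Unicode letters.

    [^\\W\\d_] means "a word character that is neither a digit nor an
    underscore", which leaves letters, including non-ASCII ones.
    """
    # Assumption: normalize to NFC so that letters typed as a base
    # character plus a combining mark are treated as a single letter.
    normalized = unicodedata.normalize("NFC", text)
    return re.findall(r"[^\W\d_]+", normalized.lower())


# Hypothetical sample sentence with digits and punctuation mixed in.
words = tokenize_unicode("Blåbærsyltetøy er godt på brød, 2 ganger!")
print(words)
```

In Python 3, `str` patterns are Unicode-aware by default, so no extra flag is needed for `\W` and `\d` to cover non-ASCII letters and digits.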

To try it, just install the project from source:

git clone https://github.com/davidmcclure/textplot.git
cd textplot
python3 -m venv env
. env/bin/activate

And then check out the branch:

git checkout -b feature/unicode origin/feature/unicode
pip install -r requirements.txt
python setup.py develop

And give it a spin. If it works, I'll merge this into master and cut a new release. Thanks for bringing this to my attention!