TurkuNLP / Turku-neural-parser-pipeline

A neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization with pre-trained models for more than 50 languages. Top ranker in the CoNLL-18 Shared Task.
https://turkunlp.github.io/Turku-neural-parser-pipeline/
Apache License 2.0

Text encoding format for Finnish (scandic characters) #21

Closed kauttoj closed 5 years ago

kauttoj commented 5 years ago

First of all, thanks for the new pipeline. Great work. I'm running the pipeline on Windows using the Docker image (latest version as of 6.6.2019) with the command:

cat input_text.txt | docker run -i turkunlp/turku-neural-parser:latest-fi-en-sv-cpu stream fi_tdt parse_plaintext > output_text.txt

It runs without errors, but it does not handle scandic characters (ä and ö): each one is replaced with a question mark, e.g., Pit??k? (was Pitääkö) and hy?ty? (was hyötyä). Changing the encoding of the input file (e.g., UTF-8 or UCS-2) does not seem to make a difference. How can I process text with scandic characters?
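The pattern in the garbled output, one question mark per ä or ö, is what a lossy conversion to ASCII would produce, so the text may be getting re-encoded somewhere before it reaches the parser. A minimal Python sketch that reproduces the symptom; it is only an illustration, not part of the pipeline, and `lossy_ascii` is a hypothetical helper name:

```python
def lossy_ascii(text):
    """Encode text to ASCII, substituting '?' for every character ASCII cannot represent."""
    return text.encode("ascii", errors="replace").decode("ascii")


# The two example words from the report.
print(lossy_ascii("Pit\u00e4\u00e4k\u00f6"))  # Pitääkö
print(lossy_ascii("hy\u00f6ty\u00e4"))        # hyötyä
```

If the output of this sketch matches what ends up in output_text.txt, the characters are most likely being lost in an encoding step rather than inside the parser itself.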

kauttoj commented 5 years ago

Answering my own question: the problem is with Windows (as usual), not Docker or the TurkuNLP image. To process scandic characters properly in Windows 10 PowerShell, the following seems to work.

First, change the default encoding in PowerShell to UTF-8 with the command:

$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8

Then, instead of using "cat" to read the input text, use:

Get-Content -Encoding utf8 input_text.txt | docker run -i turkunlp/turku-neural-parser:latest-fi-en-sv-cpu stream fi_tdt parse_plaintext > output_text.txt
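An alternative that might sidestep the shell's encoding settings altogether is to hand the file to the container as raw bytes from a small script. The following is an untested sketch, assuming Python 3 is installed on the host and that the re-encoding happens in the shell pipe; `parse_file` and `PARSER_CMD` are hypothetical names, and the docker command is the one quoted above:

```python
import subprocess
from pathlib import Path

# Docker command copied from the report above.
PARSER_CMD = [
    "docker", "run", "-i",
    "turkunlp/turku-neural-parser:latest-fi-en-sv-cpu",
    "stream", "fi_tdt", "parse_plaintext",
]


def parse_file(input_path, output_path, command=PARSER_CMD):
    """Send a UTF-8 text file to the parser as raw bytes and save the raw output."""
    data = Path(input_path).read_bytes()
    data.decode("utf-8")  # raises UnicodeDecodeError early if the input is not valid UTF-8
    # Bytes go to stdin and come back from stdout unchanged; no shell is involved.
    result = subprocess.run(command, input=data, stdout=subprocess.PIPE, check=True)
    Path(output_path).write_bytes(result.stdout)
```

Calling `parse_file("input_text.txt", "output_text.txt")` would then take the place of the piped command. The `command` parameter exists only so that the function can be tried with some other program in place of Docker.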

jmnybl commented 5 years ago

Thanks for reporting the issue and a solution for it! :)

fginter commented 5 years ago

Good job @kauttoj - sorry for no reply - superbusy times on our side. We could try to modify the docker images to hard-code utf-8 on input or something. We don't actually have a Windows machine to test this on, though. :rofl: I will reference this issue in the docs. Thank you!