kauttoj closed this issue 5 years ago
Answering my own question. The problem is with Windows (as usual), not Docker or the TurkuNLP image. To properly process Scandinavian characters (ä, ö) in Windows 10 PowerShell, the following seems to work.
First, in PowerShell, change the default encoding to UTF-8 with the command:
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
Then, instead of using "cat" to feed in the text, use:
Get-Content -Encoding utf8 input_text.txt | docker run -i turkunlp/turku-neural-parser:latest-fi-en-sv-cpu stream fi_tdt parse_plaintext > output_text.txt
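An alternative that sidesteps the shell's pipe encoding altogether is to hand the bytes to docker from a small script. The following is only a minimal sketch, not something tested in this thread: the helper name `parse_file` is hypothetical, the docker arguments are copied from the command above, and it assumes Python 3 and docker are both on the PATH.

```python
import subprocess
from pathlib import Path

# Image name and arguments are copied from the command in this thread.
DOCKER_CMD = [
    "docker", "run", "-i",
    "turkunlp/turku-neural-parser:latest-fi-en-sv-cpu",
    "stream", "fi_tdt", "parse_plaintext",
]


def parse_file(input_path, output_path, cmd=DOCKER_CMD):
    """Hypothetical helper: pipe a UTF-8 file through cmd without a shell.

    The input is sent as raw UTF-8 bytes and stdout is saved unchanged,
    so no console code page is involved at any point.
    """
    # Decode and re-encode so that a non-UTF-8 input file fails loudly here
    # instead of producing mangled output later.
    data = Path(input_path).read_text(encoding="utf-8").encode("utf-8")
    result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
    Path(output_path).write_bytes(result.stdout)
```

Calling `parse_file("input_text.txt", "output_text.txt")` would then stand in for the pipeline above. The `cmd` parameter exists only so that the helper can be tried with a harmless command before running the real image.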
Thanks for reporting the issue and a solution for it! :)
Good job @kauttoj - sorry for no reply - superbusy times on our side. We could try to modify the docker images to hard-code utf-8 on input or something. We don't actually have a Windows machine to test this on, though. :rofl: I will reference this issue in the docs. Thank you!
Firstly, thanks for the new pipeline. Great work. I'm running the pipeline on Windows using the Docker image (latest version as of 6.6.2019) with the command:
cat input_text.txt | docker run -i turkunlp/turku-neural-parser:latest-fi-en-sv-cpu stream fi_tdt parse_plaintext > output_text.txt
It runs fine (no errors), but it does not handle Scandinavian characters (ä and ö): they are replaced with question marks, e.g., Pit??k? (was Pitääkö) and hy?ty? (was hyötyä). Changing the encoding of the input file (e.g., UTF-8 or UCS-2) does not seem to make a difference. How can I process text with Scandinavian characters?
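The question marks are consistent with the text being re-encoded to ASCII somewhere between the file and the container. That cause is an assumption here, not something confirmed in this thread. A small Python sketch (the function name `to_ascii_lossy` is made up for illustration) reproduces the same pattern:

```python
def to_ascii_lossy(text):
    """Encode to ASCII, replacing every unencodable character with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


# Each ä or ö is a single non-ASCII character, so each becomes one '?'.
print(to_ascii_lossy("Pitääkö"))
print(to_ascii_lossy("hyötyä"))
```

Because the replacement happens before the parser ever sees the text, changing the encoding of the file on disk alone would not help under this assumption.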