Helsinki-NLP / OpusTools

not sure if I can ask questions here, but I got stuck on this when I try to download a TMX from the OPUS JW300 set. It has nothing to do with the set, I think. #4

Closed miau1 closed 5 years ago

miau1 commented 5 years ago

I'm not sure if I can ask questions here, but I got stuck on this when trying to download a TMX from the OPUS JW300 set. I don't think it has anything to do with the set itself.

Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-ta.xml.gz not found. The following files are available for downloading:

   8 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en-ta.xml.gz
 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en.zip
  94 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/ta.zip

 365 MB Total size
Downloading 3 file(s) with the total size of 365 MB. Continue? (y/n) y
JW300_latest_xml_en-ta.xml.gz ... 100% of 8 MB
JW300_latest_xml_en.zip ... 100% of 263 MB
JW300_latest_xml_ta.zip ... 100% of 94 MB
Traceback (most recent call last):
  File "your_script.py", line 3, in <module>
    opus_reader.printPairs()
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 350, in printPairs
    lastline = self.readAlignment(gzipAlign)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 308, in readAlignment
    lastline = self.outputPair(self.par, line)[1]
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 251, in outputPair
    self.sendPairOutput(wpair)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 210, in sendPairOutput
    self.resultfile.write(wpair[0])
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1264.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 83-89: character maps to <undefined>

I have no idea how to fix this. Your help is highly appreciated.

Originally posted by @gertva in https://github.com/Helsinki-NLP/OpusTools/issues/3#issuecomment-526506417

miau1 commented 5 years ago

Could you show the flags you used to initialize opus_reader?

gertva commented 5 years ago

My your_script.py:

import opustools_pkg
opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-s", "en", "-t", "ta", "-wm" "tmx"])
opus_reader.printPairs()

miau1 commented 5 years ago

You are missing a comma between "-wm" and "tmx", but I don't think that's causing the error here. I suspect this is an issue with Windows encoding behavior. I'll try to set up a Windows environment and see if I can replicate the error. In the meantime, you could try running your script in a Unix-like environment, if you have access to one.
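
As a side note on the missing comma: Python joins adjacent string literals at parse time, so the list in the script above ends with one merged item instead of two. The sketch below only shows that list-level effect; how OpusRead then interprets the merged item is not covered here.

```python
# Adjacent string literals are concatenated by the Python parser,
# so a missing comma silently merges two list items into one.
args_missing_comma = ["-d", "JW300", "-s", "en", "-t", "ta", "-wm" "tmx"]
args_with_comma = ["-d", "JW300", "-s", "en", "-t", "ta", "-wm", "tmx"]

print(args_missing_comma[-1])   # -wmtmx
print(len(args_missing_comma))  # 7
print(len(args_with_comma))     # 8
```

No error is raised for the first list, which is why this kind of typo is easy to miss.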

gertva commented 5 years ago

It was the comma indeed! Thank you so much for spotting this!

gertva commented 5 years ago

I think there is still a problem, though, maybe only on Windows. I have not been able to test on Linux yet.

This is my script:

import opustools_pkg
opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-f", "-s", "en", "-t", "ta", "-wm", "tmx", "-w", "enta.tmx"])
opus_reader.printPairs()

and the error I get is:

Traceback (most recent call last):
  File "your_script.py", line 3, in <module>
    opus_reader.printPairs()
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 350, in printPairs
    lastline = self.readAlignment(gzipAlign)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 308, in readAlignment
    lastline = self.outputPair(self.par, line)[1]
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 251, in outputPair
    self.sendPairOutput(wpair)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 210, in sendPairOutput
    self.resultfile.write(wpair[0])
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1264.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 126-132: character maps to <undefined>

and the TMX looks like this:

<?xml version="1.0" encoding="utf-8"?>

The Beauty of Bovine Design “ DAD , today our schoolteacher said that a cow has four stomachs , which it has developed by a process of evolution .

miau1 commented 5 years ago

@gertva I haven't been able to set up a Windows environment, but I changed something in the program: I now specify the encoding of the result files, which might help with your issue. Upgrade opustools_pkg to version 0.0.43, run your script again and let me know if it works.
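
For readers hitting the same traceback: the last frames show the write going through cp1252.py, which has no mapping for Tamil characters. The sketch below illustrates the general mechanism with plain Python file handling. It is not the actual change made in opustools_pkg 0.0.43, and the sample text and file name are made up for the illustration.

```python
import os
import tempfile

# Example Tamil text; cp1252 (a common Windows default) has no mapping for it.
text = "வணக்கம்"

try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit encoding does not depend on the platform default.
path = os.path.join(tempfile.mkdtemp(), "enta_demo.tmx")  # hypothetical file name
with open(path, "w", encoding="utf-8") as resultfile:
    resultfile.write(text)

with open(path, encoding="utf-8") as resultfile:
    round_trip = resultfile.read()

print(cp1252_ok)           # False
print(round_trip == text)  # True
```

Without the encoding argument, open() falls back to the locale's preferred encoding, which on many Windows installations is cp1252.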

I'll probably get my hands on a Windows machine by tomorrow, so I'll be able to do proper debugging.

gertva commented 5 years ago

Works like magic now! Thank you so much.

gert

miau1 commented 5 years ago

Great, good to hear!

gertva commented 5 years ago

A question: if I want to filter on language, is this the correct construct?

opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-f", "-ln", "-s", "en", "-t", "ta", "-wm", "tmx", "-w", "enta.tmx", "src_cld2", "en", "0.95", "trg_cld2", "ta", "0.95", "src_langid", "en", "0.95", "trg_langid", "ta", "0.95"])

My output file is always the same size, so I wonder if it is correct.

Thanks for your help.

miau1 commented 5 years ago

Almost: you have to include -- before the language id flags. You can also use split() to make building the argument list a little easier, like this:

opus_reader = opustools_pkg.OpusRead("-d JW300 -f -ln -s en -t ta -wm tmx -w enta.tmx --src_cld2 en 0.95 --trg_cld2 ta 0.95 --src_langid en 0.95 --trg_langid ta 0.95".split())

But for this to work, you have to add the language ids to the zip files, as they don't yet include them by default. First you need to install pycld2 and langid:

pip install pycld2
pip install langid

Then you can run a script like this to create the language ids:

from opustools_pkg.opus_langid import OpusLangid

OpusLangid("-f JW300_latest_xml_en.zip -v".split()).processFiles()
OpusLangid("-f JW300_latest_xml_ta.zip -v".split()).processFiles()

And then you can filter by language ids.
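
A small sketch of the split() shortcut mentioned above, assuming nothing beyond the standard library: it checks that splitting a flag string gives the same list as writing the items out by hand. The file name containing a space is hypothetical and only there to show where plain split() stops being enough; shlex.split is suggested here as one possible alternative, not something the thread itself recommends.

```python
import shlex

# Splitting on whitespace reproduces the hand-written argument list.
command = "-d JW300 -f -s en -t ta -wm tmx -w enta.tmx"
args = command.split()
explicit = ["-d", "JW300", "-f", "-s", "en", "-t", "ta",
            "-wm", "tmx", "-w", "enta.tmx"]
print(args == explicit)  # True

# A value containing a space (hypothetical file name) is broken apart
# by str.split(), while shlex.split() honours the quotes.
quoted = '-w "en ta.tmx"'
plain_split = quoted.split()
shell_split = shlex.split(quoted)
print(plain_split)  # ['-w', '"en', 'ta.tmx"']
print(shell_split)  # ['-w', 'en ta.tmx']
```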

gertva commented 5 years ago

Thanks for explaining so clearly.
