Closed miau1 closed 5 years ago
Could you show the flags you used to initialize opus_reader?
My your_script.py:
import opustools_pkg
opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-s", "en", "-t", "ta", "-wm" "tmx"])
opus_reader.printPairs()
You are missing a comma between "-wm" and "tmx", but I don't think that's causing the error here. I suspect this is an issue with Windows encoding behavior. I'll try to set up a Windows environment and see if I can replicate the error. In the meantime, you could try running your script in a Unix-like environment, if you have access to one.
It was the comma indeed! Thank you so much for spotting this!
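For anyone hitting the same thing: Python joins adjacent string literals before the list is built, so "-wm" "tmx" silently becomes the single argument "-wmtmx" instead of two separate ones. A minimal sketch in plain Python (it does not call opustools_pkg; the variable names are only illustrative):

```python
# Adjacent string literals are concatenated by Python, with no error or warning.
broken = ["-d", "JW300", "-s", "en", "-t", "ta", "-wm" "tmx"]
fixed = ["-d", "JW300", "-s", "en", "-t", "ta", "-wm", "tmx"]

print(broken[-1])               # -wmtmx
print(len(broken), len(fixed))  # 7 8
```

With the comma in place, "-wm" and "tmx" reach the argument parser as a flag and its value.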
I think there is still a problem, maybe only on Windows. I have not been able to test on Linux yet.
This is my script:
import opustools_pkg
opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-f", "-s", "en", "-t", "ta", "-wm", "tmx", "-w", "enta.tmx"])
opus_reader.printPairs()
and the Error I get is:
Traceback (most recent call last):
  File "your_script.py", line 3, in <module>
    opus_reader.printPairs()
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 350, in printPairs
    lastline = self.readAlignment(gzipAlign)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 308, in readAlignment
    lastline = self.outputPair(self.par, line)[1]
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 251, in outputPair
    self.sendPairOutput(wpair)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 210, in sendPairOutput
    self.resultfile.write(wpair[0])
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1264.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 126-132: character maps to <undefined>
and the TMX looks like this:
<?xml version="1.0" encoding="utf-8"?>
The Beauty of Bovine Design “ DAD , today our schoolteacher said that a cow has four stomachs , which it has developed by a process of evolution .
@gertva I haven't been able to set up a Windows environment, but I changed something in the program: I now specify the encoding of the result files, which might help with your issue. Upgrade opustools_pkg to version 0.0.43, run your script again and let me know if it works.
I'll probably get my hands on a Windows machine by tomorrow, so I'll be able to do proper debugging.
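A rough sketch of what specifying the encoding changes. This is not the actual opustools_pkg code, just an illustration under the assumption that the result file was previously opened without an encoding argument, so that Windows fell back to the cp1252 codec named in the traceback. The file name and the sample string are made up.

```python
import os
import tempfile

sample = "வணக்கம்"  # illustrative Tamil text, not taken from JW300

# cp1252 has no mapping for Tamil characters, so encoding fails.
try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing encoding="utf-8" makes the write independent of the platform default.
path = os.path.join(tempfile.mkdtemp(), "enta_demo.tmx")
with open(path, "w", encoding="utf-8") as result_file:
    result_file.write(sample)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as result_file:
    round_trip = result_file.read()

print(cp1252_ok, round_trip == sample)
```

On Linux and macOS the default is usually already UTF-8, which would explain why the error only showed up on Windows.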
works like magic now! thank you so much.
gert
Great, good to hear!
A question: if I want to filter on language, is this the correct construct?
opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-f", "-ln", "-s",
"en", "-t", "ta", "-wm", "tmx", "-w", "enta.tmx", "src_cld2", "en", "0.95", "trg_cld2", "ta", "0.95", "src_langid", "en", "0.95", "trg_langid", "ta", "0.95"])
My output file is always the same size, so I wonder if it is correct.
Thanks for your help.
Almost, you have to include -- before the language id flags. You can also use split() to make the argument list formation a little easier, like this:
opus_reader = opustools_pkg.OpusRead("-d JW300 -f -ln -s en -t ta -wm tmx -w enta.tmx --src_cld2 en 0.95 --trg_cld2 ta 0.95 --src_langid en 0.95 --trg_langid ta 0.95".split())
But for this to work, you have to add the language ids to the zip files yourself, as they don't yet include them by default. First you need to install pycld2 and langid:
pip install pycld2
pip install langid
Then you can run a script like this to create the language ids:
from opustools_pkg.opus_langid import OpusLangid
OpusLangid("-f JW300_latest_xml_en.zip -v".split()).processFiles()
OpusLangid("-f JW300_latest_xml_ta.zip -v".split()).processFiles()
And then you can filter by language ids.
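On the split() tip above: a small check in plain Python that the split() form yields the same argument list as writing it out by hand. It only builds the lists and does not call OpusRead, and the shortened flag string is just for illustration.

```python
# str.split() with no arguments splits on any run of whitespace.
args_from_split = "-d JW300 -f -s en -t ta -wm tmx -w enta.tmx --src_cld2 en 0.95".split()

args_by_hand = ["-d", "JW300", "-f", "-s", "en", "-t", "ta", "-wm", "tmx",
                "-w", "enta.tmx", "--src_cld2", "en", "0.95"]

print(args_from_split == args_by_hand)  # True
```

One caveat: split() breaks on every space, so an output path containing spaces would be cut into several arguments. In that case, writing the list by hand or using shlex.split() from the standard library with a quoted path is safer.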
thanks for explaining so clearly.
not sure if I can ask questions here, but I got stuck on this when I try to download a TMX from the OPUS JW300 set. It has nothing to do with the set, I think.
I have no idea how to fix this. Your help is highly appreciated.
Originally posted by @gertva in https://github.com/Helsinki-NLP/OpusTools/issues/3#issuecomment-526506417