MagedSaeed / farasapy

A Python implementation of Farasa toolkit
MIT License
112 stars 21 forks

Not compatible with Windows OS #3

Closed hefengxian closed 4 years ago

hefengxian commented 4 years ago

Not compatible with Windows OS

Thank you for your hard work on this lib.

Issues

However, I found several issues on the Windows platform.

So I ran some tests on these issues.

About shlex.split()

The file issue_test.py, located under E:\workspace\python\farasapy\:

from pathlib import Path
import shlex

cur_dir = Path(__file__).parent.absolute()
cmd = f'java -jar {cur_dir}/xxx.jar'
cmd_split = shlex.split(cmd)

print('cur_dir: ', cur_dir)
print('cmd: ', cmd)
print('cmd split parts: ', cmd_split)

Run result:

(venv) E:\workspace\python\farasapy>python .\issue_test.py
cur_dir:  E:\workspace\python\farasapy
cmd:  java -jar E:\workspace\python\farasapy/xxx.jar
cmd split parts:  ['java', '-jar', 'E:workspacepythonfarasapy/xxx.jar']

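After calling shlex.split(), every \ character is missing from the result ['java', '-jar', 'E:workspacepythonfarasapy/xxx.jar']. This happens because shlex.split() defaults to POSIX mode, where an unquoted backslash is treated as an escape character.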
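
The behaviour can be reproduced on any platform with a hard-coded path. This is a minimal sketch: win_dir stands in for cur_dir from issue_test.py, and posix=False is shown only for comparison (it is not the fix used in the PR).

```python
import shlex

# Hard-coded Windows-style path (stand-in for cur_dir in issue_test.py).
win_dir = r'E:\workspace\python\farasapy'
cmd = f'java -jar {win_dir}/xxx.jar'

# Default POSIX mode: each unquoted backslash is consumed as an escape.
posix_parts = shlex.split(cmd)

# Non-POSIX mode: backslashes are kept as ordinary characters.
non_posix_parts = shlex.split(cmd, posix=False)

print(posix_parts)
print(non_posix_parts)
```

Building the argument list by hand avoids the parsing step altogether, which is the approach taken in the solution further down.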

About "interactive" mode

Running in interactive mode produces this error:

  File "E:\workspace\python\machine_classifier\venv\lib\site-packages\farasa\__base.py", line 135, in _run_task_interactive
    output = self.__task_proc.stdout.readline().decode('utf8').strip()

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
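
An error of this kind arises when the bytes being read were not encoded as UTF-8. A minimal sketch, assuming purely for illustration that the text was encoded with the Windows Arabic code page cp1256 (the encoding actually used by the Java process on the reporter's machine is not stated in this issue):

```python
# Hypothetical sample: Arabic text encoded with a non-UTF-8 code page.
raw = 'مرحبا'.encode('cp1256')

try:
    raw.decode('utf8')
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print('decode failed:', err.reason)

# Decoding with the codec that produced the bytes round-trips correctly.
round_trip = raw.decode('cp1256')
```

The producer and the consumer have to agree on one encoding, which is why the fix below makes the Java side emit UTF-8.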

About "standalone" mode

Running tests.py in the farasapy project gives an error like this:

----------------------------------------
Farasa features, noninteractive mode.
----------------------------------------
perform system check...
check java version...
Your java version is 1.8 which is compatiple with Farasa
check toolkit binaries...
Dependencies seem to be satisfied..
task [SEGMENT] is initialized in STANDALONE mode...
error occured!
return code: 1
Traceback (most recent call last):
  File ".\tests.py", line 25, in <module>
    segmented = segmenter.segment(sample)
  File "E:\workspace\python\farasapy\farasa\segmenter.py", line 8, in segment
    return self._do_task(text=text)
  File "E:\workspace\python\farasapy\farasa\__base.py", line 160, in _do_task
    return self._do_task_standalone(strip_text)
  File "E:\workspace\python\farasapy\farasa\__base.py", line 153, in _do_task_standalone
    return self._run_task(btext=byted_strip_text)
  File "E:\workspace\python\farasapy\farasa\__base.py", line 128, in _run_task
    raise Exception('Internal Error occured!')
Exception: Internal Error occured!

This gives no useful information, so I modified the function _run_task():

def _run_task(self, btext):
    assert btext is not None
    with tempfile.NamedTemporaryFile(dir=f'{self.__base_dir}/tmp',) as itmp,\
                tempfile.NamedTemporaryFile(dir=f'{self.__base_dir}/tmp',) as otmp:
        itmp.write(btext)
        itmp.flush() # https://stackoverflow.com/questions/46004774/python-namedtemporaryfile-appears-empty-even-after-data-is-written
        proc = subprocess.run(self.__APIs[self.task]+['-i',itmp.name,'-o',otmp.name],\
                                capture_output=True)
        print(proc.stdout, proc.stderr)
        if proc.returncode == 0:
            return otmp.read().decode('utf8').strip()
        else:
            print("error occured!",otmp.read().decode('utf8').strip())
            print("return code:",proc.returncode)
            raise Exception('Internal Error occured!')

After adding the line print(proc.stdout, proc.stderr), I got this error from Java:

b'' b'Initializing the system ....\rSystem ready!               \r\nException in thread "main" java.io.FileNotFoundException: E:\\workspace\\python\\farasapy\\farasa\\tmp\\tmpltwyiodu (\xe5\x8f\xa6\xe4\xb8\x80\xe4\xb8\xaa\xe7\xa8\x8b\xe5\xba\x8f\xe6\xad\xa3\xe5\x9c\xa8\xe4\xbd\xbf\xe7\x94\xa8\xe6\xad\xa4\xe6\x96\x87\xe4\xbb\xb6\xef\xbc\x8c\xe8\xbf\x9b\xe7\xa8\x8b\xe6\x97\xa0\xe6\xb3\x95\xe8\xae\xbf\xe9\x97\xae\xe3\x80\x82)\r\n\tat java.io.FileInputStream.open0(Native Method)\r\n\tat java.io.FileInputStream.open(FileInputStream.java:195)\r\n\tat java.io.FileInputStream.<init>(FileInputStream.java:138)\r\n\tat com.qcri.farasa.segmenter.TestCase.openFileForReading(TestCase.java:633)\r\n\tat com.qcri.farasa.segmenter.TestCase.processFile(TestCase.java:128)\r\n\tat com.qcri.farasa.segmenter.TestCase.main(TestCase.java:105)\r\n'

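Java cannot open the temp file. The escaped bytes in the message are the Chinese Windows error text meaning "The process cannot access the file because it is being used by another process." So I checked the Python docs https://docs.python.org/3/library/tempfile.html: on Windows, a file created by tempfile.NamedTemporaryFile() with the default delete=True cannot be opened a second time by name while it is still open, so the param delete=False is needed.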

Solution

I will submit a PR.

About shlex.split() & "interactive" mode

  1. Replace shlex.split() with an explicit list of arguments
  2. Set the Java encoding with the option -Dfile.encoding=UTF-8
    task = None
    __base_dir = Path(__file__).parent.absolute()
    __bin_dir = Path(f'{__base_dir}/farasa_bin')
    __bin_lib_dir = Path(f'{__bin_dir}/lib')

    # shlex.split() drops backslashes from Windows paths, so build the command as a list
    # set the Java encoding with the option `-Dfile.encoding=UTF-8`
    __BASE_CMD = ['java', '-Dfile.encoding=UTF-8', '-jar']
    __APIs = {
        'segment': __BASE_CMD + [str(__bin_lib_dir / 'FarasaSegmenterJar.jar')],
        'stem': __BASE_CMD + [str(__bin_lib_dir / 'FarasaSegmenterJar.jar'), '-l', 'true'],
        'NER': __BASE_CMD + [str(__bin_dir / 'FarasaNERJar.jar')],
        'POS': __BASE_CMD + [str(__bin_dir / 'FarasaPOSJar.jar')],
        'diacritize': __BASE_CMD + [str(__bin_dir / 'FarasaDiacritizeJar.jar')]
    }
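
The same idea can be written as a small helper. This is a sketch only: build_java_cmd and the jar location are hypothetical and not part of farasapy. JVM options such as -Dfile.encoding must come before -jar, because anything after the jar path is passed to the program as its own arguments.

```python
from pathlib import PureWindowsPath


def build_java_cmd(jar_path, *program_args, encoding='UTF-8'):
    """Return an argument list for subprocess; no shell-style parsing is involved."""
    return ['java', f'-Dfile.encoding={encoding}', '-jar', str(jar_path), *program_args]


# PureWindowsPath shows Windows separators on any platform (illustrative path).
jar = (PureWindowsPath(r'E:\workspace\python\farasapy')
       / 'farasa' / 'farasa_bin' / 'lib' / 'FarasaSegmenterJar.jar')
cmd = build_java_cmd(jar, '-l', 'true')
print(cmd)
```

Because the path is a single list element, backslashes and spaces in it reach the JVM unchanged.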

About "standalone" mode

Pass delete=False, as in tempfile.NamedTemporaryFile(dir='...', delete=False):

    def _run_task(self, btext):
        assert btext is not None
        with tempfile.NamedTemporaryFile(dir=f'{self.__base_dir}/tmp', delete=False) as itmp, \
                tempfile.NamedTemporaryFile(dir=f'{self.__base_dir}/tmp', delete=False) as otmp:
            itmp.write(btext)
            itmp.flush()  # https://stackoverflow.com/questions/46004774/python-namedtemporaryfile-appears-empty-even-after-data-is-written
            proc = subprocess.run(self.__APIs[self.task] + ['-i', itmp.name, '-o', otmp.name], \
                                  capture_output=True)
            if proc.returncode == 0:
                return otmp.read().decode('utf8').strip()
            else:
                print("error occured!", proc.stderr)
                print("return code:", proc.returncode)
                raise Exception('Internal Error occured!')
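
With delete=False the temporary files are no longer removed automatically. The following is a sketch of one possible way to clean them up, not the code merged in the PR: run_with_temp_files is a hypothetical helper, and a small Python child process stands in for the Farasa jar so the example runs without Java.

```python
import os
import subprocess
import sys
import tempfile


def run_with_temp_files(btext, cmd):
    """Run cmd with '-i <input> -o <output>' and return the decoded output."""
    itmp = tempfile.NamedTemporaryFile(delete=False)
    otmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        itmp.write(btext)
        # Close both handles so the child process can open the files by name.
        itmp.close()
        otmp.close()
        proc = subprocess.run(cmd + ['-i', itmp.name, '-o', otmp.name],
                              capture_output=True)
        if proc.returncode != 0:
            raise RuntimeError(proc.stderr.decode('utf8', errors='replace'))
        with open(otmp.name, 'rb') as result:
            return result.read().decode('utf8').strip()
    finally:
        # close() is a no-op on an already closed file.
        itmp.close()
        otmp.close()
        os.unlink(itmp.name)
        os.unlink(otmp.name)


# Stand-in for the Java command: copies the -i file to the -o file.
COPY_CMD = [sys.executable, '-c',
            'import shutil, sys; shutil.copyfile(sys.argv[2], sys.argv[4])']

print(run_with_temp_files('hello'.encode('utf8'), COPY_CMD))
```

Closing the handles before launching the child is a conservative choice here; the try/finally block makes sure both files are deleted whether or not the command succeeds.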

Other

The code style is not good; the core code should be reformatted following http://google.github.io/styleguide/pyguide.html

Test

I tested the changed code on Windows 10 and macOS Catalina, and it passed.

hefengxian commented 4 years ago

Regarding formatting, thanks for pointing out this issue. I thought that my editor would do that by default. However, I realized that I need to configure it first. I did a little research and found black, yapf and pep8 to be the most popular formatters. What is your preferred one? For me, I think I will adopt yapf. It has been drawing more attention recently. By the way, I should also add a badge to the README for the code formatting style I will follow, just for the sake of clarity.

I use the PyCharm IDE to format code. It is a very good tool for developing Python applications, and there is also a free version, PyCharm Community.

I think the Google code style is almost the same as PEP 8, and I prefer PEP 8.

MagedSaeed commented 4 years ago

Thanks for your work @hefengxian, I will try to merge and update tonight.

MagedSaeed commented 4 years ago

Hey @hefengxian, I am really sorry. I ran into some issues over the past two days and was unable to find a suitable time to look into and think about the tempfile issue. I hope I will be able to do so tomorrow and close this pull request.

MagedSaeed commented 4 years ago

@hefengxian I tested the changes on my machine and merged the pull request. Thanks for your effort, really appreciate it. Looking forward to your collaborations and helpful comments.