apertium / apertium-python

now you can even use apertium from python
GNU General Public License v3.0
30 stars 27 forks source link

GSoC 2019 Progress Updates #40

Open sushain97 opened 5 years ago

sushain97 commented 5 years ago

This is an issue so that notifications come through nicely.

singh-lokendra commented 5 years ago

Black: The Uncompromising Code Formatter

  1. Should I go ahead and integrate Black with apertium-python? A badge is also available for ReadMe file
  2. Currently there are a lot of branches in the project. Should I commit changes to master branch each time, or use some other branch for commiting the code?
sushain97 commented 5 years ago

Personally, I think 88 columns is too short. However, I really don't care that much. Also don't care about single/double quotes (despite having my own preference for the former). Don't worry about setting up commit hooks for now.

Should I go ahead and integrate Black with apertium-python? A badge is also available for ReadMe file

As far as checking goes, you can send a PR to switch the flake8 checks to black. This also requires changing the requirements-dev.txt (we should really just switch to pipenv).

Currently there are a lot of branches in the project. Should I commit changes to master branch each time, or use some other branch for commiting the code?

Use your own branches and send PRs to master that I'll review.

We really shouldn't spend any more time on this than the one day that we've already spent. The existing setup is perfectly fine so please don't spend more than an hour or two on changing style checking stuff tomorrow. I'd much rather you get started on the real work and we worry about this later.

singh-lokendra commented 5 years ago

Few Queries:

sushain97 commented 5 years ago
  • I've put the black code formatter on hold for now. I'll use it locally and ensure that linter checks are also passed

SGTM.

  • I have started working on apertium.analyzers Its calling eng-morph.mode which in turn calls lt-proc.exe So I am looking at its source . I am assuming this to be a good starting point

Exactly where I would start.

  • The lt-proc is dependent on a lot files. Just the source of fst_processor gave me goosebumps :D

:D I think to start we'll only need to make the edges of the interface usable in Python? All of the FST processor itself doesn't need to be visible from Python.

I am confused regarding the approach to be taken for making the swig wrapper. Should I start with lt-proc and make wrappers for all the files included or sequentially write wrappers for every file in lttoolbox

My perspective is that we should start with wrapping as little as possible in order to get e.g. eng-morph.mode working without subprocesses. That definitely shouldn't require wrapping every single file in lttoolbox. I'm not particularly familiar with SWIG but I think this should be possible? @unhammer / @TinoDidriksen could confirm?

  • Using disutils Imo this should be used as it can handle various cases
  • Dynamic linking Need to compile each file seperately. Would require some sort of makefile. And specified naming convention needs to be followed, else import error while importing the python library.

Since distutils is part of the standard library, I'm perfectly fine with this approach. However, given that there's already a build system in place for lttoolbox (probably nothing too unconventional -- https://github.com/apertium/lttoolbox/blob/master/Makefile.am), you might have to use the dynamic linking approach.

  • Searching the logs directed me to swig interface file for hfst. This might be helpful.

This should serve as a great example. You might even want to copy its approach as much as possible. Take a look at the README: https://github.com/hfst/hfst/tree/master/python.

sushain97 commented 5 years ago

Note that if we did end up wrapping a large portion of lttoolbox based on some desire for such functionality, we should release that separately on PyPi and then depend on it here in the apertium module.

TinoDidriksen commented 5 years ago

Regarding SWIG: The needed functions are the ones that https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc calls - grep for fstp. and LtLocale::. You shouldn't need to wrap more than that for now.

sushain97 commented 5 years ago

Makes sense. The MVP is wrapping every function that lt_proc.cc calls when invoked with -n. A quick look makes that seem like:

https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L111 https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L230 https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L245 https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L270 https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L398-L400

Note that all the file operations can just be done in Python since that shims out to 'native' code anyway.

singh-lokendra commented 5 years ago
sushain97 commented 5 years ago

Working example: foo() is defined in foo.cpp, which calls bar(), defined in bar.cpp Successfully implemented a wrapper for foo.cpp.

Great!

Required for wrapping endProgram() and checkValidity()

There's no real need to wrap endProgram() or checkValidity(). You can just replicate what they do in Python. I suspect that if we wanted to wrap them, we we need to add a function declaration to them in the header file.

While working on an example for wrapping main(), ran into unexpected output. If the function name is modified to main2(), desired output is available. Will look into this and report back.

Why are we wrapping main? It looks to me like the SWIG docs recommend against it and I don't think it's needed either. The lines I noted above are really all that matters. We would need to replicate calls to them inside the Python but that's fine (especially for an MVP).

All header files require an interface file. SWIG provides header files for some standard libraries.

We don't need to actually call any code that hits getopt or libgen?

TinoDidriksen commented 5 years ago

The SWIG .i file only needs to list the functions we want exported. So just don't list things from cstdlib or getopt or etc. Basically grep out the FST calls from lt_proc.cc and stuff those in a .i which includes the library headers in %{ %}.

The idea was never to wrap lt_proc.cc - it was to look in lt_proc.cc for what needs to be wrapped.

sushain97 commented 5 years ago

@Vaydheesh any luck? Tino's advice should be helpful.

singh-lokendra commented 5 years ago

I had exam on 14 and my next exam is tomorrow. So couldn't look into it. I'll try this after the exam

sushain97 commented 5 years ago

Sounds good. Best of luck on your exams!

singh-lokendra commented 5 years ago
sushain97 commented 5 years ago

Hope your exams went well. That sounds like a good start!

I'd like to see a PR soon (maybe tomorrow) even if it's a WIP. Tino and/or I can take a quick look to double check that you're not going down the wrong path.

edit: note that everything should probably go in a Makefile, a la https://github.com/hfst/hfst/blob/master/python/Makefile.am

sushain97 commented 5 years ago

@Vaydheesh how's it going?

singh-lokendra commented 5 years ago

Also,

sushain97 commented 5 years ago

I'm a bit confused. You should be using tempfile to create new temporary output and input files for each call. They should never be reused (at least for now). If https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryFile isn't working well for you, try https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile.

singh-lokendra commented 5 years ago

create new temporary output and input files for each call.

I am using separate files for input and output

  • If I create new file, write the text to be analysed in input_file, call init_analysis(), the output is not flushed to output file.
  • When I call init_analysis() for existing files, with text to be analysed present in the file, I am getting desired output
sushain97 commented 5 years ago

Hmm, okay. Let me know if I can help (what's going wrong?). Something straightforward like

with tempfile.NamedTemporaryFile('w') as input_file, tempfile.NamedTemporaryFile('r') as output_file:
    input_file.write(input_text)
    input_file.flush() # might not be needed
    init_analysis(automorf_path, input_file.name, output_file.name)
    output = output_file.read()

If you can get the above working, don't worry about file descriptors for now. I'd rather get something to work and then improve it than immediately trying to jump to the optimal approach.

singh-lokendra commented 5 years ago

This is what I've run on interpreter

import tempfile
import analysis
with tempfile.NamedTemporaryFile('w') as input_file, tempfile.NamedTemporaryFile('r') as output_file:
    input_text = 'cats'
    automorf_path = "/usr/share/apertium/apertium-eng/eng.automorf.bin"
    input_file.write(input_text)
    input_file.flush() # might not be needed
    x = analysis.FST()
    x.init_analysis(automorf_path, input_file.name, output_file.name)
    output = output_file.read()
    print(output)
    input() # pause 

Output: 4 for writing 'cats'

When I checked the temporary file created in /tmp using cat $filename

sushain97 commented 5 years ago

print(output) shouldn't print 4. It should print whatever is in the file... Very odd.

I'm not sure if this will help at all but try using tempfile.NamedTemporaryFile('r', delete=False).

sushain97 commented 5 years ago

Also, instead of using input() to pause, do something like breakpoint(), then you can also inspect things.

singh-lokendra commented 5 years ago

print(output) shouldn't print 4. It should print whatever is in the file... Very odd

4 is output of input_file.write(input_text) print(output) doesn't print anything

singh-lokendra commented 5 years ago

print(output) shouldn't print 4. It should print whatever is in the file... Very odd.

I'm not sure if this will help at all but try using tempfile.NamedTemporaryFile('r', delete=False).

The file persists, but it is empty

sushain97 commented 5 years ago

Try adding some debugging to the C++ (or using gdb I suppose) to figure out whether the issue is that there's no input being read or the output isn't being written? You can also try using https://en.cppreference.com/w/cpp/io/c/tmpfile and passing in/out strings. I don't like that approach since it requires more code in the C++ wrapper but lets see if we can get something to work first.

singh-lokendra commented 5 years ago
void init_analysis(char *automorf_path, char *input_text, char output_text[50])
{
  setDictionaryCaseMode(true);
  LtLocale::tryToSetLocale();
  FILE *in = fopen(automorf_path, "rb");
  load(in);
  initAnalysis();
  checkValidity();
  FILE *input = fopen("text_in.txt", "wb+");
  FILE *output = fopen("text_out.txt", "wb+");
  fputs(input_text, input);
  rewind(input);    
  analysis(input, output);
  // fputs("foobar", output);
  rewind(output);    
  fgets(output_text, 50, output);
  fclose(in);
  fclose(input);
  fclose(output);
}
sushain97 commented 5 years ago

Ah, I thought already you had it working with non-temporary files. I'm not really sure why your example doesn't work. Maybe https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L308-L311? Shouldn't be relevant unless you're on Windows though. You shouldn't need to open them in append mode but that shouldn't matter.

@TinoDidriksen @unhammer any ideas? Probably something obvious.

unhammer commented 5 years ago

What does the input contain? There are certain inputs that I've seen can lead lt-proc into endless loops (typically when stuff hasn't been through deformatters)

sushain97 commented 5 years ago

Hmm, @Vaydheesh was using 'cats' above which should work without a deformatter, no?

$ echo 'cats' | apertium-destxt | lt-proc -a eng.automorf.bin
^cats/cat<n><pl>$^./.<sent>$[][
]⏎
$ echo 'cats' | lt-proc -a eng.automorf.bin
^cats/cat<n><pl>$
sushain97 commented 5 years ago

Hmm... the issue could be that there's no terminating newline in the input file? e.g.

$ echo 'cats' > test
$ lt-proc -a eng.automorf.bin test
^cats/cat<n><pl>$
$ truncate -s -1 test
$ wc -c test
       4 test
$ lt-proc -a eng.automorf.bin test
$ # NO OUTPUT

Maybe ensure that you have one?

singh-lokendra commented 5 years ago

What does the input contain

I have tried using "cats" and "cats\0" and "cat" and "cat\0"

sushain97 commented 5 years ago

A null byte in a text file sounds like it could cause all sorts of issues. Given that our earlier problem was no output and when I try using a file without a (trailing) newline, I get no output, that sounds suspiciously like it might be your issue.

sushain97 commented 5 years ago

Note that passing the file through a deformatter first fixes it even when there is no newline:

$ cat test | apertium-destxt | lt-proc -a eng.automorf.bin
^cats/cat<n><pl>$^./.<sent>$[]⏎                 
$ cat test | lt-proc -a eng.automorf.bin
$ # NO OUTPUT
unhammer commented 5 years ago

passing undeformatted text through lt-proc is pretty much Undefined Behaviour, so I would start by ensuring whatever input there is has been through e.g. apertium-destxt

sushain97 commented 5 years ago

Makes sense. Since we're just trying to verify the lt-proc -a wrapper, lets manually pass deformatted input.

singh-lokendra commented 5 years ago

Compiling the observations while working on various approaches:

So I tried, passing cats\n to previous python code, and this is the result '^cats/cat<n><pl>$\n'. The newline character can be stripped

Few Doubts:

sushain97 commented 5 years ago
  • When input is written on file stream using c++, irrespective of its mode, the file stream needs to be reopened, before passing to analysis. Using fflush() doesn't work
  • If input file isn't reopened, then it gives continuous output, fflush() not working.
  • If data is to be read from output file, it needs to be reopened as well, rewind() doesn't work

This is all very odd. However, I'm fine leaving it a bit sub-optimal for now and getting something working first. We can open issues to clean it up.

So I tried, passing cats\n to previous python code, and this is the result '^cats/cat$\n'. The newline character can be stripped

Don't strip or modify anything. The input to lt-proc -a should always be the output of a deformatter, e.g. apertium-destext. This can be seen in the current Python: https://github.com/apertium/apertium-python/blob/master/apertium/analysis/__init__.py#L63. For now, we can leave this subprocess call intact. Eventually, we'll either need to wrap it as well or replicate its functionality in Python (probably the former).

when passing file descriptor instead of file path, facing the issue of no output. Should I fix this, or just pass the path of temporary files?

Let's just pass around file paths for now, we can fix this later as well. We'll file an issue once we merge the code that works with file paths.

Should I start working on wrapper for the apertium-destxt?

Let's punt this until we get the lt-proc one completely working and integrated. My hope is that you'll be able to churn these things out faster once you get one fully piped through :)

Should I raise an error as well? If yes, then what should be the error message?

For now, let's just raise a generic Exception with something like "FST invalid". I think that we can redirect stderr to a PIPE prior to running the C++, capturing the output and then using that to form the error message. Another thing for an issue.

Good progress! Looking forward to an updated PR soon. Note that if you want, you can submodule lttoolbox in this repo, just to make development/testing easier for other folks. It'll also serve as a good marker for what commit things have been validated to work. I don't know exactly how the SWIG build process looks so ignore me if that doesn't make sense (I do want to see it in a PR though).

singh-lokendra commented 5 years ago
sushain97 commented 5 years ago
  • Will submodule lttoolbox. After completing the wrapper, maybe we could directly clone latest stable release during the installation process.

Unfortunately, that won't work. If you run pip install apertium, you should 100% not be forced to download and all compile Apertium from source. Eventually, the Makefile and interfaces need to go into the lttoolbox repo itself and the result included with releases (similar to hfst). However, until we have a stable wrapper surface for the API, we can use a submodule just to make it easier for testing, etc.

  • Will integrate the library with the output of apertium-destxt

This is already done by the snippet I referenced.

Good luck with your exam!

singh-lokendra commented 5 years ago
sushain97 commented 5 years ago
  • Finally got the autotools to build the wrapper

Awesome!

You should use setuptools if possible. It's the recommended way to do this stuff. What you're looking at is probably old and outdated.

Huh? I'm more interested in testing the apertium python library in this repo rather than the wrappers themselves.

Also, I'm guessing your exams are over now? I hope they went well. I'll be expecting daily updates again now :) Looking forward to your next couple PRs so I can try testing things.

sushain97 commented 5 years ago

@Vaydheesh how's it going? Is it possible for you to split up the PR like I mentioned so that I could take a peek at things?

singh-lokendra commented 5 years ago

My college is over now and I can devote all of my time for this project. I have submitted the PR for review in lttoolbox.

sushain97 commented 5 years ago

Sounds good! Taking a look right now. Can you also amend the PR in this repo?

sushain97 commented 5 years ago

@Vaydheesh any updates? Don't forget to post daily updates or we'll have to schedule video chats :)

singh-lokendra commented 5 years ago

Oops.. Really forgot about the daily updates. Here is what I've been working on.

sushain97 commented 5 years ago

Oops.. Really forgot about the daily updates.

Please be more careful on this note :)

I'm not really certain what's going on with the Windows paste you provided. I would try moving from the hello world example that works as slowly as possible to the lttoolbox version, i.e. in baby steps. Figuring out exactly where things start breaking is the key.

Your PR I reviewed looks on the right track. I left some comments. I just landed a bunch of changes in master switching us to the pipenv world and such. Please update your branch. There should be a couple conflicts but nothing serious.

sushain97 commented 5 years ago

@Vaydheesh any luck?

singh-lokendra commented 5 years ago

Trying to find the cause of import error on linux. Will switch to windows after getting things to work on windows