GSoC 2019 Progress Updates

sushain97 commented 5 years ago

This is an issue so that notifications come through nicely.

singh-lokendra commented 5 years ago

Black: The Uncompromising Code Formatter

max-line-length is set to 160 in .flake8, whereas default support is for 88. Should I modify it to 160?
- black uses double quotes, so it'll replace single quotes
- Regarding its integration with the project, few changes would be required
- use pre-commit to add hooks, so that black runs automatically every time commits are made. Every new developer would need to run pre-commit install to install the hooks.
- Add: .pre-commit-config.yaml, Should I use python3.6 for language_version?
- black --check project_path/ can be used to check the code during CI tests. If some changes are required then exit code would be 1 and builds will fail.
- Sample output of commits:
  
  $ git commit -m 'test_commit' [WARNING] Unstaged files detected. [INFO] Stashing unstaged files to C:\Users\admin/.cache\pre-commit\patch1557367025. black....................................................................Passed Flake8...................................................................Passed [INFO] Restored changes from C:\Users\admin/.cache\pre-commit\patch1557367025. [master c2f7890] test_commit 1 file changed, 10 insertions(+), 14 deletions(-)

Should I go ahead and integrate Black with apertium-python? A badge is also available for ReadMe file
Currently there are a lot of branches in the project. Should I commit changes to master branch each time, or use some other branch for commiting the code?

sushain97 commented 5 years ago

Personally, I think 88 columns is too short. However, I really don't care that much. Also don't care about single/double quotes (despite having my own preference for the former). Don't worry about setting up commit hooks for now.

Should I go ahead and integrate Black with apertium-python? A badge is also available for ReadMe file

As far as checking goes, you can send a PR to switch the flake8 checks to black. This also requires changing the requirements-dev.txt (we should really just switch to pipenv).

Currently there are a lot of branches in the project. Should I commit changes to master branch each time, or use some other branch for commiting the code?

Use your own branches and send PRs to master that I'll review.

We really shouldn't spend any more time on this than the one day that we've already spent. The existing setup is perfectly fine so please don't spend more than an hour or two on changing style checking stuff tomorrow. I'd much rather you get started on the real work and we worry about this later.

singh-lokendra commented 5 years ago

I've put the black code formatter on hold for now. I'll use it locally and ensure that linter checks are also passed
I have started working on apertium.analyzers Its calling eng-morph.mode which in turn calls lt-proc.exe So I am looking at its source . I am assuming this to be a good starting point
The lt-proc is dependent on a lot files. Just the source of fst_processor gave me goosebumps :D
I've read the SWIG Python documentation to get a basic idea of handling various cases.
Now understanding the flow of control, before starting the work on swig wrapper
Searching the logs directed me to swig interface file for hfst. This might be helpful.

Few Queries:

I am confused regarding the approach to be taken for making the swig wrapper. Should I start with lt-proc and make wrappers for all the files included or sequentially write wrappers for every file in lttoolbox
Regarding the complilation of wrappers which approach should be used:
- Using disutils Imo this should be used as it can handle various cases
- Dynamic linking Need to compile each file seperately. Would require some sort of makefile. And specified naming convention needs to be followed, else import error while importing the python library.
- Static linking Not even SWIG suggests using this one

sushain97 commented 5 years ago

I've put the black code formatter on hold for now. I'll use it locally and ensure that linter checks are also passed

SGTM.

I have started working on apertium.analyzers Its calling eng-morph.mode which in turn calls lt-proc.exe So I am looking at its source . I am assuming this to be a good starting point

Exactly where I would start.

The lt-proc is dependent on a lot files. Just the source of fst_processor gave me goosebumps :D

:D I think to start we'll only need to make the edges of the interface usable in Python? All of the FST processor itself doesn't need to be visible from Python.

I am confused regarding the approach to be taken for making the swig wrapper. Should I start with lt-proc and make wrappers for all the files included or sequentially write wrappers for every file in lttoolbox

My perspective is that we should start with wrapping as little as possible in order to get e.g. eng-morph.mode working without subprocesses. That definitely shouldn't require wrapping every single file in lttoolbox. I'm not particularly familiar with SWIG but I think this should be possible? @unhammer / @TinoDidriksen could confirm?

Using disutils Imo this should be used as it can handle various cases

Dynamic linking Need to compile each file seperately. Would require some sort of makefile. And specified naming convention needs to be followed, else import error while importing the python library.

Since distutils is part of the standard library, I'm perfectly fine with this approach. However, given that there's already a build system in place for lttoolbox (probably nothing too unconventional -- https://github.com/apertium/lttoolbox/blob/master/Makefile.am), you might have to use the dynamic linking approach.

Searching the logs directed me to swig interface file for hfst. This might be helpful.

This should serve as a great example. You might even want to copy its approach as much as possible. Take a look at the README: https://github.com/hfst/hfst/tree/master/python.

sushain97 commented 5 years ago

Note that if we did end up wrapping a large portion of lttoolbox based on some desire for such functionality, we should release that separately on PyPi and then depend on it here in the apertium module.

TinoDidriksen commented 5 years ago

Regarding SWIG: The needed functions are the ones that https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc calls - grep for fstp. and LtLocale::. You shouldn't need to wrap more than that for now.

sushain97 commented 5 years ago

Makes sense. The MVP is wrapping every function that lt_proc.cc calls when invoked with -n. A quick look makes that seem like:

https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L111 https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L230 https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L245 https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L270 https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L398-L400

Note that all the file operations can just be done in Python since that shims out to 'native' code anyway.

singh-lokendra commented 5 years ago

Instead of creating swig wrapper for every file, focus only on lt-proc
The code is separated in different files, but no need to write wrapper for all of them. Unable to find something relevant to this problem in the docs.
Hence implemented a working example of the above problem.
Working example: foo() is defined in foo.cpp, which calls bar(), defined in bar.cpp Successfully implemented a wrapper for foo.cpp.
ToDo: Check the above implementation for classes, class is defined in a different file, and object is created in another file. This would be required for class FSTProcessor.
All the examples in documentation have function declaration in a seperate header file. Implemented swig interface file for function declaration in the cpp file. Required for wrapping endProgram() and checkValidity()
While working on an example for wrapping main(), ran into unexpected output. If the function name is modified to main2(), desired output is available. Will look into this and report back.
All header files require an interface file. SWIG provides header files for some standard libraries.
out of cstdlib, getopt, iostream and libgen the interface file is available only for iostream. Will look into usage of each header file, and implement an interface file for required header files.

sushain97 commented 5 years ago

Working example: foo() is defined in foo.cpp, which calls bar(), defined in bar.cpp Successfully implemented a wrapper for foo.cpp.

Great!

Required for wrapping endProgram() and checkValidity()

There's no real need to wrap endProgram() or checkValidity(). You can just replicate what they do in Python. I suspect that if we wanted to wrap them, we we need to add a function declaration to them in the header file.

While working on an example for wrapping main(), ran into unexpected output. If the function name is modified to main2(), desired output is available. Will look into this and report back.

Why are we wrapping main? It looks to me like the SWIG docs recommend against it and I don't think it's needed either. The lines I noted above are really all that matters. We would need to replicate calls to them inside the Python but that's fine (especially for an MVP).

All header files require an interface file. SWIG provides header files for some standard libraries.

We don't need to actually call any code that hits getopt or libgen?

TinoDidriksen commented 5 years ago

The SWIG .i file only needs to list the functions we want exported. So just don't list things from cstdlib or getopt or etc. Basically grep out the FST calls from lt_proc.cc and stuff those in a .i which includes the library headers in %{ %}.

The idea was never to wrap lt_proc.cc - it was to look in lt_proc.cc for what needs to be wrapped.

sushain97 commented 5 years ago

@Vaydheesh any luck? Tino's advice should be helpful.

singh-lokendra commented 5 years ago

I had exam on 14 and my next exam is tomorrow. So couldn't look into it. I'll try this after the exam

sushain97 commented 5 years ago

Sounds good. Best of luck on your exams!

singh-lokendra commented 5 years ago

Working on a wrapper that can copy bare functionality of lt-proc -w
While linking the object files with shared library, use flags for libxml
Demangled undefined symbols in shared library using c++filt
Resolved various undefined symbols in shared library using nm -Cu

sushain97 commented 5 years ago

Hope your exams went well. That sounds like a good start!

I'd like to see a PR soon (maybe tomorrow) even if it's a WIP. Tino and/or I can take a quick look to double check that you're not going down the wrong path.

edit: note that everything should probably go in a Makefile, a la https://github.com/hfst/hfst/blob/master/python/Makefile.am

sushain97 commented 5 years ago

@Vaydheesh how's it going?

singh-lokendra commented 5 years ago

I've implemented the file descriptor approach
When I call init_analysis() for existing files, with text to be analysed present in the file, I am getting desired output
mode used for input and output files are r and w respectively.
If I create new file, write the text to be analysed in input_file, call init_analysis(), the output is not flushed to output file.
I've tried using r+ and w+ mode for input file, and w, r+, w+ for output file.
Trying to fix this issue, by isolating the function calls

Also,

Is it possible to get the mode in which a file stream is opened in C++, python equivalent is open('foo.txt').mode

sushain97 commented 5 years ago

I'm a bit confused. You should be using tempfile to create new temporary output and input files for each call. They should never be reused (at least for now). If https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryFile isn't working well for you, try https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile.

singh-lokendra commented 5 years ago

create new temporary output and input files for each call.

I am using separate files for input and output

I started my work with tempfile, that didn't work
Hence I tried with this, but it didn't work as well

If I create new file, write the text to be analysed in input_file, call init_analysis(), the output is not flushed to output file.

Only then I tried using

When I call init_analysis() for existing files, with text to be analysed present in the file, I am getting desired output

Now I am trying to find the cause using brute force
I'll implement temporary file in the actual code

sushain97 commented 5 years ago

Hmm, okay. Let me know if I can help (what's going wrong?). Something straightforward like

with tempfile.NamedTemporaryFile('w') as input_file, tempfile.NamedTemporaryFile('r') as output_file:
    input_file.write(input_text)
    input_file.flush() # might not be needed
    init_analysis(automorf_path, input_file.name, output_file.name)
    output = output_file.read()

If you can get the above working, don't worry about file descriptors for now. I'd rather get something to work and then improve it than immediately trying to jump to the optimal approach.

singh-lokendra commented 5 years ago

This is what I've run on interpreter

import tempfile
import analysis
with tempfile.NamedTemporaryFile('w') as input_file, tempfile.NamedTemporaryFile('r') as output_file:
    input_text = 'cats'
    automorf_path = "/usr/share/apertium/apertium-eng/eng.automorf.bin"
    input_file.write(input_text)
    input_file.flush() # might not be needed
    x = analysis.FST()
    x.init_analysis(automorf_path, input_file.name, output_file.name)
    output = output_file.read()
    print(output)
    input() # pause

Output: 4 for writing 'cats'

When I checked the temporary file created in /tmp using cat $filename

One file has the text cats. file $filename prints /tmp/tmp5irc7w1t: ASCII text, with no line terminators
No output for the other file. file $filename prints /tmp/tmp704eeaz1: empty

sushain97 commented 5 years ago

print(output) shouldn't print 4. It should print whatever is in the file... Very odd.

I'm not sure if this will help at all but try using tempfile.NamedTemporaryFile('r', delete=False).

sushain97 commented 5 years ago

Also, instead of using input() to pause, do something like breakpoint(), then you can also inspect things.

singh-lokendra commented 5 years ago

print(output) shouldn't print 4. It should print whatever is in the file... Very odd

4 is output of input_file.write(input_text) print(output) doesn't print anything

singh-lokendra commented 5 years ago

print(output) shouldn't print 4. It should print whatever is in the file... Very odd.

I'm not sure if this will help at all but try using tempfile.NamedTemporaryFile('r', delete=False).

The file persists, but it is empty

sushain97 commented 5 years ago

Try adding some debugging to the C++ (or using gdb I suppose) to figure out whether the issue is that there's no input being read or the output isn't being written? You can also try using https://en.cppreference.com/w/cpp/io/c/tmpfile and passing in/out strings. I don't like that approach since it requires more code in the C++ wrapper but lets see if we can get something to work first.

singh-lokendra commented 5 years ago

When I use following code, input_text gets written on the text_in.txt file, but text_out.txt is being filled with character ?, with its size increasing continuously, due to which I have to kill the process

void init_analysis(char *automorf_path, char *input_text, char output_text[50])
{
  setDictionaryCaseMode(true);
  LtLocale::tryToSetLocale();
  FILE *in = fopen(automorf_path, "rb");
  load(in);
  initAnalysis();
  checkValidity();
  FILE *input = fopen("text_in.txt", "wb+");
  FILE *output = fopen("text_out.txt", "wb+");
  fputs(input_text, input);
  rewind(input);    
  analysis(input, output);
  // fputs("foobar", output);
  rewind(output);    
  fgets(output_text, 50, output);
  fclose(in);
  fclose(input);
  fclose(output);
}

Upon replacing analysis(input, output); with fputs("foobar", output); the output file has the desired content

sushain97 commented 5 years ago

Ah, I thought already you had it working with non-temporary files. I'm not really sure why your example doesn't work. Maybe https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L308-L311? Shouldn't be relevant unless you're on Windows though. You shouldn't need to open them in append mode but that shouldn't matter.

@TinoDidriksen @unhammer any ideas? Probably something obvious.

unhammer commented 5 years ago

What does the input contain? There are certain inputs that I've seen can lead lt-proc into endless loops (typically when stuff hasn't been through deformatters)

sushain97 commented 5 years ago

Hmm, @Vaydheesh was using 'cats' above which should work without a deformatter, no?

$ echo 'cats' | apertium-destxt | lt-proc -a eng.automorf.bin
^cats/cat<n><pl>$^./.<sent>$[][
]⏎
$ echo 'cats' | lt-proc -a eng.automorf.bin
^cats/cat<n><pl>$

sushain97 commented 5 years ago

Hmm... the issue could be that there's no terminating newline in the input file? e.g.

$ echo 'cats' > test
$ lt-proc -a eng.automorf.bin test
^cats/cat<n><pl>$
$ truncate -s -1 test
$ wc -c test
       4 test
$ lt-proc -a eng.automorf.bin test
$ # NO OUTPUT

Maybe ensure that you have one?

singh-lokendra commented 5 years ago

What does the input contain

I have tried using "cats" and "cats\0" and "cat" and "cat\0"

sushain97 commented 5 years ago

A null byte in a text file sounds like it could cause all sorts of issues. Given that our earlier problem was no output and when I try using a file without a (trailing) newline, I get no output, that sounds suspiciously like it might be your issue.

sushain97 commented 5 years ago

Note that passing the file through a deformatter first fixes it even when there is no newline:

$ cat test | apertium-destxt | lt-proc -a eng.automorf.bin
^cats/cat<n><pl>$^./.<sent>$[]⏎                 
$ cat test | lt-proc -a eng.automorf.bin
$ # NO OUTPUT

unhammer commented 5 years ago

passing undeformatted text through lt-proc is pretty much Undefined Behaviour, so I would start by ensuring whatever input there is has been through e.g. apertium-destxt

sushain97 commented 5 years ago

Makes sense. Since we're just trying to verify the lt-proc -a wrapper, lets manually pass deformatted input.

singh-lokendra commented 5 years ago

Compiling the observations while working on various approaches:

Input string needs to be newline terminated (text has to be formatted properly)
Non newline terminated string gives blank output.
When input is written on file stream using c++, irrespective of its mode, the file stream needs to be reopened, before passing to analysis. Using fflush() doesn't work
If input file isn't reopened, then it gives continuous output, fflush() not working.
If data is to be read from output file, it needs to be reopened as well, rewind() doesn't work

So I tried, passing cats\n to previous python code, and this is the result '^cats/cat<n><pl>$\n'. The newline character can be stripped

This goes onto show the need of proper formatting required by FSTProcessor::analysis()
when passing file descriptor instead of file path, facing the issue of no output. Should I fix this, or just pass the path of temporary files?

Few Doubts:

Should I start working on wrapper for the apertium-destxt?
As suggested, I have implemented the check_validity() in python. wcerror is raised here, https://github.com/apertium/lttoolbox/blob/master/lttoolbox/fst_processor.cc#L3259, Should I raise an error as well? If yes, then what should be the error message?

sushain97 commented 5 years ago

When input is written on file stream using c++, irrespective of its mode, the file stream needs to be reopened, before passing to analysis. Using fflush() doesn't work

If input file isn't reopened, then it gives continuous output, fflush() not working.

If data is to be read from output file, it needs to be reopened as well, rewind() doesn't work

This is all very odd. However, I'm fine leaving it a bit sub-optimal for now and getting something working first. We can open issues to clean it up.

So I tried, passing cats\n to previous python code, and this is the result '^cats/cat$\n'. The newline character can be stripped

Don't strip or modify anything. The input to lt-proc -a should always be the output of a deformatter, e.g. apertium-destext. This can be seen in the current Python: https://github.com/apertium/apertium-python/blob/master/apertium/analysis/__init__.py#L63. For now, we can leave this subprocess call intact. Eventually, we'll either need to wrap it as well or replicate its functionality in Python (probably the former).

when passing file descriptor instead of file path, facing the issue of no output. Should I fix this, or just pass the path of temporary files?

Let's just pass around file paths for now, we can fix this later as well. We'll file an issue once we merge the code that works with file paths.

Should I start working on wrapper for the apertium-destxt?

Let's punt this until we get the lt-proc one completely working and integrated. My hope is that you'll be able to churn these things out faster once you get one fully piped through :)

Should I raise an error as well? If yes, then what should be the error message?

For now, let's just raise a generic Exception with something like "FST invalid". I think that we can redirect stderr to a PIPE prior to running the C++, capturing the output and then using that to form the error message. Another thing for an issue.

Good progress! Looking forward to an updated PR soon. Note that if you want, you can submodule lttoolbox in this repo, just to make development/testing easier for other folks. It'll also serve as a good marker for what commit things have been validated to work. I don't know exactly how the SWIG build process looks so ignore me if that doesn't make sense (I do want to see it in a PR though).

singh-lokendra commented 5 years ago

Will submodule lttoolbox. After completing the wrapper, maybe we could directly clone latest stable release during the installation process.
Working on makefile for building the wrapper.
Will integrate the library with the output of apertium-destxt
I also have exam tomorrow. So I'll continue the work after that

sushain97 commented 5 years ago

Will submodule lttoolbox. After completing the wrapper, maybe we could directly clone latest stable release during the installation process.

Unfortunately, that won't work. If you run pip install apertium, you should 100% not be forced to download and all compile Apertium from source. Eventually, the Makefile and interfaces need to go into the lttoolbox repo itself and the result included with releases (similar to hfst). However, until we have a stable wrapper surface for the API, we can use a submodule just to make it easier for testing, etc.

Will integrate the library with the output of apertium-destxt

This is already done by the snippet I referenced.

Good luck with your exam!

singh-lokendra commented 5 years ago

Finally got the autotools to build the wrapper
Now working on the unittests https://github.com/apertium/lttoolbox/tree/master/tests/lt_proc
Right now I am using disutils because swig has documentation of the same http://swig.org/Doc1.3/Python.html#Python_nn6 and libdivvun is also using disutils https://github.com/divvun/libdivvun/blob/master/python/setup.py.in#L7. Should I use setuptools instead of disutils?

sushain97 commented 5 years ago

Finally got the autotools to build the wrapper

Awesome!

Right now I am using disutils because swig has documentation of the same http://swig.org/Doc1.3/Python.html#Python_nn6 and libdivvun is also using disutils https://github.com/divvun/libdivvun/blob/master/python/setup.py.in#L7. Should I use setuptools instead of disutils?

You should use setuptools if possible. It's the recommended way to do this stuff. What you're looking at is probably old and outdated.

Now working on the unittests https://github.com/apertium/lttoolbox/tree/master/tests/lt_proc

Huh? I'm more interested in testing the apertium python library in this repo rather than the wrappers themselves.

Also, I'm guessing your exams are over now? I hope they went well. I'll be expecting daily updates again now :) Looking forward to your next couple PRs so I can try testing things.

sushain97 commented 5 years ago

@Vaydheesh how's it going? Is it possible for you to split up the PR like I mentioned so that I could take a peek at things?

singh-lokendra commented 5 years ago

My college is over now and I can devote all of my time for this project. I have submitted the PR for review in lttoolbox.

sushain97 commented 5 years ago

Sounds good! Taking a look right now. Can you also amend the PR in this repo?

sushain97 commented 5 years ago

@Vaydheesh any updates? Don't forget to post daily updates or we'll have to schedule video chats :)

singh-lokendra commented 5 years ago

Oops.. Really forgot about the daily updates. Here is what I've been working on.

At first I tried using autotools port for windows, but that was taking a lot of time, so I switched over to use simple commands for building the shared library.
MinGW reported error for to_wstring, which should've been patched few years ago. Hence I switch over to Nuven MinGW, which supports only 64 bit compilation.
After compiling the lttoolbox wrapper, importing it in python, reported that it could not find the required module. Apparently the swig documentation isn't updated regarding extension of shared library on windows-python. The required extension is *.pyd
After successfully compiling the hello-world of swig, for both 32 bit and 64 bit, I moved back to compiling the lttoolbox wrapper. But got stuck at a dead end. I've described the issue here, https://pastebin.com/BzH6M7Gm, and tried contacting unhammer on IRC
In the meantime I was working on implementing the pipe to function call. As you had suggested, I was working on regex to parse the mode files, to handle all the corner cases, only to realise that it was already implemented by arghya.
Now I am finishing the task and will update the PR within an hour or two.

sushain97 commented 5 years ago

Oops.. Really forgot about the daily updates.

Please be more careful on this note :)

I'm not really certain what's going on with the Windows paste you provided. I would try moving from the hello world example that works as slowly as possible to the lttoolbox version, i.e. in baby steps. Figuring out exactly where things start breaking is the key.

Your PR I reviewed looks on the right track. I left some comments. I just landed a bunch of changes in master switching us to the pipenv world and such. Please update your branch. There should be a couple conflicts but nothing serious.

sushain97 commented 5 years ago

@Vaydheesh any luck?

singh-lokendra commented 5 years ago

Trying to find the cause of import error on linux. Will switch to windows after getting things to work on windows

apertium / apertium-python

GSoC 2019 Progress Updates #40