sushain97 opened this issue 5 years ago (status: Open)
Black: The Uncompromising Code Formatter
```
pre-commit install
```

to install the hooks.

```
black --check project_path/
```

can be used to check the code during CI tests. If some changes are required, the exit code will be 1 and the build will fail.

```
$ git commit -m 'test_commit'
[WARNING] Unstaged files detected.
[INFO] Stashing unstaged files to C:\Users\admin/.cache\pre-commit\patch1557367025.
black....................................................................Passed
Flake8...................................................................Passed
[INFO] Restored changes from C:\Users\admin/.cache\pre-commit\patch1557367025.
[master c2f7890] test_commit
 1 file changed, 10 insertions(+), 14 deletions(-)
```
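The exit-code gating described above can be sketched in Python. This is illustrative only: the helper just shells out and checks the return code, and since `black` itself may not be installed, the demo simulates its exit statuses with plain Python exits.

```python
import subprocess
import sys

def exits_clean(cmd):
    # In CI one would pass ['black', '--check', 'project_path/']:
    # black exits 0 when nothing needs reformatting and 1 otherwise.
    return subprocess.run(cmd).returncode == 0

# Simulated runs (plain Python exits stand in for black itself):
print(exits_clean([sys.executable, '-c', 'raise SystemExit(0)']))  # True -> build passes
print(exits_clean([sys.executable, '-c', 'raise SystemExit(1)']))  # False -> build fails
```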
Personally, I think 88 columns is too short. However, I really don't care that much. Also don't care about single/double quotes (despite having my own preference for the former). Don't worry about setting up commit hooks for now.
Should I go ahead and integrate Black with apertium-python? A badge is also available for the README file.
As far as checking goes, you can send a PR to switch the flake8 checks to black. This also requires changing requirements-dev.txt (we should really just switch to pipenv).
Currently there are a lot of branches in the project. Should I commit changes to the master branch each time, or use some other branch for committing the code?
Use your own branches and send PRs to master that I'll review.
We really shouldn't spend any more time on this than the one day that we've already spent. The existing setup is perfectly fine so please don't spend more than an hour or two on changing style checking stuff tomorrow. I'd much rather you get started on the real work and we worry about this later.
A few queries:
- I've put the Black code formatter on hold for now. I'll use it locally and ensure that the linter checks also pass.
SGTM.
- I have started working on apertium.analyzers. It's calling eng-morph.mode, which in turn calls lt-proc.exe, so I am looking at its source. I am assuming this to be a good starting point.
Exactly where I would start.
- lt-proc is dependent on a lot of files. Just the source of fst_processor gave me goosebumps :D
:D I think to start we'll only need to make the edges of the interface usable in Python? All of the FST processor itself doesn't need to be visible from Python.
I am confused regarding the approach to be taken for making the SWIG wrapper. Should I start with lt-proc and make wrappers for all the files it includes, or sequentially write wrappers for every file in lttoolbox?
My perspective is that we should start with wrapping as little as possible in order to get e.g. eng-morph.mode working without subprocesses. That definitely shouldn't require wrapping every single file in lttoolbox. I'm not particularly familiar with SWIG but I think this should be possible? @unhammer / @TinoDidriksen could confirm?
- Using distutils: IMO this should be used as it can handle various cases.
- Dynamic linking: each file needs to be compiled separately, which would require some sort of makefile. A specific naming convention also needs to be followed, else there's an import error while importing the Python library.
Since distutils is part of the standard library, I'm perfectly fine with this approach. However, given that there's already a build system in place for lttoolbox (probably nothing too unconventional -- https://github.com/apertium/lttoolbox/blob/master/Makefile.am), you might have to use the dynamic linking approach.
- Searching the logs directed me to the SWIG interface file for HFST. This might be helpful.
This should serve as a great example. You might even want to copy its approach as much as possible. Take a look at the README: https://github.com/hfst/hfst/tree/master/python.
Note that if we did end up wrapping a large portion of lttoolbox based on some desire for such functionality, we should release that separately on PyPI and then depend on it here in the apertium module.
Regarding SWIG: the needed functions are the ones that https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc calls - grep for `fstp.` and `LtLocale::`. You shouldn't need to wrap more than that for now.
Makes sense. The MVP is wrapping every function that lt_proc.cc calls when invoked with -n. A quick look makes that seem like:
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L111
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L230
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L245
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L270
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L398-L400
Note that all the file operations can just be done in Python since that shims out to 'native' code anyway.
Working example: foo() is defined in foo.cpp, which calls bar(), defined in bar.cpp. Successfully implemented a wrapper for foo.cpp.
Great!
Required for wrapping endProgram() and checkValidity()
There's no real need to wrap endProgram() or checkValidity(). You can just replicate what they do in Python. I suspect that if we wanted to wrap them, we would need to add a function declaration for them in the header file.
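Replicating checkValidity() on the Python side could look roughly like the sketch below. InvalidFSTError, FakeFST, and the valid() method name here are illustrative stand-ins for the eventual wrapped object, not actual lttoolbox API.

```python
class InvalidFSTError(Exception):
    """Raised instead of exit()-ing the way the C++ helper does."""

def check_validity(fst):
    # Mirror the intent of lt_proc.cc's checkValidity(): abort when the
    # loaded FST is not valid, but raise instead of calling exit().
    if not fst.valid():
        raise InvalidFSTError('FST invalid')

# Quick demonstration with a stand-in object (not the real wrapper):
class FakeFST:
    def __init__(self, ok):
        self.ok = ok
    def valid(self):
        return self.ok

check_validity(FakeFST(True))   # passes silently
try:
    check_validity(FakeFST(False))
except InvalidFSTError as e:
    print(e)  # FST invalid
```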
While working on an example for wrapping main(), I ran into unexpected output. If the function name is modified to main2(), the desired output is available. Will look into this and report back.
Why are we wrapping main? It looks to me like the SWIG docs recommend against it and I don't think it's needed either. The lines I noted above are really all that matters. We would need to replicate calls to them inside the Python but that's fine (especially for an MVP).
All header files require an interface file. SWIG provides header files for some standard libraries.
We don't need to actually call any code that hits getopt or libgen?
The SWIG .i file only needs to list the functions we want exported. So just don't list things from cstdlib or getopt etc. Basically, grep out the FST calls from lt_proc.cc and stuff those in a .i which includes the library headers in %{ %}.
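To make that concrete, a minimal .i file might look like the hypothetical sketch below. The module name, header paths, and exact method signatures are assumptions based on the calls listed earlier in this thread, not the real lttoolbox interface.

```swig
/* Hypothetical lttoolbox.i sketch -- names and signatures are assumptions. */
%module lttoolbox
%{
#include <lttoolbox/fst_processor.h>
#include <lttoolbox/lt_locale.h>
%}

/* Only the calls lt_proc.cc makes need to be listed; nothing else is exported. */
class FSTProcessor
{
public:
  void load(FILE *input);
  void initAnalysis();
  bool valid() const;
  void analysis(FILE *input, FILE *output);
};

class LtLocale
{
public:
  static void tryToSetLocale();
};
```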
The idea was never to wrap lt_proc.cc - it was to look in lt_proc.cc for what needs to be wrapped.
@Vaydheesh any luck? Tino's advice should be helpful.
I had an exam on the 14th and my next exam is tomorrow, so I couldn't look into it. I'll try this after the exam.
Sounds good. Best of luck on your exams!
`lt-proc -w`, `c++filt`, `nm -Cu`
Hope your exams went well. That sounds like a good start!
I'd like to see a PR soon (maybe tomorrow) even if it's a WIP. Tino and/or I can take a quick look to double check that you're not going down the wrong path.
edit: note that everything should probably go in a Makefile, a la https://github.com/hfst/hfst/blob/master/python/Makefile.am
@Vaydheesh how's it going?
- When I call init_analysis() for existing files, with the text to be analysed already present in the file, I am getting the desired output. The modes used for the input and output files are r and w respectively.
- If I create a new file, write the text to be analysed, and then call init_analysis(), the output is not flushed to the output file.
- I have also tried r+ and w+ modes for the input file, and w, r+, and w+ modes for the output file.
- Also checked open('foo.txt').mode.
I'm a bit confused. You should be using tempfile to create new temporary output and input files for each call. They should never be reused (at least for now). If https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryFile isn't working well for you, try https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile.
> create new temporary output and input files for each call.

I am using separate files for input and output.

- If I create a new file, write the text to be analysed in input_file, and call init_analysis(), the output is not flushed to the output file.
- When I call init_analysis() for existing files, with the text to be analysed present in the file, I am getting the desired output.
- Will use a temporary file in the actual code.

Hmm, okay. Let me know if I can help (what's going wrong?). Something straightforward like
```python
with tempfile.NamedTemporaryFile('w') as input_file, tempfile.NamedTemporaryFile('r') as output_file:
    input_file.write(input_text)
    input_file.flush()  # might not be needed
    init_analysis(automorf_path, input_file.name, output_file.name)
    output = output_file.read()
```
If you can get the above working, don't worry about file descriptors for now. I'd rather get something to work and then improve it than immediately trying to jump to the optimal approach.
This is what I've run in the interpreter:

```python
import tempfile
import analysis

with tempfile.NamedTemporaryFile('w') as input_file, tempfile.NamedTemporaryFile('r') as output_file:
    input_text = 'cats'
    automorf_path = "/usr/share/apertium/apertium-eng/eng.automorf.bin"
    input_file.write(input_text)
    input_file.flush()  # might not be needed
    x = analysis.FST()
    x.init_analysis(automorf_path, input_file.name, output_file.name)
    output = output_file.read()
    print(output)
    input()  # pause
```
Output: 4 (for writing 'cats').

When I checked the temporary files created in /tmp:

- `cat $filename` on the input file prints `cats`, and `file $filename` prints `/tmp/tmp5irc7w1t: ASCII text, with no line terminators`.
- `file $filename` on the output file prints `/tmp/tmp704eeaz1: empty`.
print(output) shouldn't print 4. It should print whatever is in the file... Very odd.
I'm not sure if this will help at all but try using tempfile.NamedTemporaryFile('r', delete=False). Also, instead of using input() to pause, do something like breakpoint(); then you can also inspect things.
> print(output) shouldn't print 4. It should print whatever is in the file... Very odd

4 is the return value of input_file.write(input_text). print(output) doesn't print anything.
> print(output) shouldn't print 4. It should print whatever is in the file... Very odd. I'm not sure if this will help at all but try using tempfile.NamedTemporaryFile('r', delete=False).

The file persists, but it is empty.
Try adding some debugging to the C++ (or using gdb, I suppose) to figure out whether the issue is that there's no input being read or the output isn't being written. You can also try using https://en.cppreference.com/w/cpp/io/c/tmpfile and passing in/out strings. I don't like that approach since it requires more code in the C++ wrapper, but let's see if we can get something to work first.
input_text gets written to the text_in.txt file, but text_out.txt is being filled with the character `?`, its size increasing continuously, due to which I have to kill the process.

```cpp
void init_analysis(char *automorf_path, char *input_text, char output_text[50])
{
    setDictionaryCaseMode(true);
    LtLocale::tryToSetLocale();
    FILE *in = fopen(automorf_path, "rb");
    load(in);
    initAnalysis();
    checkValidity();
    FILE *input = fopen("text_in.txt", "wb+");
    FILE *output = fopen("text_out.txt", "wb+");
    fputs(input_text, input);
    rewind(input);
    analysis(input, output);
    // fputs("foobar", output);
    rewind(output);
    fgets(output_text, 50, output);
    fclose(in);
    fclose(input);
    fclose(output);
}
```
If I replace `analysis(input, output);` with `fputs("foobar", output);`, the output file has the desired content.

Ah, I thought you already had it working with non-temporary files. I'm not really sure why your example doesn't work. Maybe https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L308-L311? Shouldn't be relevant unless you're on Windows though. You shouldn't need to open them in append mode, but that shouldn't matter.
@TinoDidriksen @unhammer any ideas? Probably something obvious.
What does the input contain? There are certain inputs that I've seen lead lt-proc into endless loops (typically when stuff hasn't been through the deformatters).
Hmm, @Vaydheesh was using 'cats' above, which should work without a deformatter, no?
```
$ echo 'cats' | apertium-destxt | lt-proc -a eng.automorf.bin
^cats/cat<n><pl>$^./.<sent>$[][
]⏎
$ echo 'cats' | lt-proc -a eng.automorf.bin
^cats/cat<n><pl>$
```
Hmm... the issue could be that there's no terminating newline in the input file? e.g.
```
$ echo 'cats' > test
$ lt-proc -a eng.automorf.bin test
^cats/cat<n><pl>$
$ truncate -s -1 test
$ wc -c test
4 test
$ lt-proc -a eng.automorf.bin test
$ # NO OUTPUT
```
Maybe ensure that you have one?
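Given the truncate experiment above, a tiny guard on the Python side could ensure a trailing newline before the input file is written. This is an illustrative sketch, not part of the actual wrapper code:

```python
def ensure_trailing_newline(text):
    # lt-proc (run without a deformatter) produced no output above when the
    # input file lacked a final newline; append one if it's missing.
    return text if text.endswith('\n') else text + '\n'

print(repr(ensure_trailing_newline('cats')))    # 'cats\n'
print(repr(ensure_trailing_newline('cats\n')))  # 'cats\n' (unchanged)
```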
> What does the input contain

I have tried using "cats", "cats\0", "cat", and "cat\0".
A null byte in a text file sounds like it could cause all sorts of issues. Given that our earlier problem was no output, and that when I try using a file without a trailing newline I get no output, that sounds suspiciously like it might be your issue.
Note that passing the file through a deformatter first fixes it even when there is no newline:
```
$ cat test | apertium-destxt | lt-proc -a eng.automorf.bin
^cats/cat<n><pl>$^./.<sent>$[]⏎
$ cat test | lt-proc -a eng.automorf.bin
$ # NO OUTPUT
```
Passing undeformatted text through lt-proc is pretty much undefined behaviour, so I would start by ensuring whatever input there is has been through e.g. apertium-destxt.
Makes sense. Since we're just trying to verify the lt-proc -a wrapper, let's manually pass deformatted input.
Compiling the observations while working on various approaches:
So I tried passing cats\n to the previous Python code, and this is the result: '^cats/cat<n><pl>$\n'. The newline character can be stripped.
A few doubts:
- When input is written to a file stream using C++, irrespective of its mode, the file stream needs to be reopened before passing it to analysis(); using fflush() doesn't work.
- If the input file isn't reopened, it gives continuous output; fflush() doesn't help.
- If data is to be read from the output file, it needs to be reopened as well; rewind() doesn't work.
This is all very odd. However, I'm fine leaving it a bit sub-optimal for now and getting something working first. We can open issues to clean it up.
> So I tried passing cats\n to the previous Python code, and this is the result: '^cats/cat<n><pl>$\n'. The newline character can be stripped.
Don't strip or modify anything. The input to lt-proc -a should always be the output of a deformatter, e.g. apertium-destxt. This can be seen in the current Python: https://github.com/apertium/apertium-python/blob/master/apertium/analysis/__init__.py#L63. For now, we can leave this subprocess call intact. Eventually, we'll either need to wrap it as well or replicate its functionality in Python (probably the former).
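A sketch of keeping that subprocess call: pipe the raw text through the deformatter before handing it to the wrapper. The cmd parameter exists only so the example can be exercised without apertium-destxt installed; in real use it would stay at its default.

```python
import subprocess
import sys

def deformat(text, cmd=('apertium-destxt',)):
    # Pipe text through the deformatter command and return its stdout.
    proc = subprocess.run(list(cmd), input=text.encode('utf-8'),
                          stdout=subprocess.PIPE, check=True)
    return proc.stdout.decode('utf-8')

# Exercised with a pass-through stand-in, since apertium-destxt may be absent:
echo = [sys.executable, '-c', 'import sys; sys.stdout.write(sys.stdin.read())']
print(deformat('cats\n', cmd=echo))  # cats
```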
When passing a file descriptor instead of a file path, I'm facing the issue of no output. Should I fix this, or just pass the paths of the temporary files?
Let's just pass around file paths for now, we can fix this later as well. We'll file an issue once we merge the code that works with file paths.
Should I start working on a wrapper for apertium-destxt?
Let's punt this until we get the lt-proc one completely working and integrated. My hope is that you'll be able to churn these things out faster once you get one fully piped through :)
Should I raise an error as well? If yes, then what should be the error message?
For now, let's just raise a generic Exception with something like "FST invalid". I think that we can redirect stderr to a PIPE prior to running the C++, capturing the output and then using that to form the error message. Another thing for an issue.
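The stderr-capture idea could look roughly like the sketch below. fd-level redirection is used because C++ writes bypass Python's sys.stderr; all names here are illustrative, not the project's actual code.

```python
import os
import tempfile

def call_capturing_stderr(fn, *args):
    # Redirect OS-level stderr (fd 2) into a temp file while fn runs, so
    # messages printed by C++ code can seed an Exception message later.
    saved_fd = os.dup(2)
    with tempfile.TemporaryFile() as tmp:
        os.dup2(tmp.fileno(), 2)
        try:
            result = fn(*args)
        finally:
            os.dup2(saved_fd, 2)
            os.close(saved_fd)
        tmp.seek(0)
        captured = tmp.read().decode('utf-8', 'replace')
    return result, captured

# Demonstration with a Python stand-in that writes directly to fd 2:
result, err = call_capturing_stderr(lambda: os.write(2, b'FST invalid') and 42)
print(err)  # FST invalid
```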
Good progress! Looking forward to an updated PR soon. Note that if you want, you can submodule lttoolbox in this repo, just to make development/testing easier for other folks. It'll also serve as a good marker for what commit things have been validated to work against. I don't know exactly how the SWIG build process looks, so ignore me if that doesn't make sense (I do want to see it in a PR though).
- Will submodule lttoolbox. After completing the wrapper, maybe we could directly clone the latest stable release during the installation process.
Unfortunately, that won't work. If you run pip install apertium, you should 100% not be forced to download and compile all of Apertium from source. Eventually, the Makefile and interfaces need to go into the lttoolbox repo itself and the result included with releases (similar to hfst). However, until we have a stable wrapper surface for the API, we can use a submodule just to make it easier for testing, etc.
- Will integrate the library with the output of apertium-destxt.
This is already done by the snippet I referenced.
Good luck with your exam!
- Finally got the autotools to build the wrapper
Awesome!
- Right now I am using distutils because SWIG has documentation for it (http://swig.org/Doc1.3/Python.html#Python_nn6) and libdivvun is also using distutils (https://github.com/divvun/libdivvun/blob/master/python/setup.py.in#L7). Should I use setuptools instead of distutils?
You should use setuptools if possible. It's the recommended way to do this stuff. What you're looking at is probably old and outdated.
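For reference, a setuptools-based build might define the SWIG extension like this. The module name, source file names, and flags are placeholders, not the project's actual configuration; in a real setup.py this object would be passed to setup(ext_modules=[ext]).

```python
from setuptools import Extension

# Hypothetical SWIG extension definition; setuptools' build_ext knows how to
# run SWIG on .i sources. The leading underscore follows SWIG's convention:
# the generated Python shim module imports the compiled _lttoolbox binary.
ext = Extension(
    '_lttoolbox',
    sources=['lttoolbox.i', 'wrapper.cpp'],  # placeholder file names
    swig_opts=['-c++'],
)
print(ext.name)  # _lttoolbox
```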
- Now working on the unittests https://github.com/apertium/lttoolbox/tree/master/tests/lt_proc
Huh? I'm more interested in testing the apertium Python library in this repo rather than the wrappers themselves.
Also, I'm guessing your exams are over now? I hope they went well. I'll be expecting daily updates again now :) Looking forward to your next couple PRs so I can try testing things.
@Vaydheesh how's it going? Is it possible for you to split up the PR like I mentioned so that I could take a peek at things?
My college is over now and I can devote all of my time to this project. I have submitted the PR for review in lttoolbox.
Sounds good! Taking a look right now. Can you also amend the PR in this repo?
@Vaydheesh any updates? Don't forget to post daily updates or we'll have to schedule video chats :)
Oops.. Really forgot about the daily updates. Here is what I've been working on.
At first I tried using the autotools port for Windows, but that was taking a lot of time, so I switched over to using simple commands for building the shared library.
MinGW reported an error for to_wstring, which should've been patched a few years ago. Hence I switched over to nuwen MinGW, which supports only 64-bit compilation.

After compiling the lttoolbox wrapper, importing it in Python reported that it could not find the required module. Apparently the SWIG documentation isn't updated regarding the extension of shared libraries for Python on Windows. The required extension is *.pyd.
After successfully compiling the hello world of SWIG, for both 32-bit and 64-bit, I moved back to compiling the lttoolbox wrapper. But I got stuck at a dead end. I've described the issue here (https://pastebin.com/BzH6M7Gm) and tried contacting unhammer on IRC.
In the meantime I was working on implementing the pipe-to-function call. As you had suggested, I was working on a regex to parse the mode files to handle all the corner cases, only to realise that it was already implemented by arghya.
Now I am finishing the task and will update the PR within an hour or two.
Oops.. Really forgot about the daily updates.
Please be more careful on this note :)
I'm not really certain what's going on with the Windows paste you provided. I would try moving as slowly as possible from the hello world example that works to the lttoolbox version, i.e. in baby steps. Figuring out exactly where things start breaking is the key.
Your PR I reviewed looks on the right track. I left some comments. I just landed a bunch of changes in master switching us to the pipenv world and such. Please update your branch. There should be a couple conflicts but nothing serious.
@Vaydheesh any luck?
Trying to find the cause of the import error on Linux. Will switch to Windows after getting things to work on Linux.
This is an issue so that notifications come through nicely.