Import Toolbox Lexicon not working

goodmami commented 4 years ago

Migrated from Trac:

Original: http://lemur.ling.washington.edu/trac/matrix/ticket/809

Reporter: ebender

Created: 02/17/16 18:31:52

Updated: 02/17/16 18:31:52

Keywords:

Notes:

There may be formatting problems from the conversion

Comments on tickets are not migrated! See the original issue.

The Import Toolbox Lexicon functionality, while working during Winter quarter 2015, doesn't seem to be working as of Feb 2016. It could be that something broke, or it could be user error. If the latter, some more informative feedback could definitely help....

olzama commented 3 years ago

Currently, attempting to use Toolbox lexicon import via the web questionnaire results in an "Internal Server error". I suspect this is something to be fixed on @bmgraves side? However, I don't know what a valid input for that webpage (https://matrix.ling.washington.edu/customize/matrix.cgi?subpage=toolbox-import) looks like. @emilymbender do you recall who used Toolbox import successfully last?.. Maybe I could find a sample input file or a sample choices file or something like that?

In the unit tests (produced probably by @goodmami ?) I found the following:

test_toolbox_file = '''
\\_sh 3875.79755931259
\\_DateStampHasFourDigitYear 9810.18751251639

\\lex 5854.08114168001
\\id 2544
\\alt 9782.99945217209
\\alt 6582.93652709364
\\ps 2375.88860545642
\\ge 4078.86139952407
\\gn 9759.45441326154
\\semn 2731.26207749019
\\lg 6061.02519727528
\\hm 7812.27848508745
\\ed 4326.46532533433
\\dt 8537.16299666292
\\qst 4623.35583835525

\\lex 8228.66223825709
\\id 2545
\\ps 3098.78612024583
\\ge 1331.89585962842
\\gn 7980.84854729524
\\semn 6178.46659681572
\\lg 3102.28438655948
\\hm 306.451805607999
\\ed 3596.85670399401
\\dt 4426.19465094733
\\qst 1576.29018631404

-- but I am not sure what this represents?
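For what it's worth, here is a quick way to see what the fixture contains. In a Python string literal `\\` is a single backslash, so each non-empty line is a Toolbox-style field marker (`\lex`, `\id`, `\ps`, ...) followed by a value. The sketch below splits a shortened copy of the fixture into marker/value pairs. That the numeric values are random placeholders rather than real lexical data is my assumption; the test file does not say so.

```python
# Shortened copy of the unit-test fixture quoted above.
test_toolbox_file = '''
\\_sh 3875.79755931259
\\_DateStampHasFourDigitYear 9810.18751251639

\\lex 5854.08114168001
\\id 2544
\\ps 2375.88860545642
'''

# Split every non-empty line into (marker, value) at the first space.
fields = []
for line in test_toolbox_file.splitlines():
    if line.strip():
        marker, _, value = line.partition(' ')
        fields.append((marker, value))

for marker, value in fields:
    print(marker, '->', value)
```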

P.S.: I am actually starting a project for which I would need the functionality, so I would like to fix it.

emilymbender commented 3 years ago

Not sure if this is the most recent instance, but I believe that this is from the choices file associated with Bender et al 2012:

section=toolbox-import
  toolboximportconfig1_idtag=\id
  toolboximportconfig1_glosstag=\ge
  toolboximportconfig1_stemtag=\lex
  toolboximportconfig1_bistemtag=\bip
  toolboximportconfig1_tbpredvalues=gloss
    toolboximportconfig1_importclass1_importlextype=noun1
      toolboximportconfig1_importclass1_toolboxtag1_tbtag=\psrev
      toolboximportconfig1_importclass1_toolboxtag1_tbvalue=n
    toolboximportconfig1_importclass2_importlextype=verb1
      toolboximportconfig1_importclass2_toolboxtag1_tbtag=\psrev
      toolboximportconfig1_importclass2_toolboxtag1_tbvalue=v
      toolboximportconfig1_importclass2_toolboxtag2_tbtag=\val
      toolboximportconfig1_importclass2_toolboxtag2_tbvalue=S-ABS V-S
    toolboximportconfig1_importclass3_importlextype=verb2
      toolboximportconfig1_importclass3_toolboxtag1_tbtag=\psrev
      toolboximportconfig1_importclass3_toolboxtag1_tbvalue=v
      toolboximportconfig1_importclass3_toolboxtag2_tbtag=\val
      toolboximportconfig1_importclass3_toolboxtag2_tbvalue=A-ERG P-ABS V-A,P

olzama commented 3 years ago

Thanks, @emilymbender ! Would it be at all possible to get an entry from the actual toolbox file used in that project?.. If not I suppose I could try to create a fake one.

olzama commented 3 years ago

At any rate, I created a fake small file which I think should work for the spec from the choices above (toolbox format is very simple so I doubt there is much room for mistake there).

The choices spec can be successfully loaded in the questionnaire and the choices can then be modified and saved unless I add a Toolbox file (of any sort). Then I get the 500 Internal Server error (on the UW server side). So, I think first we need to get that out of the way, then see what other issues may exist.

@bmgraves , could you look into this? Repro (see also screenshot below). If you can fix the server issue, I can then proceed with seeing what else may be broken there.

1) open https://matrix.ling.washington.edu/customize/matrix.cgi
2) load the attached choices
3) try adding the attached toolbox file and save/import lexicon/create grammar

choices.txt fake-toolbox.txt

goodmami commented 3 years ago

attempting to use Toolbox lexicon import via the web questionnaire results in an "Internal Server error". I suspect this is something to be fixed on @bmgraves side?

Often, but not always, these internal server errors are when the server code (that is, the Python code behind the customization system) raises an exception and exits abnormally. Brandon could tell you what the error logs say but it's likely a problem for a Matrix developer to fix. Is it possible to run it locally so you can see the stack trace?
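As a rough illustration of how the stack trace could be surfaced when running the code by hand, here is a minimal sketch that wraps an entry point and writes the full traceback to a stream. The name `run_and_report` and the `broken_import` stand-in are hypothetical and not part of the Matrix code; only the standard-library `traceback` module is real.

```python
import io
import sys
import traceback


def run_and_report(handler, stream=sys.stderr):
    """Call handler(); on any exception, write the traceback to stream.

    Returns True if the handler finished normally, False otherwise.
    """
    try:
        handler()
        return True
    except Exception:
        traceback.print_exc(file=stream)
        return False


def broken_import():
    # Hypothetical stand-in for whatever fails behind the upload button.
    raise ValueError("boom")


# Capture the report in memory so it can be inspected or logged.
report = io.StringIO()
ok = run_and_report(broken_import, report)
print(report.getvalue())
```

A web server typically only shows "500 Internal Server Error" for the same failure, which is why seeing the traceback directly tends to be more useful than the Apache log.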

olzama commented 3 years ago

Is it possible to run it locally so you can see the stack trace?

Not for me. I don't know how to set up Matrix instances on localhost (tried to learn but failed). I mean, I could try again but it would be very inefficient, I think. There are so many different versions of everything (apache, OS...) that any documentation will likely turn out to be obsolete. I found it impossible to figure out what to consult, when I last tried...

olzama commented 3 years ago

Often, but not always, these internal server errors are when the server code (that is, the Python code behind the customization system) raises an exception and exits abnormally.

Come to think of it, the first thing to do is look for any encoding and/or file-opening related issues, in the code for toolbox import. I might try to do that tomorrow.

goodmami commented 3 years ago

Also if you have the toolbox file and can just run customize from the command line, you might not need to set up the full questionnaire.

bmgraves commented 3 years ago

Olga,

At a glance, looking at the logs, this is a similar error to the encoding issues we were running into with the choices files, and may be a byproduct of the Python 3 conversion, but I can currently only look at the Apache errors, which are vague.

For the choices file, I was getting more specific Python errors by running the code on the server using the instructions in CONTRIBUTING.md:

    export REQUEST_METHOD=GET
    export QUERY_STRING="customize=x&delivery=tgz"
    export HTTP_COOKIE=session=000

I'm not familiar enough with Matrix to hunt down the correct query_string to do the same thing with the toolbox file. If you can provide me with that information, I should be able to get a more specific error for this.
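The same environment can be set up from Python, which may be handier for trying several query strings in a row. This is only a sketch: the three variable names and values come from the commands above, while `cgi_environment`, `run_cgi` and the script path passed to it are hypothetical. Whether the Toolbox upload can be triggered through a GET query string at all is not something this thread establishes.

```python
import os
import subprocess
import sys


def cgi_environment(query_string, cookie="session=000", method="GET"):
    """Copy the current environment and add the CGI variables used above."""
    env = dict(os.environ)
    env["REQUEST_METHOD"] = method
    env["QUERY_STRING"] = query_string
    env["HTTP_COOKIE"] = cookie
    return env


def run_cgi(script_path, query_string):
    """Run a CGI script (path is an assumption) and capture stdout/stderr."""
    return subprocess.run(
        [sys.executable, script_path],
        env=cgi_environment(query_string),
        capture_output=True,
        text=True,
    )


env = cgi_environment("customize=x&delivery=tgz")
print(env["REQUEST_METHOD"], env["QUERY_STRING"], env["HTTP_COOKIE"])
```

With that in place, `run_cgi(path_to_script, "customize=x&delivery=tgz").stderr` would hold any Python traceback the script prints.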

On Tue, Jun 29, 2021 at 6:50 AM Michael Wayne Goodman < @.***> wrote:

Also if you have the toolbox file and can just run customize from the command line, you might not need to set up the full questionnaire.


olzama commented 3 years ago

I updated the two open() statements that I found in toolboximport.py to specify utf-8 encoding but that did not help. For now I am out of ideas. @goodmami , do you remember how to run toolbox import via command line? I have the choices file and the toolbox file but what is the command? In particular, where does the toolbox file go? (I couldn't find any documentation for that).
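To rule encoding in or out for a particular Toolbox file, one option is to decode it line by line and list the lines that are not valid UTF-8. The sketch below does that on raw bytes; `find_undecodable_lines` is a made-up helper, not something in toolboximport.py, and the demo file is generated on the fly.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, error_message) for each line that fails to decode."""
    problems = []
    with open(path, "rb") as handle:  # read raw bytes and decode by hand
        for lineno, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                problems.append((lineno, str(err)))
    return problems


# Demo on a throwaway file whose second line is Latin-1, not UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\\id 1\n\\lex caf\xe9\n")
bad_lines = find_undecodable_lines(tmp.name)
os.remove(tmp.name)
print(bad_lines)
```

An empty result would suggest the file itself is fine and the problem lies elsewhere, for example in how the uploaded bytes reach the import code.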

olzama commented 3 years ago

I guess I found this:

    elif args[0] == 'import-lex':
        import gmcs.linglib.toolboximport
        gmcs.linglib.toolboximport.import_toolbox_lexicon(args[1])

olzama commented 3 years ago

Oh. Ok what I am missing is the choices file syntax for toolbox file config. I don't know what it is (and to have to reconstruct from python code sounds like maybe a bit much at the moment) and I can't get it from the questionnaire because it's broken.

@emilymbender , in that choices file that you gave me a piece from above, is there also a "toolboxfile" portion? That could be super helpful.

olzama commented 3 years ago

(I opened a new issue for the questionnaire. Let's continue discussing any non-questionnaire issues related to lexicon import here.)

emilymbender commented 3 years ago

The uploading itself doesn't leave any trace in the choices file, or rather, that action leads the system to populate the choices file with a whole bunch of lexical entries but doesn't record the file name or similar itself.

olzama commented 3 years ago

Then @goodmami might be wrong about the option to test this via command line? No way to import the lexicon without the questionnaire?

emilymbender commented 3 years ago

Possibly not -- I think the code behind the 'upload' button is independent of customize.py. So it would require mimicking the upload functionality. (But this is all from memory; I'm not looking at the code just now.)

olzama commented 3 years ago

From the code, it looks like it maybe should be possible to have the file path as part of choices:

[...]
    for config in choices['toolboximportconfig']:
        [...]
        for tbfile in config.get('toolboxfile'):
            if not tbfile.get('tbfilename'):
                continue
            tblex = open(tbfile.get('tbfilename'), 'r')
            [...]
            tblex.close()

So maybe at this point, I need to just reconstruct the choices syntax from this code :)

Should be maybe something like: toolboximportconfig_toolboxfile_tbfilename...
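To make that guess concrete, here is a minimal sketch of how such a key could be read. It assumes, by analogy with `importclass1` and `toolboxtag1` in the choices fragment Emily posted, that a numbered segment names an item in a list and an unnumbered final segment names the attribute. The function `split_choice_key` is made up for illustration and is not the parser the Matrix actually uses.

```python
import re


def split_choice_key(key):
    """Split a choices key into (name, index) pairs.

    Numbered segments such as 'toolboxfile1' become ('toolboxfile', 1);
    unnumbered segments such as 'tbfilename' become ('tbfilename', None).
    """
    parts = []
    for segment in key.split('_'):
        match = re.fullmatch(r'(.*?)(\d+)', segment)
        if match:
            parts.append((match.group(1), int(match.group(2))))
        else:
            parts.append((segment, None))
    return parts


# A key of the guessed shape, with list indices added.
path = split_choice_key('toolboximportconfig1_toolboxfile1_tbfilename')
print(path)
```

Read this way, the key lines up with the loop above: `choices['toolboximportconfig']` yields the first config, `config.get('toolboxfile')` yields its first file, and `tbfile.get('tbfilename')` is the path.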

bmgraves commented 3 years ago

That may be the way to go, but I have been looking through some code and I think I see something that may help. Let me run a couple more tests, and I will get back with hopefully some more information at least.

On Tue, Jun 29, 2021 at 1:00 PM Olga Zamaraeva @.***> wrote:

From the code, it looks like it maybe should be possible to have the file path as part of choices:

for config in choices['toolboximportconfig']:
    idtag = config.get('idtag')
    stemtag = config.get('stemtag')
    bistemtag = config.get('bistemtag')
    glosstag = config.get('glosstag')
    predchoice = config.get('tbpredvalues')
    lexclasses = config.get('importclass')
    starttag = config.get('starttag')
    # FIXME: Surely need a path here.  Also, the current
    # questionnaire allows multiple Toolbox files, need
    # to iterate through them.

    for tbfile in config.get('toolboxfile'):
        if not tbfile.get('tbfilename'):
            continue
        tb_lines = None
        tblex = open(tbfile.get('tbfilename'), 'r')
        tbentry = {}
        # List of values of the bistemtag field.
        affixes = []

        # Go through lexicon file only once, as it could
        # be quite large.  For each entry in the lexicon,
        # iterate through the lexclasses to see if it matches
        # any of them, and if so, import.

        for line in tblex.readlines():
            # Assume that the Toolbox tags may occur in any order
            # within an entry, but that they never repeat within
            # an entry.  In other words, when we see the same tag
            # again, that means we've hit a new entry, and we
            # should process the previous one then reset tbentry.
            words = line.split()
            if words:
                if words[0] == starttag:
                    affixes = process_tb_entry(
                        tbentry, lexclasses, stemtag, bistemtag, glosstag, predchoice, choices, affixes, form_data, form_data_entries)
                    if 'imported-entry'+str(form_data_entries)+'_orth' in form_data:
                        form_data_entries += 1
                    tbentry = {}
                    tbentries += 1
                tbentry[words[0]] = ' '.join(words[1:])

        tblex.close()

So maybe at this point, I need to just reconstruct the choices syntax from this code :)


olzama commented 3 years ago

So I am currently running the import code directly, by using the following choices file (scroll to see the toolbox section):

[minimal grammar here...]

section=toolbox-import
  toolboximportconfig1_idtag=\id
  toolboximportconfig1_starttag=\id
  toolboximportconfig1_glosstag=\ge
  toolboximportconfig1_stemtag=\lex
  toolboximportconfig1_tbpredvalues=gloss
    toolboximportconfig1_toolboxfile1_tbfilename=/Users/olzama/Desktop/fake-toolbox.txt
    toolboximportconfig1_importclass1_importlextype=noun1
      toolboximportconfig1_importclass1_toolboxtag1_tbtag=\psrev
      toolboximportconfig1_importclass1_toolboxtag1_tbvalue=n

section=test-sentences

section=gen-options

section=ToolboxLexicon

and the following mock toolbox file:

\id 1
\lex nounlexeme1
\ge noungloss1
\psrev n
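For reference, this is how I read the grouping logic of the import loop quoted earlier in the thread, reduced to a sketch: a new entry starts whenever the start tag is seen, and each line contributes one tag/value pair. The function `group_toolbox_entries` is hypothetical. One difference is deliberate: the sketch also flushes the final entry after the loop, whereas the quoted excerpt only processes an entry when the next start tag arrives. Whether the real code handles the last entry somewhere outside that excerpt is not visible here, but it matters for a one-entry file like this one.

```python
def group_toolbox_entries(lines, starttag):
    """Group Toolbox lines into dicts, starting a new dict at each starttag."""
    entries = []
    current = {}
    for line in lines:
        words = line.split()
        if not words:
            continue
        if words[0] == starttag and current:
            entries.append(current)  # previous entry is complete
            current = {}
        current[words[0]] = ' '.join(words[1:])
    if current:
        entries.append(current)  # flush the last entry in the file
    return entries


mock_toolbox = r"""\id 1
\lex nounlexeme1
\ge noungloss1
\psrev n
""".splitlines()

entries = group_toolbox_entries(mock_toolbox, r'\id')
print(entries)
```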

Now, I don't know if I am using correct input, that's a big issue, so if anyone notices anything in my input, that could be sufficient to fix everything?

As it is, the code opens the toolbox file properly, so that's not a problem; the code to get a file path from the choices was already in place. However, after that the program goes into an infinite loop, which may be the issue that was originally reported in this ticket. The infinite loop is due to the program never being able to populate any entry and then trying to find something in the existing form.data that is not there. I couldn't quite understand why searching in the form.data results in an infinite loop; maybe that's the bit that's impossible without the web questionnaire?

The infinite loop happens here:

    def __getitem__(self, key):
        if key in self.data:
            return self.data[key]
        else:
            self.data[key] = FormInfo(key, None)
            return self.data[key]

It does sound to me like this is a web-intended behaviour (just stay there until the user enters something?)

Anyway, without really knowing what the correct input is, I may be on a wild goose chase...

bmgraves commented 3 years ago

Looks like this is more promising than the wrong path I've been trying to follow. I'll switch gears and try to take a look at this bit.


olzama commented 3 years ago

Looks like this is more promising than the wrong path I've been trying to follow

hmm you think?.. It seems to me that this must be separate from the "Internal server" error, because that one happens right away when you try to use the "Import Lexicon" button. As for what I wrote above -- I mean, I am not even sure this is a legal way to use the code, so... Hard to tell, without having an example of definitely correct input and an example of definitely correct function calling.

bmgraves commented 3 years ago

I'm still troubleshooting the 500 error. But so far I haven't managed to find the place in the code where the break is occurring. Hopefully I'll have some sort of progress before too long.


dantiston commented 3 years ago

The infinite loop happens here:

    def __getitem__(self, key):
        if key in self.data:
            return self.data[key]
        else:
            self.data[key] = FormInfo(key, None)
            return self.data[key]

This looks like normal cache behavior, a la defaultdict. It seems weird that this would be causing an issue?
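One way this lookup could still loop forever, offered as a guess rather than a diagnosis: the import loop quoted earlier tests `'imported-entry'+str(form_data_entries)+'_orth' in form_data`. If the form-data class defines `__getitem__` but neither `__contains__` nor `__iter__`, Python evaluates `in` by calling `__getitem__(0)`, `__getitem__(1)`, and so on until an IndexError is raised, and an auto-filling `__getitem__` never raises one. The classes below are stand-ins written for this sketch; only the `__getitem__` body is copied from the thread, and the guard subclass exists purely so the demo terminates.

```python
class FormInfo:
    """Stand-in for the real FormInfo; only the constructor shape is kept."""
    def __init__(self, key, value):
        self.key = key
        self.value = value


class FormData:
    """Stand-in with the auto-filling lookup quoted in the thread."""
    def __init__(self):
        self.data = {}

    def __getitem__(self, key):
        if key in self.data:
            return self.data[key]
        else:
            self.data[key] = FormInfo(key, None)
            return self.data[key]


class GuardedFormData(FormData):
    """Same lookup, but gives up after max_lookups so the demo can finish."""
    def __init__(self, max_lookups):
        super().__init__()
        self.max_lookups = max_lookups
        self.lookups = 0

    def __getitem__(self, key):
        self.lookups += 1
        if self.lookups > self.max_lookups:
            raise RuntimeError("still looking after %d lookups" % self.max_lookups)
        return super().__getitem__(key)


class FixedFormData(FormData):
    """Possible fix: answer membership tests from the dict directly."""
    def __contains__(self, key):
        return key in self.data


def membership_test_terminates(form, key):
    """Return False if `key in form` hits the lookup guard."""
    try:
        key in form
    except RuntimeError:
        return False
    return True


guarded = GuardedFormData(max_lookups=1000)
print(membership_test_terminates(guarded, 'imported-entry1_orth'))
print(sorted(guarded.data)[:3])  # keys that the membership test created

fixed = FixedFormData()
print('imported-entry1_orth' in fixed)
```

If the real class does define `__contains__` or `__iter__`, this explanation does not apply and the loop must come from somewhere else.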

olzama commented 3 years ago

This looks like normal cache behavior, a la defaultdict. It seems weird that this would be causing an issue?

Yes, that's why I think I am simply calling this code improperly. But without an example of proper input or documentation, it's hard to figure out what to do... Maybe once we figure out (separately) what is causing the server issue, we'll be able to use the questionnaire to figure out what the proper input is.

olzama commented 3 years ago

I'm still troubleshooting the 500 error. But so far I haven't managed to find the place in the code where the break is occurring. Hopefully I'll have some sort of progress before too long.

I'd be more than happy to try and provide you with more input, @bmgraves , if I knew how :).

goodmami commented 3 years ago

@goodmami , do you remember how to run toolbox import via command line? I have the choices file and the toolbox file but what is the command? In particular, where does the toolbox file go? (I couldn't find any documentation for that).

No, and I don't know if it's possible or not. I just meant if you could do it from the command line it would be more direct than setting up a local server.

Also, note that the original issue migrated from Trac was filed by Emily in 2016, well before the Python3 conversion, so it sounds like there were other issues involved.

olzama commented 3 years ago

Also, note that the original issue migrated from Trac was filed by Emily in 2016, well before the Python3 conversion, so it sounds like there were other issues involved.

Oh for sure. It's just that without the questionnaire working, I can't figure out what these issues might be because I have no idea how to call the code...

delph-in / matrix

Import Toolbox Lexicon not working #277
