'Internal Server Error' upon downloading minimal pair CSV

ocrasborn commented 3 years ago

When I generate a list of minimal pairs from menu Analysis > Minimal Pairs, this takes some time. Downloading a CSV then takes so long for NGT that I get an 'internal server error'; for Kata Kolok it does download after... 15-30 seconds perhaps. Is there some time-out setting that can easily be changed?

susanodd commented 3 years ago

Oh, I see. The code is taking the tuple as a value: (0,1,1). Then there are 3 other tuples that differ in one field, and it determines which field is different.

susanodd commented 3 years ago

Change the example to compare the focus gloss (0,1,1) to the minimal pair candidates (1,1,1), (0,0,1) and (0,1,0), to figure out which field they differ on. Then the "minimal pairs" are: (0,1,1) is a minimal pair to (1,1,1) on field 1, etc.

susanodd commented 3 years ago

It knows they are minimal pairs, and it's double checking because the display only has to know which field is the one they differ on.

The big query only calculates "other glosses that differ on exactly one of the minimal pairs fields". So the big query doesn't know which field is different.

susanodd commented 3 years ago

I forgot to say: the new code basically normalizes the focus gloss and the candidate minimal-pair glosses by turning them into tuples, ordered as in the minimal pairs fields constant and containing only those fields. The zip with the same list of fields is needed to retrieve the actual name of the field that differs.
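
The normalization described above can be sketched roughly as follows. This is an illustration, not the Signbank code: the field names are hypothetical stand-ins for the real `MINIMAL_PAIRS_FIELDS` setting, glosses are plain dictionaries rather than model objects, and `normalize` and `differing_field` are made-up helper names.

```python
# Hypothetical field list; the real MINIMAL_PAIRS_FIELDS setting holds the project's own names.
MINIMAL_PAIRS_FIELDS = ['handedness', 'domhndsh', 'locprim']


def normalize(gloss):
    """Return the gloss's values as a tuple ordered like MINIMAL_PAIRS_FIELDS."""
    return tuple(gloss.get(field) for field in MINIMAL_PAIRS_FIELDS)


def differing_field(focus, candidate):
    """Return the name of the first field on which the two glosses differ, or None."""
    for field, focus_value, candidate_value in zip(
            MINIMAL_PAIRS_FIELDS, normalize(focus), normalize(candidate)):
        if focus_value != candidate_value:
            return field
    return None


# Extra keys such as 'idgloss' are ignored because only the listed fields are read.
focus = {'idgloss': 'A', 'handedness': 0, 'domhndsh': 1, 'locprim': 1}
candidate = {'idgloss': 'B', 'handedness': 0, 'domhndsh': 0, 'locprim': 1}
print(differing_field(focus, candidate))
```

Because both tuples and the field list share one ordering, the position of the mismatch is enough to recover the field name; if the ordering of either side changed, the wrong name would be reported.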

Woseseltops commented 3 years ago

Hey @susanodd, sorry for making you wait another 2 weeks for my reply; became a father for the second time in the meantime ;).

It knows they are minimal pairs, and it's double checking because the display only has to know which field is the one they differ on.

So what you are saying is that the approach with the zipping is not actually finding the minimal pairs, but tries to figure out what exactly makes a minimal pair a minimal pair? So my simplified example should be more like this:

   FIELD_NAMES = ['field1', 'field2', 'field3']
   focus_gloss_values = (0, 1, 1)
   candidates = [(1, 1, 1), (0, 0, 1), (0, 1, 0)]

   for candidate_values in candidates:
       # Position i of each tuple belongs to FIELD_NAMES[i], so zip keeps name and values aligned
       for field_name, focus_gloss_value, candidate_value in zip(FIELD_NAMES, focus_gloss_values, candidate_values):
           if focus_gloss_value != candidate_value:
               print(focus_gloss_values, 'differs from', candidate_values, 'on field', field_name)

Is this better? And the real searching is done in this code, then, right?

https://github.com/Signbank/Global-signbank/blob/19d23b086b7b01af50c1db890d89733d42fca2bd/signbank/dictionary/adminviews.py#L2803-L2880

I feel like I'm starting to understand!

Woseseltops commented 3 years ago

As a sidenote, reading the code for this inspired me to create this issue: #759

susanodd commented 3 years ago

The minimal pairs code evolved over a long period of time. In the beginning, minimal pairs were only displayed in the panel of a single gloss. That was the focus gloss. The big query compares that focus gloss to all of the other glosses. It uses the values of the fields of the focus gloss as constants in the query, compares the other glosses to them, and counts how many fields are different or the same. As a single query, that can execute fast. But it is using the values of the focus gloss. (So it is not doing logic between two columns, a <> b; it is comparing a constant value to a column, v <> b, and it looks at the whole "row" in the database.) Some things are done to limit what is compared. If the handedness or the dominant handshape is not filled in, the gloss is not considered. (Otherwise there are numerous results that don't have phonology filled in.) (So as more phonology gets filled in, more minimal pairs can be found.) (I added some new glosses to the tstMH dataset so we have minimal pairs.)
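
As a rough sketch of what that query computes, here is the same counting logic in plain Python rather than as a database query. The field names, the dictionary representation and the function name are assumptions made for illustration; the real implementation runs in the database against Gloss rows.

```python
FIELDS = ['handedness', 'domhndsh', 'locprim']   # illustrative field names
REQUIRED = ['handedness', 'domhndsh']            # must be filled in to be considered


def minimal_pair_candidates(focus, others):
    """Return the glosses that differ from the focus gloss on exactly one field."""
    focus_values = [focus[f] for f in FIELDS]    # constants for the whole scan
    result = []
    for gloss in others:
        # Skip glosses whose basic phonology is not filled in.
        if any(gloss.get(f) is None for f in REQUIRED):
            continue
        differences = sum(
            1 for f, v in zip(FIELDS, focus_values) if gloss.get(f) != v)
        if differences == 1:
            result.append(gloss)
    return result


focus = {'name': 'F', 'handedness': 0, 'domhndsh': 1, 'locprim': 1}
others = [
    {'name': 'A', 'handedness': 1, 'domhndsh': 1, 'locprim': 1},     # one difference
    {'name': 'B', 'handedness': 1, 'domhndsh': 0, 'locprim': 1},     # two differences
    {'name': 'C', 'handedness': None, 'domhndsh': 1, 'locprim': 1},  # phonology missing
    {'name': 'D', 'handedness': 0, 'domhndsh': 1, 'locprim': 1},     # identical
]
print([g['name'] for g in minimal_pair_candidates(focus, others)])
```

Note that the result only says which glosses qualify, not which field differs; that is why a second pass over each pair is still needed for display.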

Yes, the new changes are to figure out which field the minimal pair is based on. In the beginning, this didn't really matter because it was just one gloss's minimal pairs. But for exporting a whole dataset, it has to do this lots of times.

I added print statements in the original code to figure out what it was actually generating. It had been years since I last worked on this code! It does the "zip" trick in order to retrieve the proper field name of the field that is different. This depends on the "tuples" adhering to exactly the same ordering as the setting MINIMAL_PAIRS_FIELDS. So the code just looks for the field that differs, then looks it up backwards to find its label.

This used to be done with dictionaries, for all the field choices etc., to make sure the human-readable value was available, with random access via the dictionary. This became an excessive amount of computation when we only needed a human-readable value for the fields that are different.
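
The saving can be illustrated with a small sketch. It is not the project's earlier code: `human_label` is a hypothetical stand-in for whatever lookup turns a stored value into a human-readable choice, the field names are illustrative, and the counter exists only to make the difference in work visible.

```python
FIELDS = ['handedness', 'domhndsh', 'locprim']   # illustrative field names
stats = {'lookups': 0}


def human_label(field, value):
    """Hypothetical stand-in for an expensive field-choice lookup."""
    stats['lookups'] += 1
    return f'{field}={value}'


def eager(focus, candidate):
    """Dictionary approach: translate every field of both glosses first."""
    focus_labels = {f: human_label(f, v) for f, v in zip(FIELDS, focus)}
    cand_labels = {f: human_label(f, v) for f, v in zip(FIELDS, candidate)}
    return [(f, focus_labels[f], cand_labels[f])
            for f in FIELDS if focus_labels[f] != cand_labels[f]]


def lazy(focus, candidate):
    """Tuple approach: compare raw values, translate only the fields that differ."""
    return [(f, human_label(f, a), human_label(f, b))
            for f, a, b in zip(FIELDS, focus, candidate) if a != b]


eager_result = eager((0, 1, 1), (0, 0, 1))
eager_lookups = stats['lookups']
stats['lookups'] = 0
lazy_result = lazy((0, 1, 1), (0, 0, 1))
lazy_lookups = stats['lookups']
print(eager_lookups, lazy_lookups)
```

Both functions report the same differing field, but the second one only pays for the labels it actually displays, which matters when the comparison is repeated for every pair in a dataset export.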

This is the part that computes the minimal pairs objects:

https://github.com/Signbank/Global-signbank/blob/96640af57c3e6850bf00d3c367f2095d1fcc1c9d/signbank/dictionary/models.py#L1206-L1253

susanodd commented 3 years ago

@Woseseltops the code that you include above is the "search query" on minimal pairs. It restricts the objects that will be shown in the list. It does not compute the minimal pairs. On the minimal pairs analysis page, all objects are shown, and for each, if it has minimal pairs, those are shown in rows. The code in adminviews get_queryset hasn't computed any minimal pairs. On the minimal pairs analysis page (i.e., not the export to CSV button), it shows a page of objects, and the minimal pairs for just the objects on that page are generated by Ajax calls.

susanodd commented 3 years ago

On the MP page, the query part is called a filter (on the focus gloss). Here is a (local) search on strong hand B on dataset tstMH, followed by an MP filter on strong hand B. So minimal pairs are only computed (by Ajax calls) for the glosses with strong hand B.

[Image: tstMH-stronghand-B]

[Image: tstMH-minimalpairs-filter-stronghand-B]

susanodd commented 3 years ago

Those are example glosses borrowed from Kata Kolok to use as examples in tstMH. I also put them on Global. We previously had zero minimal pairs in tstMH.

Woseseltops commented 3 years ago

Okay @susanodd, if I understand your response correctly, you are saying that I picked the incorrect code snippet, but other than that I understand things correctly. In that case, what I said earlier...

if there is an optimization that makes the code 5 times as fast AND allows you to delete like 75% of the lines that's indeed a great discovery

... is indeed true. Congratulations on this breakthrough! I'll accept and merge the pull request!

vanlummelhuizen commented 2 years ago

@Woseseltops @susanodd Downloading a minimal-pairs csv for NGT does not result in an error or timeout anymore. Can we close this issue?

susanodd commented 2 years ago

Yes, it can be closed.

Signbank / Global-signbank

'Internal Server Error' upon downloading minimal pair CSV #743