Signbank / Global-signbank

An online sign dictionary and sign database management system for research purposes. Developed originally by Steve Cassidy. This repo is a fork for the Dutch version, previously called 'NGT-Signbank'.
http://signbank.cls.ru.nl
BSD 3-Clause "New" or "Revised" License

'Internal Server Error' upon downloading minimal pair CSV #743

Closed ocrasborn closed 2 years ago

ocrasborn commented 3 years ago

When I generate a list of minimal pairs from menu Analysis > Minimal Pairs, this takes some time. Downloading a CSV then takes so long for NGT that I get an 'internal server error'; for Kata Kolok it does download after... 15-30 seconds perhaps. Is there some time-out setting that can easily be changed?

Woseseltops commented 3 years ago

If @susanodd hasn't looked into it by then, I will look into this on Thursday the 29th.

My first steps would be:

susanodd commented 3 years ago

@Woseseltops, I finished the improvements to Admin and have pushed them to branch admin_issues_739_682

I can look at this issue too now. But you are welcome to work on it! I recall we had some similar issue a while back, but with importing big files.

Woseseltops commented 3 years ago

Thanks @susanodd I indeed already started this morning!

What I figured out is that the problem is in the function minimalpairs_focusgloss in tools.py. This function takes 0.25-0.5 seconds or so, and is called thousands of times in the situation @ocrasborn describes. It takes so long because it takes a gloss and then compares it to a (potentially) large collection of other glosses.

I see two approaches to solving this:

The first one is to 'make the wait more bearable' and improve the user experience. We could create an 'export task' with its own ID and detail page that the user can go back to, to check whether the file is already done and available for download. We could even have the application give updates on its progress, as we know really well how many glosses are already done and how many should be done in total.

The second one is to optimize the whole process, which comes down to decreasing the number of glosses to compare to. If I understand things correctly, @susanodd has already done some optimization inside the minimalpairs_objects() function on the gloss object. An optimization I was thinking about is something along these lines (see the sketch after the list):

  1. Select one field, and divide all glosses into a group where the value of the field is the same as the focus gloss (group A), and a group where it is different (group B). Filtering glosses like this should be fast. I expect group A to be small and group B to be large.
  2. All glosses in group B already have one thing different. If they also have a second thing different, they can be excluded. Go over each other field and filter group B so you only keep the ones that have the same value. The result should be a really small group.
  3. Then really compare the focus gloss to the remaining glosses (group A and small group B) one by one.
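A rough sketch of this filtering idea in plain Python, with made-up field names and gloss dicts; the real implementation would use Django queryset filters, and this is only a sketch of the pre-selection, not Signbank's actual code:

def candidate_minimal_pairs(focus, glosses, fields):
    split_field = fields[0]
    # Step 1: partition on one field.
    group_a = [g for g in glosses if g[split_field] == focus[split_field]]
    group_b = [g for g in glosses if g[split_field] != focus[split_field]]
    # Step 2: glosses in group B already differ on split_field, so they can
    # only be minimal pairs if every other field matches the focus gloss.
    group_b = [g for g in group_b
               if all(g[f] == focus[f] for f in fields if f != split_field)]
    # Step 3: only this (hopefully small) remainder needs the full comparison.
    return group_a + group_b

focus = {'handedness': 1, 'location': 2, 'movement': 0}
others = [{'handedness': 1, 'location': 2, 'movement': 1},   # kept (group A)
          {'handedness': 2, 'location': 2, 'movement': 0},   # kept (group B)
          {'handedness': 2, 'location': 3, 'movement': 1}]   # excluded early
print(candidate_minimal_pairs(focus, others, ['handedness', 'location', 'movement']))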

What do you think @susanodd ?

susanodd commented 3 years ago

The first approach seems more straightforward.

I know the current computation sums up, for each other gloss, how many fields differ from the focus gloss. It also takes into account whether a field is empty. The computation is done by a database query constructed in this method: https://github.com/Signbank/Global-signbank/blob/b0e434d35621b5cdad018f92b5ddb17cd386693f/signbank/dictionary/models.py#L1287

Could it be that the size of the query that this is done on is too big? Is that what you are thinking of? Just do it on smaller partitions instead of all glosses?

Except that, to exclude a set of other glosses, they should differ on at least two fields; then don't consider them further.

ocrasborn commented 3 years ago

Hi @susanodd @Woseseltops : It's vital for the user that all glosses are compared to all glosses. But the speed is not vital: it's perfectly fine to have this generated offline (say, overnight), and then made available for download by the CSV button. Or, alternatively, to have it generated and then presented as a download-link, while in the meantime the user can continue working. (Is it the server that chokes on the long computation, or is it the browser? I'm assuming the latter, hence my suggestions.)

ocrasborn commented 3 years ago

(Incidentally, I have been enjoying the minimal pair functions that now exist for NGT quite a bit for looking up a set of signs, both through the Analysis menu and through the Related Signs tab -- just to let you know that we're actually using this regularly in the interface. :-) I don't often use the CSV download, which is why I only encountered this issue now.)

Woseseltops commented 3 years ago

It's vital for the user that all glosses are compared to all glosses

I understand, @ocrasborn ! Sorry if I'm being unclear; any optimization should not lead to different results. When I say 'decrease the number of glosses to compare to' I mean to exclude them in an earlier stage because they can never be a minimal pair.

Is it the server that chokes on the long computation, or is it the browser? I'm assuming the latter, hence my suggestions.

It's the interaction between the two: it's the server that takes so long, because there is so much work to do, and as a result the browser at some point (incorrectly) stops believing that anything will come at all.

just to let you know that we're actually using this regularly in the interface. :-) The CSV download I don't often use, that's why I only encountered this issue now

Indeed good to know what functionality is used and what isn't; helps us prioritize!

Could it be that the size of the query that this is done on is too big? Is that what you are thinking of?

Not 100% sure, @susanodd . I assume it is not so much the query, but the fact that you iterate over the result of the query.

Except to exclude a set of other glosses they should differ on at least two fields. Then don't consider them further.

Yes, that was the direction I was thinking of with my three steps above; I guess with a few basic queries you can quickly exclude a large group of glosses because they have at least two differing values. The real fine-grained search can then be done on the small group of glosses that you did not exclude. I don't fully understand your current approach, so you might already be doing that.

Woseseltops commented 3 years ago

In any case, you both seem to like the UI improvement approach, so let's go for that in this issue. Proposal:

Phase 1:

Phase 2:

I'll unassign myself so you can pick it up if you want @susanodd !

ocrasborn commented 3 years ago

I understand, @ocrasborn ! Sorry if I'm being unclear; any optimization should not lead to different results. When I say 'decrease the number of glosses to compare to' I mean to exclude them in an earlier stage because they can never be a minimal pair.

Thanks for clarifying @Woseseltops !

susanodd commented 3 years ago

I'll look and see what happens "with" the found minimal pairs objects. The method that finds them does a query for the focus gloss. The analysis display generates ajax minimal pairs queries for the objects shown on the page, so objects that do not have minimal pairs are also displayed (but without any). It could be that the generation of all the ajax calls somehow interferes with (or slows down) the CSV export, which does this for all objects. The only place so many calls would be done is in exporting a csv; the other place minimal pairs are displayed is while inspecting a focus gloss.

Some additional "computation" is done for displaying the rows. The values shown and the field names get translated (by Django), and there is a check for displaying Booleans as Yes/No as well. But those should be minimal.

I used to wonder whether having all of the duplicate handshapes (all 50 duplicates with machine value 2) was somehow slowing down the system. But those have been deleted.

susanodd commented 3 years ago

I may have fixed this. I changed the evaluation order for writing the csv file. The computation was done per row inside the csv writer. I made a separate loop to store the result rows in a list before creating the csv; the csv is then generated by iterating over those list rows. On my local server, it took 15 minutes to generate the NGT minimal pairs. That's when the prompt to save the file appeared.

So the actual call to create the csv writer now appears in the code after the rows have been created (by iterating over the big query result displayed on the pages).

(It wasn't completely obvious the code was doing this because the functionality is spread over multiple functions for purposes of code readability.)
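For clarity, a minimal sketch of the reordering described above, with illustrative names rather than the actual Signbank functions:

import csv

def export_minimal_pairs_csv(response, focus_glosses):
    # First pass: the expensive minimal-pairs computation, collecting plain
    # row data in a list.
    rows = []
    for gloss in focus_glosses:
        for other, field in gloss.minimal_pairs():   # hypothetical helper
            rows.append([gloss.idgloss, other.idgloss, field])
    # Second pass: only now create the csv writer and stream out the
    # already-computed rows.
    writer = csv.writer(response)
    writer.writerow(['Focus Gloss', 'Minimal Pair', 'Differing Field'])
    for row in rows:
        writer.writerow(row)
    return response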

susanodd commented 3 years ago

This has been pushed to branch lemmas_minimalpairs_733_743

Woseseltops commented 3 years ago

Ah, great that you found another place for optimization, @susanodd ! I still think that a response time of 15 minutes might confuse users and would require some user experience improvements to ease the pain. What do you think, @ocrasborn ?

ocrasborn commented 3 years ago

Yes, indeed @Woseseltops / @susanodd . Some sites nowadays work with generated download links; that might be an idea. Do you know what I mean? Users first get a message saying 'the download is being prepared, you will find it under X [page, panel, wherever]', so the user can continue working on the site, and at some point there's a download link on a certain page for them to click. Is that an option?

ocrasborn commented 3 years ago

@susanodd , would you be able to generate the file for NGT for us and send it to me by email? One of our colleagues needs it to progress with her work.

susanodd commented 3 years ago

Okay, I need to download the newest database first. Although I don't have a way to get the file onto my telephone.

susanodd commented 3 years ago

Okay, I sent you an email with the file location.

susanodd commented 3 years ago

I'm not entirely sure that doing it by way of a link would necessarily improve speed. It would probably need to be done during the night rather than just offering a link: the computation itself would need to be delayed, not just done in the background while the user continues working. On my local machine, it bogs down the machine. I go and drink coffee.

This example is exporting the minimal pairs for the entire NGT database. So there would need to be some indication of "how big" the expected set would be before it causes problems. At what size does it become inefficient? If I only run a small query, it's not a problem.

ocrasborn commented 3 years ago

@susanodd : for me, it would be perfectly acceptable if users can download a file that has been generated overnight -- but that would mean that this needs to be done for all datasets every night.

vanlummelhuizen commented 3 years ago

@susanodd Can you think of a way that calculating minimal pairs and similar things would only recompute what has changed, and store what has not changed? This would make recalculating stuff much quicker, right?

susanodd commented 3 years ago

Where would this be stored? Something like the generate_choice_list_table? (I'm not sure whether we are using this anymore. This was from before Ajax calls were introduced to dynamically put frequencies in the pull-down lists. The code below is not using the field_choice_category either. But something like this idea, and then update it as required?)

https://github.com/Signbank/Global-signbank/blob/c419f836e32ee3c5d84901346ad5dd4f5e121a25/signbank/dictionary/models.py#L1836-L1856

susanodd commented 3 years ago

We also have methods to compute identical phonology. Maybe something like partitioning glosses that way.

susanodd commented 3 years ago

Ouch! There's a second one!

https://github.com/Signbank/Global-signbank/blob/c419f836e32ee3c5d84901346ad5dd4f5e121a25/signbank/dictionary/models.py#L1908

https://github.com/Signbank/Global-signbank/blob/c419f836e32ee3c5d84901346ad5dd4f5e121a25/signbank/dictionary/models.py#L1981

susanodd commented 3 years ago

@vanlummelhuizen once I have such a data structure, where is the best place to initialize it?

At the moment, I've put the initialization inside of the MinimalPairsListView, with a check whether it's been initialized. That indeed keeps it from being done more than once, but it's not the best place.

I've made the initial creation much faster by only storing the gloss ids of minimal pairs, rather than the user-friendly data; that will need to be generated for display. Obtaining this data structure (with just ids, for all datasets) takes about 10 minutes. I want to do this only once, and then lazily update it if changes have been made to glosses. Those are stored in the GlossRevision table.
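A minimal sketch of that lazy initialization, with assumed names and a placeholder for the expensive computation (not the actual Signbank code):

_MINIMAL_PAIRS_IDS = None   # maps focus gloss id -> list of minimal pair gloss ids

def _compute_minimal_pairs_ids():
    # Placeholder for the expensive computation over all datasets (~10 minutes).
    return {}

def get_minimal_pairs_ids():
    global _MINIMAL_PAIRS_IDS
    if _MINIMAL_PAIRS_IDS is None:          # only the first caller pays the cost
        _MINIMAL_PAIRS_IDS = _compute_minimal_pairs_ids()
    return _MINIMAL_PAIRS_IDS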

susanodd commented 3 years ago

What about inside reload_signbank ?

vanlummelhuizen commented 3 years ago

@susanodd The things you mention, like generate_translated_choice_list_table, are called when Signbank is (re)started, either from the command line or via reload_signbank. This is not what I meant. What I meant was that the calculated minimal pairs are stored persistently in a database, possibly but not necessarily the same database as the rest of the Signbank data. It works like a cache, if you will.

The procedure would be something like:

  1. Run the calculation once for all glosses
  2. Store the results in a database
  3. When a gloss has changed, remove minimal pairs that gloss is in
  4. Recalculate minimal pairs for that gloss and store them in the database

Generating a minimal pairs CSV would then be no more than retrieving what is in that database.
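A rough sketch of what such a cache could look like as a Django model; the model name, the minimal_pairs() helper, and the field names are assumptions (only the 'dictionary' app for Gloss comes from the repo), and steps 3 and 4 become a small refresh function:

from django.db import models

class MinimalPairCache(models.Model):
    focus_gloss = models.ForeignKey('dictionary.Gloss', related_name='+',
                                    on_delete=models.CASCADE)
    other_gloss = models.ForeignKey('dictionary.Gloss', related_name='+',
                                    on_delete=models.CASCADE)
    differing_field = models.CharField(max_length=50)

def refresh_minimal_pairs(gloss):
    # Step 3: remove cached pairs the changed gloss appears in.
    MinimalPairCache.objects.filter(models.Q(focus_gloss=gloss) |
                                    models.Q(other_gloss=gloss)).delete()
    # Step 4: recalculate only for this gloss and store the results.
    for other, field in gloss.minimal_pairs():   # hypothetical helper
        MinimalPairCache.objects.create(focus_gloss=gloss, other_gloss=other,
                                        differing_field=field)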

@Woseseltops what do you think?

PS. generate_translated_choice_list_table and related functions are things I am trying to get rid of in the restructuring of the FieldChoice setup #658 #661

susanodd commented 3 years ago

Yes, I realise that. I'll add one of those urls for generating it the first time, one that only we can use. I'm first getting the structure of minimal-pairs-related objects in place to gauge what minimal effort is needed. What is a bit worrying about "updating" the structure is that it's not just the objects that a changed object is a minimal pair with; those objects' minimal pairs also need to be recalculated, and so on. Minimal pairs also exist when one gloss has no value for a field and the other does. Steps 3 and 4 would need to be repeated until no more glosses are affected. Additionally, glosses without handedness are not included in minimal pairs, so once those are modified, that gloss will need all its minimal pairs computed, and so on.

susanodd commented 3 years ago

We may have considered bit strings at one point. Except we also have the field choices, which can also change.

susanodd commented 3 years ago

We could put buttons on the Datasets page, next to the ECV buttons, and just generate them the same way. EDIT: There could be an extra popup on the MP button that informs the user if the file needs to be generated again because glosses have been updated...

The CSV button on the Minimal Pairs page is primarily useful if you've done a query you're interested in.

susanodd commented 3 years ago

I can make a separate database for this. That might assist computations since the whole thing is non-linear.

EDIT: If I do this on my computer, is there anything I need to do in order to make sure it can be used? Should there be a 'stub' created for it on Global? There will be migrations for such a second database as well.

susanodd commented 3 years ago

I was just thinking about getting the time stamps of gloss edits out of the GlossRevision table, say, to check whether the minimal pairs structure needs to be updated. Since everything is dynamic with field choices too, conceivably we'd also need such information to see whether any field choices have been modified.
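A hedged sketch of that staleness check; the GlossRevision field names used here (gloss, time) and the lookup path from gloss to dataset are assumptions, not checked against the actual model:

from django.db.models import Max
from signbank.dictionary.models import GlossRevision

def minimal_pairs_stale(dataset, generated_at):
    # Latest recorded edit to any gloss in this dataset, per GlossRevision.
    latest = GlossRevision.objects.filter(
        gloss__lemma__dataset=dataset).aggregate(latest=Max('time'))['latest']
    return latest is not None and latest > generated_at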

vanlummelhuizen commented 3 years ago

What is a bit worrying about "updating" the structure is that it's not just the objects that a changed object is a minimal pair with; those objects minimal pairs also need to be recalculated, etc. etc.

OK. Let's do an experiment: gloss A has feature list 1,0,0,0, B has 1,1,0,0, C has 1,0,1,0 and D has 0,1,0,0. In this situation there are 3 minimal pairs: AB, AC and BD. Let's say A changes to 0,1,0,1. We then remove AB and AC. Recalculating everything for A results in AD. No need to recalculate anything else; BD stays as it was. Do you think I am overlooking something?
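A quick check of this experiment in throwaway Python (illustrative code, not Signbank's), treating a minimal pair as two feature lists that differ in exactly one position:

from itertools import combinations

def minimal_pairs(glosses):
    return sorted(a + b for (a, fa), (b, fb) in combinations(glosses.items(), 2)
                  if sum(x != y for x, y in zip(fa, fb)) == 1)

glosses = {'A': (1, 0, 0, 0), 'B': (1, 1, 0, 0),
           'C': (1, 0, 1, 0), 'D': (0, 1, 0, 0)}
print(minimal_pairs(glosses))    # ['AB', 'AC', 'BD']

glosses['A'] = (0, 1, 0, 1)      # A changes
print(minimal_pairs(glosses))    # ['AD', 'BD']: only A's own pairs changed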

I can make a separate database for this. That might assist computations since the whole thing is non-linear.

EDIT: If I do this on my computer, is there anything I need to do in order to make sure it can be used? Should there be a 'stub' created for it on Global? There will be migrations for such a second database as well.

Why do you think a separate database is best (I am not saying it is or it is not)? What do you mean by stub?

Since everything is dynamic with field choices, too, conceivably, we'd also need such information to see whether any field choices have been modified.

Only when a field choice is deleted, right? Changing a field choice is the same for all linked glosses, so for the two glosses in a minimal pair, a changed field choice does not affect whether it is a minimal pair or not.

susanodd commented 3 years ago

I'm using external json files now, one per dataset, to store the minimal pairs data (focus gloss id paired with a list of other object ids). This will assist the methods by being already computed.
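Roughly what these per-dataset files amount to (a sketch only; the directory and helper names are made up):

import json, os

MP_CACHE_DIR = '/var/lib/signbank/minimal_pairs'   # assumed location

def save_minimal_pairs(dataset_acronym, mp_dict):
    # mp_dict maps focus gloss id -> list of minimal pair gloss ids
    with open(os.path.join(MP_CACHE_DIR, dataset_acronym + '.json'), 'w') as f:
        json.dump(mp_dict, f)

def load_minimal_pairs(dataset_acronym):
    path = os.path.join(MP_CACHE_DIR, dataset_acronym + '.json')
    if not os.path.exists(path):
        return None        # fall back to computing from the database
    with open(path) as f:
        # json keys are strings, so convert them back to integer gloss ids
        return {int(k): v for k, v in json.load(f).items()}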

It's not as simple as your example. The minimal pairs are computed from the perspective of the focus gloss, and over 14 fields.

I'm not convinced it's easy to determine when to regenerate the info. In the worst case, somebody could keep modifying a gloss and keep searching to find out if it has any minimal pairs yet. But that might just be us. There are a lot of glosses where very little phonology has been filled in.

susanodd commented 3 years ago

Using just the ids in the data structure makes my comment about field choices not matter. But if we were to store CSV files, then this would matter, because the focus gloss is shown with the minimal pairs gloss and the human-readable field name that differs, plus the field values of both glosses.

susanodd commented 3 years ago

Hmmm. This external data structure thing isn't such a wise idea. It's okay for the smaller datasets with not so many minimal pairs. But for the NGT this is even worse than it was.

What I did so far is save the MP data to an external json file. I tried pickle first, but there was too much data. This is computed just once to create the files. Then the files are simply read in if they exist (at the moment this is done when minimal pairs are first needed; this is negligible). (Ignore anything about updating them for the moment.) Then when minimal pairs need to be shown, this table is used rather than querying the database. This works fine.

The big test, however, is exporting an MP CSV for the NGT dataset. This is worse than it was! Apparently the giant data structure (residing in memory, no less) is too big to help. The computation now freezes up the entire computer. I've had to reboot twice to stop it.

DEBUG doesn't work on my computer for some reason. I wanted to see what is causing this problem.

susanodd commented 3 years ago

Make that three reboots.

susanodd commented 3 years ago

Regarding the questions @vanlummelhuizen asks:

I'm using an external json file per dataset now: basically an externally stored giant dict mapping object ids to a list of the object ids of their minimal pairs. This does not seem to be very efficient, because it replaces the computation with accessing (potentially rather large) files instead. This should only be done once, but who is to say how often, since it depends on how frequently the users update glosses. The user could get the system thrashing by enthusiastic editing.

To use the external storage, it has to be initialized to None in Python/Django before it gets filled in (read from disk), and not until after the models / database have been loaded. That is what I mean by a stub in this context: the chicken or the egg. Something needs to be declared before it gets filled in, to avoid errors from the thing being used before it is declared. If we decided to have a separate database (you made the initial suggestion, by stating that this should not be stored in the normal database, implying it should be stored elsewhere), this "other" database has to exist before it gets filled in, just like the json files. Although how persistent all of this needs to be is largely a matter of opinion; we're only trying to make something efficient.

I was contemplating making a specialized database-like structure that could be computed over more readily, specifically for minimal pairs. It's true they ought to be relations, but they are not computed that way. You can't do a query to "find all minimal pairs", because it has to be with respect to a focus gloss. (The algorithm loops over focus glosses; that's how it generates the database queries to "find other glosses that differ by one field".) Sure, it can be formulated elegantly with logic, but that sounds like something that isn't algorithmic.

susanodd commented 3 years ago

The first test is done. Using the loaded-in tables described above, the export of the CSV for dataset NGT now takes circa 10 minutes. It previously took (and still takes, without the tables) 15 minutes. So a speed-up of about a third.

I'll try to find other improvements since it's still not that efficient.

vanlummelhuizen commented 3 years ago

you made the initial suggestion, by stating that this should not be stored in the normal database, implying it should be stored elsewhere

I said "possibly but not necessarily the same database as the rest of the Signbank data". In other words you could use the existing database or something else, whatever fits best.

Anyway, I think there are two problems here we would like to solve:

  1. Store/cache Minimal Pairs data for quick retrieval.
  2. Efficient calculation of Minimal Pairs. I say do it once for all glosses and then only for one gloss if that gloss has changed.

Perhaps this is something to discuss with the three of us, you, me and @Woseseltops, in a video conference call. What do you think?

susanodd commented 3 years ago

I'm adding signs to tstMH (locally) that are related to Pumpkin of Kata Kolok. (I'll upload them to the tstMH dataset later so it only needs to be done once.) I think it's useful as a test case, since it involves real signs with more fields filled in. We don't have many useful test glosses, especially for minimal pairs.

As described above, I made the data structure that uses gloss ids. Now I'm looking at what it is actually doing. There needs to be a url to regenerate the external data structure as needed; it's annoying otherwise.

I've added a lot of print statements to the MP computation to help identify where the computation is overloaded. I notice the code seems kind of redundant at times; I am the one to blame for this. Phase 2.

Probably some more deciphering with the test glosses is needed in order to do points 1 and 2. I'm working on determining "Why is it so inefficient, anyway?"

susanodd commented 3 years ago

I've made a significant improvement to the code!!!

I rewrote one of the functions that compares the minimal pairs objects and generates the table that gets shown in the displays. I modified it so it makes flat tuples for the glosses rather than dictionaries.

Now it took less than four minutes to generate the NGT CSV file!!

I ran the stopwatch on my Watch, so that's only approximate, since I started it before hitting CSV, and it was a surprise when the prompt came to save the file and I stopped the timer.

susanodd commented 3 years ago

After more improvements to the code... Now, without using any external storage, it takes circa 3 minutes to export the NGT minimal pairs!! Plus I got rid of several auxiliary methods that were being called.

One part I'm not sure about: I'm zipping three tuples together inside a loop over objects. High on code readability and elegance, but perhaps not wise. @Woseseltops @vanlummelhuizen ?

I ask because sometimes we disagree about code readability for purposes of code maintenance. I'm not entirely sure whether it matters to the compiler one way or the other. I had it without the zip of the tuples at first, but realised the code looked quite awful.

susanodd commented 3 years ago

By the way, committing this is awaiting merging of the previous pull request, since I'm still working on my previous branch before creating a new branch.

susanodd commented 3 years ago

Here is the relevant snippet:


            zipped_tuples = zip(settings.MINIMAL_PAIRS_FIELDS, focus_gloss_values_tuple, other_gloss_values_tuple)
            for (field_name, field_value, other_field_value) in zipped_tuples:

Either this is quite elegant, or an efficiency headache.

susanodd commented 3 years ago

I think the 3 minutes is acceptable. But could it be faster?

susanodd commented 3 years ago

This has been pushed to branch minimal_pairs_743.

Woseseltops commented 3 years ago

Hey @susanodd , if there is an optimization that makes the code 5 times as fast AND allows you to delete something like 75% of the lines, that's indeed a great discovery. From your code snippet, however, I'm not sure I understand it.

Say these are our glosses:

a = {'field1':0,'field2':1,'field3':1}
b = {'field1':1,'field2':1,'field3':1}
c = {'field1':0,'field2':0,'field3':1}
d = {'field1':0,'field2':1,'field3':0}

And we want to compare the first one to the other ones, comparing all fields:

focus_gloss = a
other_glosses = [b,c,d]
minimal_pairs_fields = ['field1','field2','field3']

We run the raw, unoptimized comparison like this:

for other_gloss in other_glosses:
    for field in minimal_pairs_fields:
        print('Comparing',focus_gloss,'to',other_gloss,'on',field)

This results in

Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 1, 'field2': 1, 'field3': 1} on field1
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 1, 'field2': 1, 'field3': 1} on field2
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 1, 'field2': 1, 'field3': 1} on field3
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 0, 'field2': 0, 'field3': 1} on field1
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 0, 'field2': 0, 'field3': 1} on field2
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 0, 'field2': 0, 'field3': 1} on field3
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 0, 'field2': 1, 'field3': 0} on field1
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 0, 'field2': 1, 'field3': 0} on field2
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 0, 'field2': 1, 'field3': 0} on field3

If I use your zipping approach...

for other_gloss,field in zip(other_glosses,minimal_pairs_fields):
    print('Comparing',focus_gloss,'to',other_gloss,'on',field)

...the result is:

Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 1, 'field2': 1, 'field3': 1} on field1
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 0, 'field2': 0, 'field3': 1} on field2
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 0, 'field2': 1, 'field3': 0} on field3

As you can see, it's skipping a lot of comparisons that would have resulted in a minimal pair. How do I need to change this simple code example to make it work like your solution?

susanodd commented 3 years ago

No, the majority of the work has already been done in the Query part that determines which objects can be minimal pairs. This part is specific to the already-reduced sets of objects; it is known that they differ by one field. The old code tediously went through and looked for this; the new code uses tuples instead. In the zip loop, it only needs to figure out which field is different. Previously, it created a dictionary that contained all the phonology fields. There are special cases in the Query code that check for empty values, so there can still be exceptions because of empty fields; that's why the "minimal pair contenders" need to be checked again to confirm there is only one field that is different. So the set has already been reduced.
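To make the contrast concrete, here is a minimal sketch (illustrative names and values) of what the zip in the new code does for a single focus gloss and one candidate that the query has already selected; it zips field names with the two value tuples, rather than zipping glosses with fields as in the example above:

MINIMAL_PAIRS_FIELDS = ['field1', 'field2', 'field3']

focus_values = (0, 1, 1)   # values of the focus gloss, in field order
other_values = (1, 1, 1)   # values of one candidate gloss from the query

for field_name, focus_value, other_value in zip(MINIMAL_PAIRS_FIELDS,
                                                focus_values, other_values):
    if focus_value != other_value:
        print('Differs on', field_name)   # prints the single differing field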

susanodd commented 3 years ago

Regarding the example fields you show: it's finding minimal pairs of the focus gloss, not minimal pairs of every gloss with each other. It's only comparing the focus gloss to the candidate other glosses, which are already known (via the query to find the contenders) to differ by one field.

susanodd commented 3 years ago

Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 1, 'field2': 1, 'field3': 1} on field1
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 1, 'field2': 1, 'field3': 1} on field2
Comparing {'field1': 0, 'field2': 1, 'field3': 1} to {'field1': 1, 'field2': 1, 'field3': 1} on field3

Huh? The first tuple is being compared to the second tuple three times.