devinit / ddw-analyst-ui

The Development Data Warehouse
http://ddw.devinit.org/
GNU General Public License v3.0

Export #41

Closed: k8hughes closed this issue 5 years ago

Duncan-Knox commented 5 years ago

I am getting instances of 'download failed - Network error'. This appears to happen when attempting to export larger files; small files seem to export fine. Is there a limit here, and can it possibly be altered?

Duncan-Knox commented 5 years ago

Still having the same issue as above with larger files: the same instances of 'download failed - Network error'. Has the fix come through yet?

Duncan-Knox commented 5 years ago

Definite improvement on the file size download, but I'm still seeing some larger files not being returned. The view data pane spends a while loading before stating 'No results found'; this is possibly when the request is too large, or it might be to do with something in the data. I'm testing further scenarios where this can be the case (i.e. more complex query builds). Wondering if we can plan to increase the available limits, or whether they have already reached the maximum?

edwinmp commented 5 years ago

Hi @Duncan-Knox ... could you please share a link to the data page, or the name of the operation?

Duncan-Knox commented 5 years ago

This comes from a position of not knowing what the limits can be. I appreciate there is a point at which exports are too large to be of use, and it is of course not reasonable to expect a near-limitless download.

I've added a query 'CRS year extract 02-05-2019': http://138.68.138.212/queries/data/66/. Wondering whether it is possible to extract a file of this size?

I'm aiming to continue testing to see whether more complex queries (on smaller datasets) end with 'No results found', and will post if I spot anything.

Duncan-Knox commented 5 years ago

I'd definitely argue that knowing any query size limits would be useful information for users, if possible.

Duncan-Knox commented 5 years ago

Edit: @edwinmp had accidentally deleted the previously linked query. New data page link here: http://138.68.138.212/queries/data/66/

edwinmp commented 5 years ago

@Duncan-Knox @akmiller01 The operation above returned a 502 Bad Gateway error. Data was returned for limited operations of up to 10,000 records... it broke when I tried 100,000. My guess is that a timeout occurs because the operation takes too long @akmiller01 ... I'm looking into optimisation alternatives for this, mostly because I think the response shouldn't be this slow, or even fail, at just under 100,000 records...

akmiller01 commented 5 years ago

@edwinmp does the branch you tested on include the estimated-count refactor already?

edwinmp commented 5 years ago

If it was merged into develop, then yeah...

akmiller01 commented 5 years ago

>>> from core.models import Operation
>>> big_query = Operation.objects.get(pk=66)
>>> big_query.build_query(estimate_count=True)
('SELECT "n_live_tup" FROM "pg_stat_user_tables" WHERE "relname"=\'crs_current\' AND "schemaname"=\'repo\'', 'SELECT * FROM "repo"."crs_current" WHERE "year"=\'2017\'')
>>> result = big_query.query_table(limit=100000,offset=0,estimate_count=True)

...And then we get a process crash. Watching top, it looks like the server attempts to access memory swap right before the crash (screenshot of top output attached).

So in other words, 100,000 rows of CRS is too big for the memory capacity of our server.
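
For reference, one common way around that memory ceiling (a minimal sketch only, with a placeholder connection string, not the project's actual db_manager code) is to read the result through a psycopg2 server-side cursor, so rows arrive in batches instead of being materialised all at once:

# Sketch only: pull the year-2017 CRS rows in batches via a named (server-side) cursor,
# so the full result set never has to fit in the application server's memory.
# The DSN below is a placeholder, not the project's real database settings.
import psycopg2

conn = psycopg2.connect('dbname=analyst_ui')      # placeholder DSN

with conn.cursor(name='crs_export') as cur:       # a named cursor stays on the Postgres side
    cur.itersize = 10000                          # rows fetched per network round trip
    cur.execute('SELECT * FROM "repo"."crs_current" WHERE "year" = %s', ('2017',))
    for row in cur:
        pass  # write each row out (e.g. to a CSV) instead of accumulating it in a list

conn.close()

Iterating this way keeps memory use roughly constant no matter how many rows the operation returns.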

akmiller01 commented 5 years ago

Only capped out at about 30% of the CPU, so processing power isn't the issue.

edwinmp commented 5 years ago

What's the largest possible number of rows we can have? It would be great to get us enough resources to handle at least several times that.

Napho commented 5 years ago

This is what I thought would happen:

  1. That at no time would someone fetch 100,000 rows to display on the UI
  2. That all content of the tables would be paginated, with the page size limit set to probably 100 or so
  3. That in the event of exporting the data, the data would be streamed to filesystem storage and a URL generated to download it from there, as opposed to loading it into the user's memory and saving it from memory to their desktop (a rough sketch of this follows below)

With the above, I don't think we should face the complexity of memory limitations.
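
A minimal sketch of what point 3 could look like, assuming psycopg2 and PostgreSQL's COPY ... TO STDOUT (the function name, DSN, and output path are illustrative, not the project's actual export code): the query result is streamed straight into a CSV file on disk, and the returned path can then be exposed as a download URL.

# Illustrative only: stream a query's results to a CSV file on disk so the server never
# holds the full result set in memory; the file can then be served by URL.
import psycopg2

def stream_query_to_csv(sql, out_path, dsn='dbname=analyst_ui'):   # placeholder DSN
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur, open(out_path, 'w') as out_file:
            # COPY (query) TO STDOUT streams rows from Postgres directly into the file object.
            cur.copy_expert('COPY ({}) TO STDOUT WITH CSV HEADER'.format(sql), out_file)
        return out_path
    finally:
        conn.close()

The sql argument is assumed to come from the trusted query builder (e.g. Operation.build_query()), not directly from user input.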

k8hughes commented 5 years ago

The UI doesn't need to show all the data; it is just meant to display a sample of the data so the user can feel confident their query has worked before downloading the sheet...

akmiller01 commented 5 years ago

It's beyond just the UI. The server is incapable of fetching that many rows from the database, full stop. Plugging the query into psql crashes it:

analyst_ui=# select * from "repo"."crs_current" where "year"=2017;
root@ddw-analyst-ui-dev:~/ddw-analyst-ui# 

edwinmp commented 5 years ago

BTW @akmiller01 ... you mentioned a while back that you'd build export infrastructure for file download ... I couldn't find it, so I went with @Napho's first assumption ... @k8hughes even if we paginate the data, the export feature would still require that we download it all.

akmiller01 commented 5 years ago

And this might just be unique to the CRS. It's the only data-source we have where individual cells in individual rows can contain whole paragraphs of long_descriptions. They're particularly dense rows.

edwinmp commented 5 years ago

> This is what I thought would happen:
>
>   1. That at no time would someone fetch 100,000 rows to display on the UI
>   2. That all content of the tables would be paginated, with the page size limit set to probably 100 or so
>   3. That in the event of exporting the data, the data would be streamed to filesystem storage and a URL generated to download it from there, as opposed to loading it into the user's memory and saving it from memory to their desktop
>
> With the above, I don't think we should face the complexity of memory limitations.

Option 3 seems wise ... when would this be done? For each operation, on creation?

Duncan-Knox commented 5 years ago

Hi @edwinmp and @akmiller01. I worked through a query with Nik and we are not getting any output. Here's the query: http://138.68.138.212/queries/data/109/. I was wondering if it is possible to determine what is behind this potential bug?

akmiller01 commented 5 years ago

The year value cannot be equal to 2017 and 2018 at the same time (see attached screenshot).

If you wanted between 2017 and 2018, you need to use 'less than or equal to 2018' and 'greater than or equal to 2017'.
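
To illustrate the difference (the helper below is hypothetical, not the actual DDW query builder), here is how joining a step's filters with AND versus OR plays out for these year filters:

# Hypothetical helper for illustration only; it is not how ddw-analyst-ui builds queries.
def combine_filters(filters, connector):
    """Join simple (column, operator, value) filters into a SQL WHERE clause."""
    clauses = ['"{}" {} {}'.format(column, operator, value) for column, operator, value in filters]
    return ' {} '.format(connector).join(clauses)

year_filters = [('year', '=', 2017), ('year', '=', 2018)]

# AND: no single row can satisfy both equalities, so the query returns nothing.
print(combine_filters(year_filters, 'AND'))   # "year" = 2017 AND "year" = 2018

# OR: rows from either year match, which is how filters within one step are expected to behave.
print(combine_filters(year_filters, 'OR'))    # "year" = 2017 OR "year" = 2018

# The range alternative suggested above:
print(combine_filters([('year', '>=', 2017), ('year', '<=', 2018)], 'AND'))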

Duncan-Knox commented 5 years ago

Hi @akmiller01, fair point on the impracticality of the step (i.e. the user could opt for 'year is greater than 2016'). However, the same logic does produce output here on a different dataset: http://138.68.138.212/queries/build/113/. That's why I raised it, as I'd thought the application of a new filter within a step acted as an 'or'.

Duncan-Knox commented 5 years ago

I think it is because destination usage year id was selected! That won't have 2016 [edit: 2017] in it.

akmiller01 commented 5 years ago

Fair point, Duncan! Multiple filters in a step should be treated as OR. I guess I need more coffee this morning...

edwinmp commented 5 years ago

Don't think the last few comments are related to the export feature. @Duncan-Knox @akmiller01 should I close this?

Duncan-Knox commented 5 years ago

Hi @edwinmp, testing has turned up a few further instances of data queries not loading. However, with the frequent development work on the exports, this can change as updates are pushed through. Would you like me to continue to share links that are not loading here?

The testers also raised that the column ordering of exports was not matching the ordering within the original sources. However, this has recently been changed for the CSV output, which is great. I'm unsure if it is intended that the ordering within view data does not match the CSV output. That said, this is not a significant issue to me.

Duncan-Knox commented 5 years ago

Hi @akmiller01, as discussed on Slack, here are a handful of queries where the output csv does not appear to contain all the data the query would generate:

This one aims to get a full CRS data set for 2017 (approx. 257,587 rows): http://138.68.138.212/queries/data/142/

The first export I tried brought out 36,564 rows, the second 35,816, the third 36,147. It's as if the export grabs as much as possible within a window of time before opening, and the grab may not get the full amount?

Similarly, in this query http://138.68.138.212/queries/build/148/ I've used select to remove most of the lengthy fields, and the download generates a larger file (84,839 rows on the first attempt, 81,649 on the second). So a higher amount, but not the full house.

Another similar data grab, for Agriculture Sector activities, will obtain around 50,000-60,000 rows and varies by download: http://138.68.138.212/queries/data/150/

Flagging these as I was not sure if you'd been aware of this one.

akmiller01 commented 5 years ago

First thing I noticed: the import might be slightly different:

analyst_ui=# select count(*) from "repo"."crs_current" where "year"=2017;
 count  
--------
 257430
(1 row)

But that cannot account for a difference of over 200k rows.

Re-added the stream_to_file function to test the fundamentals of streaming from PostgreSQL, and that worked flawlessly:

>>> from core.models import Operation
>>> 
>>> from data.db_manager import stream_to_file
>>> op = Operation.objects.get(pk=142)
>>> stream_to_file(op.build_query())
(257430, '_django_curs_139692890453760_1.csv')

So as I speculated in Slack, this must be due to the StreamingHttpResponse, either on the Django or NGINX side. Going to try disabling the NGINX buffer to see if that's the cause.
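
For context, this is roughly the pattern under suspicion (a sketch assuming Django's StreamingHttpResponse and the standard Echo pseudo-buffer idiom, not the project's actual export view): rows are written to the client as they are generated, so if Gunicorn or NGINX times out mid-stream the client receives a truncated but otherwise normal-looking CSV, which would explain the varying row counts above.

# Sketch only: a streaming CSV export view. 'rows' stands in for an iterator over query
# results (e.g. a server-side cursor); the view name and parameters are illustrative.
import csv

from django.http import StreamingHttpResponse


class Echo:
    """Pseudo-buffer whose write() just hands back the rendered row for streaming."""
    def write(self, value):
        return value


def export_csv(request, rows):
    writer = csv.writer(Echo())
    response = StreamingHttpResponse(
        (writer.writerow(row) for row in rows),  # each row is sent as soon as it is produced
        content_type='text/csv',
    )
    response['Content-Disposition'] = 'attachment; filename="export.csv"'
    return response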

akmiller01 commented 5 years ago

Confirming the static file actually contains the full data (screenshot attached).

akmiller01 commented 5 years ago

Fixed on staging and here: https://github.com/devinit/ddw-analyst-ui/pull/99

Gunicorn and NGINX default to timeouts of around 30-60 seconds, so the file stream would end abruptly when the timeout was hit. Changed them both to 10 minutes.
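
For reference, the Gunicorn half of that change can be expressed in a Python config file (a sketch using the 10-minute value described above; the actual change lives in the PR and server config, and the NGINX proxy timeouts need raising separately):

# gunicorn.conf.py (sketch): Gunicorn kills a worker that takes longer than 'timeout'
# seconds to respond, which cuts a long CSV stream short. The default is 30 seconds.
timeout = 600  # 10 minutes, per the fix described above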

Duncan-Knox commented 5 years ago

I've re-tested the fix with the outputs linked above and they are coming out fine.

edwinmp commented 5 years ago

@akmiller01 This export is failing with a server error http://ddw.devinit.org/queries/data/258

akmiller01 commented 5 years ago

On a train, but ssh into the server and check df -h; if we're at 100% disk usage, that's why it's failing.

akmiller01 commented 5 years ago

Alternatively, NGINX needs 'proxy_buffering off;'. I found that setting on the server when redeploying develop, and since it wasn't in any of our branches I figured it may not have been needed.