Still having the same issue as above with larger files. These are the same instances of 'download failed - Network error'. Has the fix come through yet?
Definite improvement on the file sizes that download, but still seeing some larger files not being returned. The view data pane spends a while loading before stating 'No results found'; this is possibly when the request is too large, or it might be to do with something in the data. I'm testing further scenarios where this can be the case (i.e. more complex query builds). Wondering whether we can plan to increase the limits available, or whether they have reached the maximum?
Hi @Duncan-Knox ... could you please share a link to the data page, or the name of the operation.
This comes from a position of not knowing what the limits can be. Appreciate there is a point at which exports are too large to be of use and it is of course not reasonable to expect a near limitless download.
I've added a query 'CRS year extract 02-05-2019' http://138.68.138.212/queries/data/66/. Wondering whether it is possible to extract this size of file?
I'm aiming to continue testing to see whether more complex queries (on smaller datasets) end with the 'No results found' message, and will post if I spot anything.
I'd definitely argue that knowing any query size limits would be useful information for users, if possible.
Edit: @edwinmp had accidentally deleted the query previously linked. New data page link here: http://138.68.138.212/queries/data/66/
@Duncan-Knox @akmiller01 The operation above returned a 502 Bad Gateway error. Data was returned for limited operations of up to 10,000 records... It broke when I tried 100,000. My guess is that a timeout occurs because the operation takes too long @akmiller01 ... I'm looking into optimisation alternatives for this, mostly because I think the response shouldn't be this slow or even fail over just under 100,000 records...
@edwinmp does the branch you tested on include the estimated-count refactor already?
If it was merged into develop, then yeah...
>>> from core.models import Operation
>>> big_query = Operation.objects.get(pk=66)
>>> big_query.build_query(estimate_count=True)
('SELECT "n_live_tup" FROM "pg_stat_user_tables" WHERE "relname"=\'crs_current\' AND "schemaname"=\'repo\'', 'SELECT * FROM "repo"."crs_current" WHERE "year"=\'2017\'')
>>> result = big_query.query_table(limit=100000,offset=0,estimate_count=True)
...And then we get a process crash.
Watching top, it looks like the server attempts to access memory swap right before the crash:
So in other words, 100,000 rows of CRS is too big for the memory capacity of our server.
It only capped out at about 30% of the CPU, so processing power isn't the issue.
What's the largest possible number of rows we can have? It would be great to get enough resources to handle at least several times that.
This is what I thought would happen.
With the above, I don't think we should face the complexity of memory limitations.
The UI doesn't need to show all the data; the UI is just meant to display a sample of the data to enable the user to feel confident their query has worked before downloading the sheet...
It's beyond just the UI. The server is incapable of fetching that many rows from the database, full stop: plugging the query into PSQL crashes it:
analyst_ui=# select * from "repo"."crs_current" where "year"=2017;
root@ddw-analyst-ui-dev:~/ddw-analyst-ui#
BTW @akmiller01 ... you mentioned a while back that you'd build export infrastructure for file download ... couldn't find it, so I went with @Napho's first assumption ... @k8hughes even if we paginate the data, the export feature would require that we download it all.
And this might just be unique to the CRS. It's the only data-source we have where individual cells in individual rows can contain whole paragraphs of long_descriptions. They're particularly dense rows.
This is what I thought would happen:
- That at no time would someone fetch 100,000 rows to display on the UI.
- That all content of the tables would be paginated, with the page size limit set to probably 100 or so.
- That in the event of exporting the data, the data would be streamed to filesystem storage and a URL generated to download it from there, as opposed to loading it into the user's memory and saving it from memory to their desktop (a rough sketch of this follows below).
With the above, I don't think we should face the complexity of memory limitations.
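A minimal sketch of what option 3 could look like, assuming a Django view. The stream_query_to_storage helper, its module path and the URL handling are hypothetical and not taken from the repo; only core.models.Operation appears in the shell transcripts in this thread.

import os

from django.conf import settings
from django.http import JsonResponse

from core.models import Operation
from data.exports import stream_query_to_storage  # hypothetical helper


def export_operation(request, pk):
    # Sketch: write the full result set to MEDIA_ROOT, then hand back a URL.
    # The web worker never holds the whole result set in memory; NGINX can
    # serve the finished file directly.
    operation = Operation.objects.get(pk=pk)
    file_name = stream_query_to_storage(operation.build_query())
    return JsonResponse({"download_url": settings.MEDIA_URL + file_name})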
Option 3 seems wise ... when is this done? For each operation on creation?
Hi @edwinmp and @akmiller01. I worked a query through with Nik and we are not getting any output. Here's the query: http://138.68.138.212/queries/data/109/. I was wondering if it is possible to determine what is behind the potential bug?
The year value cannot be equal to both 2017 and 2018 at the same time. If you wanted everything between 2017 and 2018, you need to use less than or equal to 2018 and greater than or equal to 2017.
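To make that concrete, a trivial illustration in plain Python (not the query builder itself):

year = 2017

# Two equality filters combined with AND can never both hold:
print(year == 2017 and year == 2018)   # False for every possible year

# A range filter catches both years:
print(2017 <= year <= 2018)            # True for rows from 2017 and 2018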
Hi @akmiller01, fair point on the impracticality of the step (i.e. the user could opt for year is greater than 2016). However, the same logic gets output here on a different dataset: http://138.68.138.212/queries/build/113/. That's why I raised it, as I'd thought applying a new filter within a step acted as an 'or'.
I think it is because destination usage year id was selected! That won't have 2016 [edit 2017] in it.
Fair point, Duncan! Multiple filters in a step should be treated as OR. I guess I need more coffee this morning...
Don't think the last few comments are related to the export feature. @Duncan-Knox @akmiller01 should I close this?
Hi @edwinmp, the testing has seen a few further instances of data queries not loading. However, with the frequent development work on the exports, this can change as updates are pushed through. Would you like me to continue to share links that are not loading here?
The testers also raised that the column ordering of exports was not matching the ordering within the original sources. However, this has recently been changed for the CSV output, which is great. I'm unsure if it is intended that the ordering within the view data pane does not match the CSV output. That said, this to me is not a significant issue.
Hi @akmiller01, as discussed on Slack, here are a handful of queries where the output csv does not appear to contain all the data the query would generate:
This one aims to get the full CRS 2017 data set (approx. 257,587 rows). http://138.68.138.212/queries/data/142/
The first export I tried brought out 36,564 rows, the second 35,816, the third 36,147. It's as if the export grabs as much as possible within a window of time before opening; the grab may not get the full amount?
Similarly, in this query http://138.68.138.212/queries/build/148/ I've used select to remove most of the lengthy fields, and the download generates a larger file (84,839 rows on the first attempt, 81,649 on the second). So a higher amount, but not the full house.
Another similar data grab, for Agriculture Sector activities, obtains around 50,000-60,000 rows and varies by download: http://138.68.138.212/queries/data/150/
Flagging these as I was not sure if you'd been aware of this.
First thing I noticed: the import might be slightly different:
analyst_ui=# select count(*) from "repo"."crs_current" where "year"=2017;
count
--------
257430
(1 row)
But that cannot account for a difference of over 200k rows.
Re-added the stream_to_file function to test the fundamentals of streaming the PSQL output, and that worked flawlessly:
>>> from core.models import Operation
>>>
>>> from data.db_manager import stream_to_file
>>> op = Operation.objects.get(pk=142)
>>> stream_to_file(op.build_query())
(257430, '_django_curs_139692890453760_1.csv')
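The stream_to_file implementation itself isn't pasted in this thread; as a rough sketch of the idea (connection details, temp-file naming and the SQL wrapping here are all assumptions), PostgreSQL's COPY ... TO STDOUT can push the rows straight into a file without materialising them in Python:

import os
import tempfile

import psycopg2


def stream_to_file(sql, dbname="analyst_ui"):
    # Sketch only: stream a query's rows to a CSV file on disk, returning
    # (row_count, file_name) to mirror the tuple in the transcript above.
    conn = psycopg2.connect(dbname=dbname)  # placeholder credentials
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "w") as out, conn.cursor() as cur:
            # COPY ... TO STDOUT streams server-side; Python only relays text.
            cur.copy_expert(
                "COPY ({}) TO STDOUT WITH CSV HEADER".format(sql.rstrip(";")), out
            )
            row_count = cur.rowcount  # rows copied, as reported by PostgreSQL
    finally:
        conn.close()
    return row_count, os.path.basename(path)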
So as I speculated in Slack, this must be due to the StreamingHttpResponse, either on the Django or NGINX side. Going to try disabling the NGINX buffer to see if that's the cause.
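For context, the pattern under suspicion is roughly the following (a sketch, not the project's actual view code): a generator feeds Django's StreamingHttpResponse row by row, so anything sitting between Django and the browser that buffers or times out, such as NGINX's proxy buffer or a worker timeout, can silently truncate the download.

import csv

from django.db import connection
from django.http import StreamingHttpResponse


class Echo:
    # Pseudo-buffer: csv.writer wants a .write(); we just hand each row back.
    def write(self, value):
        return value


def export_csv(request, sql, filename="export.csv"):
    # Sketch of a streamed CSV export; the names here are illustrative.
    def rows():
        writer = csv.writer(Echo())
        with connection.cursor() as cursor:
            cursor.execute(sql)
            yield writer.writerow([col[0] for col in cursor.description])
            while True:
                batch = cursor.fetchmany(1000)  # chunked, not fetchall()
                if not batch:
                    break
                for row in batch:
                    yield writer.writerow(row)

    response = StreamingHttpResponse(rows(), content_type="text/csv")
    response["Content-Disposition"] = 'attachment; filename="{}"'.format(filename)
    # NGINX honours this header, which disables proxy buffering for just this
    # response (an alternative to proxy_buffering off in the site config).
    response["X-Accel-Buffering"] = "no"
    return response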
Confirming the static file actually contains the full data:
Fixed on staging and here: https://github.com/devinit/ddw-analyst-ui/pull/99
Gunicorn and NGINX default to timeouts of around 30-60 seconds, so the file stream would end abruptly when the timeout was hit. Changed them to 10 minutes each.
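For reference, this is the kind of change being described; a sketch only, since the actual deployment files aren't shown in the thread. Gunicorn's worker timeout lives in its config (a plain Python file), and the NGINX side is proxy_read_timeout / proxy_send_timeout in the site config (default 60s).

# gunicorn.conf.py -- sketch of the timeout bump described above; the bind
# address and worker count are assumptions, not this project's real config.
bind = "127.0.0.1:8000"
workers = 3
timeout = 600  # ten minutes, up from Gunicorn's 30-second default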
I've re-tested the fix with the outputs linked above and they are coming out fine.
@akmiller01 This export is failing with a server error http://ddw.devinit.org/queries/data/258
On a train, but ssh into the server and check df -h; if we're at 100% disk usage, that's why it's failing.
Alternatively, Nginx needs proxy_buffering off;
I found that on the server when redeploying develop, and it wasn't in any of our branches, so I figured it may not have been needed.
I am getting instances of 'download failed - Network error'. This appears to be from attempts to export larger files. Smaller files seem to export fine. Is there a limit here, and could it possibly be altered?