Open · amplifi opened this issue 7 years ago
This also affects the regular project data download. When project resources are exported, the exporter downloads all resource files from S3 to a local directory and then zips them up.
Here is where the regular exporter goes through each resource file to zip them up: https://github.com/Cadasta/cadasta-platform/blob/master/cadasta/organization/download/resources.py#L72
And here is where each resource file is downloaded to the local directory in django-buckets: https://github.com/Cadasta/django-buckets/blob/master/buckets/storage.py#L33-L49
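In rough pseudocode, the pattern in the linked code looks something like this (a simplified paraphrase with hypothetical names, not the actual implementation):

```python
# Rough paraphrase of the current flow (hypothetical names, not the real code):
# every resource file is copied from S3 to local disk before the zip is built.
import os
import zipfile

def export_project_resources(resources, temp_dir, zip_path):
    local_paths = []
    for resource in resources:
        # django-buckets' S3 storage pulls the whole object down to a local file
        local_path = os.path.join(temp_dir, resource.original_file)
        with open(local_path, 'wb') as local_file:
            local_file.write(resource.file.read())
        local_paths.append(local_path)

    # Only once everything is on local disk does the archive get assembled
    with zipfile.ZipFile(zip_path, 'w') as archive:
        for path in local_paths:
            archive.write(path, arcname=os.path.basename(path))
```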
I've modified the title to cover both cases. @dpalomino this is added to the backlog for prioritization.
As suggested by @amplifi, the better process is to never download the resource files to the local server. They should be retrieved from S3 one at a time, zipped on-the-fly, and the resulting zip streamed into S3, from which the user can then download it once it is complete. This can be done as part of #490.
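A minimal sketch of that idea, assuming boto3 and Python 3.6+ (the bucket, keys, and sink are placeholders, not the platform's code): `zipfile` can write to a non-seekable file-like object, so each resource can be copied from S3 into the archive in fixed-size chunks and the archive bytes handed to whatever uploads them, without touching local disk.

```python
# Sketch: build a zip from S3 objects without staging them on local disk.
import zipfile
import boto3

def zip_s3_objects(bucket, keys, sink, chunk_size=1024 * 1024):
    """Stream each S3 object into a zip written to `sink`.

    `sink` can be any writable binary file-like object (a pipe, an HTTP
    response, or a multipart-upload wrapper); it does not need to be seekable.
    """
    s3 = boto3.client('s3')
    # Compression is left at ZIP_STORED because most resources (images, PDFs)
    # are already compressed; allowZip64 so archives over 4 GB don't fail.
    with zipfile.ZipFile(sink, mode='w', allowZip64=True) as archive:
        for key in keys:
            body = s3.get_object(Bucket=bucket, Key=key)['Body']
            # ZipFile.open(..., 'w') gives a writable entry stream (Python 3.6+),
            # so the object is copied through in chunks rather than read whole.
            with archive.open(key, mode='w') as entry:
                for chunk in iter(lambda: body.read(chunk_size), b''):
                    entry.write(chunk)
```

The `sink` here could be anything with a `write()` method, for example the multipart-upload wrapper sketched further down in this thread.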
We need to either fix this bug for the release or remove the export code from the build. This isn't safe for deployment in its current state.
I believe that I could resolve this in #1400. Additionally, it feels like #490 is superseded by #1400 as well (are there any others I'm unaware of?). I chatted briefly with @amplifi about this today. Admittedly, this is how I had planned to handle the worker tooling too (it's how I've always dealt with file-processing workers). She tipped me off to an alternative which I can attempt to implement (I think I would change some things around for our needs).
> We need to either fix this bug for the release or remove the export code from the build. This isn't safe for deployment in its current state.
@amplifi Is this new in any way? Isn't this already released and in production?
@alukach The export-all-project-data code is already live and should be reverted as well. That functionality times out during the export process, and it's less likely to be used frequently (there have been no attempts other than ours to use it in demo/prod to date). Search exports are a significantly higher risk because of the frequency with which users could/would request exports of their results, and because each export would likely contain a subset of total project resources; since search exports don't time out the way all-data exports do, more of them complete successfully, and thus more resource files get downloaded to the local disk.
@amplifi Thanks for the clarification, I somehow missed the fact that this is about search exports rather than general exports. I no longer believe that #1400 would resolve this issue (unless the scope of #1400 changes). Eventually, I think we could follow the same pattern as #1400 for this feature.
To be honest, I don't completely understand what you mean by "the frequency with which users could/would request exports of their results", or how that differs from any other type of export (maybe this isn't even important). Looking at this code, it seems like the biggest issue is that there's no cleanup of the downloaded contents (unless I'm missing that too). That would mean it's inevitable that the disk fills up, regardless of frequency; it's simply a matter of time.
EDIT:
Looking at the code, is this also true for general exports? It feels like I may just not be seeing where packaged exports get deleted, but I can't find that logic for either export functionality.
@alukach The issue is about both all-data exports and search exports (the latter reused the code/method from the former).
There is a daily cronjob that cleans up the temp folder, as it is a subfolder of MEDIA_ROOT; see https://github.com/Cadasta/cadasta-platform/blob/689d0ef42d6f03b8a3fd608a37c80b836b77f2bd/provision/roles/webserver/production/tasks/main.yml#L104
This job was configured to clean up after the thumbnail generation process for image resource files uploaded via the UI. I wasn't advised that either export process would be downloading files to the local disk, and wasn't a reviewer on the corresponding PRs, so use of this folder for exports is news to me, and the daily cleanup would be inadequate for that expanded scope.
The rest might sound counter-intuitive at first. There are two factors contributing to the increased risk for search exports vs. all-data exports: (1) whether or not the export succeeds, and (2) likely user behavior. For (1), we have an existing issue where all-data exports fail due to timeout, which was the urgent driver behind refactoring import & export to be asynchronous. That's why the all-data export process downloading files to local disk hasn't caused problems in production environments yet; the feature breaks. The search export feature exports a subset of project data with each run and doesn't time out, so resource files will be downloaded to the local disk with each search export request. For (2), if a user is exporting search results, it's expected that they'll search, trigger an export, possibly modify the search query, trigger another export, and so on. It's also more likely that a user will need a subset of project data (via search exports) than the entire project data set (via all-data exports), so we should expect more requests for search exports than for all-data exports in general. Once a user has exported all project data, there's nothing more for them to export; with search exports, there's always more data available for export than is contained in any single search export.
@amplifi Got it, thanks. How would you feel if the solution for either type of export was to send the completed file to S3 and delete the local files immediately at the end of the web request/asynchronous task?
@alukach Any solution would have to avoid downloading the resource files to local disk. If a single project can have ~500k resources, it's not hard for a subset of those to reach a cumulative size on disk that would create space issues (the footprint per export request would be the total sum of the individual resource file sizes plus the size of the resultant zip file), and that's before getting into the issue of requests whose runtimes overlap (multiple users across projects submitting export requests, or a single user submitting multiple export requests). Downloading these files and then transferring the zip back to S3 also monopolizes platform I/O resources and could result in degraded platform performance for all users; it's more problematic and volatile than general scaling questions because of the disproportionate amount of server resources the export processes require and the unpredictability of when export requests will be triggered. But at this stage that takes a backseat to the local disk issue.
Ideally, the export .zip file should be built away from the platform server (as was the plan for the async I/O pipeline). As a bare minimum, if we absolutely have to create the csv files on the platform server (this process will get increasingly intensive the more records we have[1]), then create the csvs, zip them, shunt that off to S3, delete the temp files for that job, and continue assembling the remainder of the zip file contents as planned for async I/O: a separate service outside the platform parses the list of corresponding resource filenames and adds those resource files directly into the existing zip alongside the csvs. Then the user downloads the export zip file straight from S3. There's no need to build the zip all in one place.
[1] Guideline: if the platform would struggle to display party/location/resource views using front-end pagination only, it will also struggle to build csv files for a project export containing the same quantity of records. It's the same general process of querying everything, holding it locally, and manipulating the data.
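On [1]: if the csvs do end up being generated on the platform server, a memory-bounded way to do it (a sketch only; the queryset, field, and column names are hypothetical, not the platform's schema) is to stream rows from the database cursor straight into the csv writer instead of materialising the whole result set:

```python
import csv

def write_locations_csv(queryset, path):
    """Stream queryset rows into a csv file without loading them all at once.

    `queryset` is assumed to be a Django queryset; iterator() makes the
    database cursor stream rows rather than caching the full result set
    in memory before any row is written.
    """
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'type', 'attributes'])  # hypothetical columns
        for location in queryset.iterator():
            writer.writerow([location.id, location.type, location.attributes])
```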
A minor complication is that in S3 we cannot append to an existing file.[ref] So either you build the complete zip file somewhere and then upload it to S3, or you stream the zip file to S3 as it is being built, to minimize local storage use.
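For reference, streaming to S3 without appending is possible via multipart uploads: the archive is shipped in numbered parts and S3 assembles them on completion, with the constraint that every part except the last must be at least 5 MiB. A minimal sketch assuming boto3 (bucket/key are placeholders, error handling omitted):

```python
import boto3

class S3MultipartSink:
    """Minimal file-like wrapper that streams written bytes to S3 as
    multipart-upload parts (each part except the last must be >= 5 MiB)."""

    def __init__(self, bucket, key, part_size=8 * 1024 * 1024):
        self.s3 = boto3.client('s3')
        self.bucket, self.key, self.part_size = bucket, key, part_size
        self.upload_id = self.s3.create_multipart_upload(
            Bucket=bucket, Key=key)['UploadId']
        self.buffer = bytearray()
        self.parts = []

    def write(self, data):
        self.buffer.extend(data)
        while len(self.buffer) >= self.part_size:
            self._flush_part(bytes(self.buffer[:self.part_size]))
            del self.buffer[:self.part_size]
        return len(data)

    def _flush_part(self, chunk):
        number = len(self.parts) + 1
        response = self.s3.upload_part(
            Bucket=self.bucket, Key=self.key, UploadId=self.upload_id,
            PartNumber=number, Body=chunk)
        self.parts.append({'ETag': response['ETag'], 'PartNumber': number})

    def flush(self):
        pass  # nothing to do; parts are flushed by size in write()

    def close(self):
        if self.buffer:  # final, possibly short, part
            self._flush_part(bytes(self.buffer))
            self.buffer.clear()
        self.s3.complete_multipart_upload(
            Bucket=self.bucket, Key=self.key, UploadId=self.upload_id,
            MultipartUpload={'Parts': self.parts})
```

A wrapper like this could serve as the sink for the zip-building sketch earlier in the thread; on failure, `abort_multipart_upload` should be called so S3 doesn't keep the orphaned parts around.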
@seav I never suggested appending to an existing file in S3. Please re-read my comment. I suggested having a service outside the platform server that performs the operation of adding the resource files to a zip file with the csvs. If you reference the link I sent to you and @alukach yesterday, which is also linked above in an earlier comment in this thread, you'll see how this can be done for resources without having to download files. There's also no need to stream anything across the platform server. That can all be handled externally.
Flagging for the future: we'll also want to consider the similar but lesser scale of impact for large shape files.
For the sake of discussion, how would people feel about a temporary patch for this:
Doing this may be faster than waiting for the completion of #1400.
Ultimately, there are always going to be limits when dealing with downloads. In my mind, this comes down to either temporarily occupying disk space and then using a block of memory for a short time to construct the zip, or occupying a block of memory for longer while you stream files into the zip (a technique I'm not really familiar with). Breaking it off into a separate service and using a queue to limit throughput will add a lot of stability to the system and minimize negative impact on other services, but I think we'll always be beholden to our available disk space and memory. I don't know if we'll ever be able to handle creating a zip 1 MB larger than either the RAM or the disk space of the machine generating it. Maybe I'm being short-sighted and there's some zip-on-the-fly-while-streaming-to-S3 workflow that could handle really huge amounts of data, but in an attempt to be pragmatic I would recommend settling on a "good enough for now, revisit if it looks like it's close to causing issues" solution if we are hoping to get this out any time soon.
@alukach another temporary patch @oliverroick brought up was, instead of actually downloading the files, providing links to them. Thoughts?
@wonderchook @oliverroick @alukach I would prefer providing links to resources in the csv instead of including the actual files in the zip. It's a cleaner fix, no technological trade-offs, still provides the same content without disabling the feature, and doesn't require the re-work of the other options. I think continuing to offer it as an export option could be beneficial to our partners on slow connections, as well.
There is a usability issue with just providing the URLs to the resource files as they are stored in S3: the files will be downloaded under their randomized filenames instead of their original filenames. See related issue #778.
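If the links route were taken, one possible workaround for the filename problem (a sketch, not necessarily what #778 settled on; it assumes boto3 and that the original filename is stored alongside each resource) is to hand out presigned URLs that override the Content-Disposition header, so the browser saves the file under its original name rather than the randomized S3 key:

```python
import boto3

def presigned_resource_url(bucket, key, original_filename, expires=3600):
    """Return a time-limited S3 URL that downloads under the original name."""
    s3 = boto3.client('s3')
    return s3.generate_presigned_url(
        'get_object',
        Params={
            'Bucket': bucket,
            'Key': key,
            # Ask S3 to serve the object with a Content-Disposition header
            # carrying the human-readable filename instead of the random key.
            'ResponseContentDisposition':
                'attachment; filename="{}"'.format(original_filename),
        },
        ExpiresIn=expires,
    )
```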
I think I misunderstood the need for downloading data. I assumed that we were generating the CSVs in the request/response cycle. Are they already on S3 and we're just zipping them up with an Excel document? If so, then we should definitely just provide the links to the user. 👍
@seav HTML5 introduced a `download` attribute for `<a>` tags, so we should be good on setting the filename (assuming that it's passed back from the server with the URL to the S3 resource).
@alukach, when people try to download a resource file from the resource detail page on our platform, we are already setting the `download` attribute to the original filename, as can be seen here: https://github.com/Cadasta/cadasta-platform/blob/master/cadasta/templates/resources/project_detail.html#L59
This works on the dev VM when we are using the fake S3 storage, but doesn't work when using the actual S3 storage. Hence issue #778 was filed. Please check out the conversation there for details.
See https://github.com/Cadasta/cadasta-platform/issues/778#issuecomment-299613597
There are existing solutions to these problems. We should focus on implementing this properly; every kludge makes the platform less stable and future development more painful.
> @alukach another temporary patch @oliverroick brought up was, instead of actually downloading the files, providing links to them. Thoughts?
I think that would make things much more complicated for the users. Even with a project with just a couple of hundred resources, the user would need to spend a lot of time downloading the resources individually. And with a project with thousands of records, it would be really difficult for a regular user to retrieve all the resources. They would need the knowledge to create a simple script to retrieve all the resources from the list of links, and this is something we cannot expect from our regular users.
@dpalomino do you have any specific use cases of how people use the export function? If someone wanted to export all the resources, what would they be doing?
@wonderchook, @dpalomino: I think the largest use case for exporting resources is to bulk print. There's currently no easy way to bulk print all resources (deeds or titles, for example) in Cadasta, right?
@SteadyCadence Okay, so it would be good to also capture that use case as a feature request, since I think downloading all the resources and printing them individually is sort of a workaround for that as well.
Not saying we shouldn't allow bulk download; we should also get to the root problem users are trying to solve.
I'd mention the following use cases (apart from printing deeds or other documents):
With the implementation of #490, this problem should be alleviated a bit. To be revisited then.
Fix with async exports
@oliverroick I actually think this is about search exports and is thus still outstanding. Feel free to close if you disagree.
https://github.com/Cadasta/cadasta-platform/blob/master/cadasta/search/export/resource.py#L56
Search export as currently implemented uses `curl` to download all project resource files from S3 to the local server to build the .zip file. For any project with the typical volume of resources expected in production, this operation will exceed the server's available local disk space and cause a total platform outage. This feature should not be deployed until the implementation is altered.