SEL-Columbia / formhub

Mobile Data Collection made easy.
http://formhub.org
BSD 2-Clause "Simplified" License
259 stars 163 forks source link

Cleanup export temp attachment files #1326

Open dpapathanasiou opened 10 years ago

dpapathanasiou commented 10 years ago

Changed create_attachments_zipfile() to stop leaving behind zip files in the system tmp folder which never get cleaned up.

dpapathanasiou commented 10 years ago

It's not properly tested (b/c there are no good data sets for dev) but the changes are straightforward, and in the worst case scenario, someone's attached file export will fail.

But on the plus side, /tmp will no longer fill up with unnecessary files, and push the server to 0% free disk space.

ukanga commented 10 years ago

I see that this still writes to the file system in the case of an export, this essentially has the same effect as the tmp folder with the disadvantage that now even after a reboot the file will still exist. Would it be ideal to use django storage here so that it can be saved on whichever default storage is in use e.g S3, rather than on the root file system always.

dpapathanasiou commented 10 years ago

Either you didn't read the code correctly, or you didn't understand it.

In the current master, you create a temporary zip file, but then set delete to False[1], which goes against the default[2], and thus leaves the file in /tmp, without removing it.

Instead of passing back the data stream, so that temp storage can cleanup the file on its own, all you pass back is the temporary file name, and then you have to do a read from disk again.

So when an http download request happens through zip_export(), you create a zip file in /tmp, read it back into memory from the file system, respond to the http request, but leave it in /tmp because you have no way of handling the post-response condition.

Worse, when you call create_attachments_zipfile() from generate_attachments_zip_export() you create a zip file in /tmp, read the that zip file into memory so that you can make a new file (a copy!) elsewhere in the filesystem[3] and you leave the zip file in /tmp!

You could have just at least moved the file from /tmp, without creating a copy unnecessarily.

Abusing S3 and accumulating superfluous space and bandwidth costs b/c you do not use temp storage properly is not a good idea, either; it's just sweeping the problem under the rug.

Anyway, I am particularly upset since Forhub.org crashed on Monday because of this nonsense: we went from 80 gb free to 0 in a very short time.

My revisions do not leave files in /tmp after they are created, if you look at them carefully.

Also, there is no such thing as a file surviving in /tmp in ubuntu after reboot (which is fortunate, b/c it was the only way we recovered on Monday).

[1] https://github.com/SEL-Columbia/formhub/blob/9062e9dca266d6285e5de91d1f6a40fb5f02cfa4/utils/viewer_tools.py#L201 [2] https://docs.python.org/2/library/tempfile.html#tempfile.NamedTemporaryFile [3] https://github.com/SEL-Columbia/formhub/blob/5e8e176cd19162389ecaaeecc9a5d0bf40299e4d/utils/export_tools.py#L691

prabhasp commented 10 years ago

The filesystem storage vs. non-storage discussion is a bit over my head. Sorry I don't have the time to dig in fully at the moment.

My one comment on the pull request (PR) is to delete this test if it is no longer relevant. It is failing on travis, and seems to be no longer relevant. (There are two others, but they are not related to the functionality in question in the PR).

dpapathanasiou commented 10 years ago

So that test is still valid, though its secondary assertion[1] is technically incorrect.

According to the relevant standard[2], an attached (as opposed to an inlined) file should also suggest the filename in the same string.

So the test failed b/c 'attachment; filename=transportation_2011_07_25.zip' != 'attachment;' but the test assertion should be rewritten.

I also did a code review with @chrisnatali and @csytan who made some good suggestions that I will implement in this branch.

So let me make those updates (along with revising the test) and test on dev before this is merged.

[1] https://github.com/SEL-Columbia/formhub/blob/9062e9dca266d6285e5de91d1f6a40fb5f02cfa4/main/tests/test_form_exports.py#L121 [2] http://tools.ietf.org/html/rfc2183

dpapathanasiou commented 10 years ago

The challenge now is testing this properly; I deployed this branch via fabric to dev.formhub.org but I haven't been able to confirm completely that it would work properly on the regular formhub.org.

Any suggestions?

prabhasp commented 10 years ago

My suggestion is to link it to the s3 account where you cloned ossap photos not too long ago, and trying to do an export on one of the ossap datasets. On Jun 5, 2014 4:45 PM, "Denis Papathanasiou" notifications@github.com wrote:

The challenge now is testing this properly; I deployed this branch via fabric to dev.formhub.org but I haven't been able to confirm completely that it would work properly on the regular formhub.org.

Any suggestions?

Reply to this email directly or view it on GitHub https://github.com/SEL-Columbia/formhub/pull/1326#issuecomment-45272263.

dpapathanasiou commented 10 years ago

That is long gone, though I could spin up that image quickly enough.

Also, TBH, I'm not sure I'm triggering the export correctly through the UI, so let's go over that tomorrow.

dpapathanasiou commented 10 years ago

I'm not sure what you're asking here.

The only point of this branch is to fix the function called create_attachments_zipfile[1] which creates zip files in /tmp and leaves them there, because of how the NamedTemporaryFile is called[2].

Since create_attachments_zipfile is called in two places in the code[3], I merely changed those interfaces to accept the file-like object TemporaryFile returns, rather than the string of the filename.

In doing it this way, we do not leave a file in /tmp after the request is completed.

Now, you seem to be asking me what the logic inside generate_attachments_zip_export is doing.

I should be asking you that.

I have no idea. I'm merely preserving the existing interfaces so that export will continue to do what it's been doing.

A full refactoring (including untangling all the Byzantine logic in export and elsewhere) may not be in the cards.

My only priority in this branch is making sure we never run out of disk space because /tmp filled up with zip files (this actually happened this past Monday).

[1] https://github.com/SEL-Columbia/formhub/blob/master/utils/viewer_tools.py#L199 [2] https://github.com/SEL-Columbia/formhub/blob/master/utils/viewer_tools.py#L201 [3] https://github.com/SEL-Columbia/formhub/blob/a5b6ef7e11b42cbc8e05a2af57054af73374d801/odk_viewer/views.py#L564 https://github.com/SEL-Columbia/formhub/blob/5e8e176cd19162389ecaaeecc9a5d0bf40299e4d/utils/export_tools.py#L679

dpapathanasiou commented 10 years ago

BTW, while I was going about figuring out how to test this pull request properly, I noticed that you use delete=False for NamedTemporaryFile in at least three other places[1],[2],[3].

Only in one of those cases do you cleanup the temporary file correctly[4].

So even the function I just fixed, you're calling attachment.full_filepath[5] on each attachment, which means that even though the temporary zip file will get cleaned up, none of the media files that compose it won't, because if they were originally from S3, they've just been freshly written to /tmp using the delete=False flag!

This is just astounding...

[1] https://github.com/SEL-Columbia/formhub/blob/9062e9dca266d6285e5de91d1f6a40fb5f02cfa4/odk_logger/models/attachment.py#L45 [2] https://github.com/SEL-Columbia/formhub/blob/5e8e176cd19162389ecaaeecc9a5d0bf40299e4d/odk_viewer/models/export.py#L151 [3] https://github.com/SEL-Columbia/formhub/blob/a5b6ef7e11b42cbc8e05a2af57054af73374d801/odk_viewer/views.py#L649 [4] https://github.com/SEL-Columbia/formhub/blob/a5b6ef7e11b42cbc8e05a2af57054af73374d801/odk_viewer/views.py#L655 [5] https://github.com/SEL-Columbia/formhub/blob/9062e9dca266d6285e5de91d1f6a40fb5f02cfa4/utils/viewer_tools.py#L207