Limiting server disk space used by data extract downloads

mz8i commented 3 years ago

Describe the solution

Develop a mechanism for limiting the number of data extract files stored on the server and available for download. With the current rate of growth, the production server will run out of disk space in less than 3 months.

Describe any alternatives

Upload the extracts to cloud storage with cheap disk space and serve the downloads from there, so they do not use up production server disk space.
Don't keep anything older than the latest extract. The extracts are a separate mechanism from backups, and the value to users of having a daily history of extracts available for download is not entirely clear.

Additional context

Initially the extracts were generated once a week, but have been generated every day since September 2020.
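To put rough numbers on the "less than 3 months" estimate, here is a minimal sketch of the arithmetic. It assumes every new extract is about the same size and that nothing is ever deleted. The function name and parameters are illustrative only and are not part of the existing extract script.

```python
def days_until_full(free_bytes: int, extract_bytes: int,
                    extracts_per_day: float = 1.0) -> float:
    """Estimate the days left before the disk fills up.

    Assumes a constant extract size and no clean-up of old extracts,
    so this is an optimistic bound if extracts keep growing.
    """
    if extract_bytes <= 0 or extracts_per_day <= 0:
        raise ValueError("extract_bytes and extracts_per_day must be positive")
    return free_bytes / (extract_bytes * extracts_per_day)
```

The free space can be read with `shutil.disk_usage(path).free`. Passing `extracts_per_day=1/7` gives the equivalent estimate for a weekly schedule.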

mz8i commented 3 years ago

@polly64 could you please comment on whether it is worth having the whole history of extracts available for download from the site, or just the most recent state of the database?

polly64 commented 3 years ago

@mz8i I think this depends on how easy it is. If it is easier to build this in early on, then it is worth it. If it can be done later, then I would work on other priorities first and just allow the current stuff. But also chat to Tom.

tomalrussell commented 3 years ago

@mz8i there's some interest in having the history for ease of reconstructing the history of the project, but no need for it to be daily. I suggest:

run "latest" extract nightly
keep monthly extracts over the full project history

To keep extract size down, we could also reduce the edit log extract to cover only the last month of edits.

mz8i commented 3 years ago

@polly64 @tomalrussell thank you for the comments. This is not an immediate priority but needs to be tackled in the coming months.

I think keeping all first-of-the-month extracts, and then the latest one, is a good middle ground.
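A minimal sketch of that retention rule, assuming each extract can be mapped to the date it was generated. How the dates are recovered from the files on the server is left out, since the file naming is not described here. The function name is hypothetical.

```python
from datetime import date
from typing import Iterable, Set


def extracts_to_keep(extract_dates: Iterable[date]) -> Set[date]:
    """Return the dates to retain: the earliest extract of each month plus the latest one."""
    dates = sorted(set(extract_dates))
    if not dates:
        return set()

    keep = {dates[-1]}  # always keep the most recent extract
    seen_months = set()
    for d in dates:
        month = (d.year, d.month)
        if month not in seen_months:
            # dates are sorted, so the first one seen is the earliest in its month
            seen_months.add(month)
            keep.add(d)
    return keep
```

The sketch keeps the earliest extract in each month rather than strictly the one dated the 1st, so a month is still represented if no extract ran on the 1st (for example on a weekly schedule). Everything not in the returned set would be a candidate for deletion.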

We could also split the snapshot and the edit history into separate extracts. We could keep the history of snapshots as above, and also have the full, most recent edit history available for download (any slice of the history can easily be cut from that single file).

polly64 commented 3 years ago

ok great

mz8i commented 3 years ago

@tomalrussell I just noticed a slightly suspicious thing related to this - looking at the bulk extracts from the previous server, the size of the zip file stopped growing around June 2020 when it reached ~2.1GB. It seems unlikely that there was so little change since then, so I'm wondering whether we're running into some bug in the Python extract script, similar to this: https://bugs.python.org/issue24658 (although this was on OS X only). Will investigate, but any ideas/suggestions welcome.

tomalrussell commented 3 years ago

Couple of questions to check:

latest extract linked from https://colouring.london/data-extracts.html is from 15 Feb 2021 - are there any more recent extracts on the server that should be listed?
latest entry in the edit history in that extract is 7676128,2021-01-31 07:52:17+00,2407428,"{""date_lower"": 2013, ""date_upper"": 2012, ""facade_year"": 2013}","{""date_lower"": null, ""date_upper"": null, ""facade_year"": null}",AV64, from a couple of weeks earlier, but more recent than June 2020. Were there any other entries between 31 Jan and 15 Feb?
latest edits as of now are a bunch of Adjacency / configuration: Mid-Terrace - do they come up if you run the extract script again?

Quick sense-check would also be to count lines in the edit_history - the one from 15 Feb gives:

wc -l edit_history.csv 
7111113 edit_history.csv

The script uses postgres COPY TO CSV to write the initial extract CSV, so Python's not the problem there. We do use Python to write files into the zip - could fairly simply switch to a shell call to zip? But I'd try and check if there's actually a problem first.
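One way to check whether there is actually a problem is to inspect the zip itself. This is a minimal sketch using Python's standard `zipfile` module. It assumes the archive contains a member named `edit_history.csv` at its top level, which may not match the real layout. The function name is hypothetical.

```python
import zipfile


def check_extract(zip_path, member="edit_history.csv"):
    """Return (first_corrupt_member_or_None, number_of_lines_in_member)."""
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() reads every member and checks its CRC;
        # it returns the name of the first bad file, or None if all are fine.
        bad = zf.testzip()
        # Stream the member line by line so the CSV is never fully loaded in memory.
        with zf.open(member) as f:
            lines = sum(1 for _ in f)
    return bad, lines
```

The line count should match `wc -l` on the extracted file whenever the file ends with a newline. Comparing the count across extracts from different dates would show whether the edit history really stopped growing or only the zip size did.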

mz8i commented 3 years ago

Extracting is not yet automated on the new server so the lack of more recent extracts is actually expected (we have the backed up earlier extracts, though). I will look at the contents of the extracts we have to see whether there is any problem at all.

mz8i commented 3 years ago

I have now set the bulk extract frequency to weekly; it will run every Monday morning. The backups and extracts are now scheduled on the new server. Leaving this open for any potential work on the changes to edit history exporting.

mz8i commented 3 years ago

I attached a 256GB data disk to the live server to prevent it from running out of disk space for now. Longer term, I think we should switch to hosting the data downloads on a cloud service like Azure Blob Storage, and only store the software (and potentially the map tile cache) on the server VM.

traveller195 commented 7 months ago

@polly64 @mz8i @tomalrussell @popcorndoublefeature

Thanks for the discussion. Just to add, there is also the opportunity to upload specific Colouring Cities datasets to other scientific platforms, making them persistently available with a DOI. We did this for Colouring Dresden so that it can be cited:

https://zenodo.org/records/10653065 Crowd-sourced collected building attributes of the Colouring Dresden project (from 06 March 2023 to 01 October 2023)

This could also help avoid losing old datasets when they are removed to reduce server disk usage.

polly64 commented 7 months ago

@traveller195 that sounds really interesting, let's talk about it at the next technical meeting @oriolgavalda

colouring-cities / colouring-core

Limiting server disk space used by data extract downloads #641