Open mz8i opened 3 years ago
@polly64 could you please comment on whether it is worth having the whole history of extracts available for download from the site, or just the most recent state of the database?
@mz8i I think this depends on how easy this is. If it is easier to build this in early on, then it's worth it. If it can be done later, then I would work on other priorities first and just allow the current stuff - but also chat to Tom
@mz8i there's some interest in having the history for ease of reconstructing the history of the project, but no need for it to be daily. I suggest:
To keep extract size down, we could also reduce the edit log extract to cover only the last month of edits.
@polly64 @tomalrussell thank you for the comments. This is not an immediate priority but needs to be tackled in the coming months.
I think keeping all first-of-the-month extracts, and then the latest one, is a good middle ground.
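A minimal sketch of that retention rule, not the project's actual implementation. It assumes each extract can be mapped to the date it was generated, and it reads "first-of-the-month" as "the earliest extract in each calendar month", so that it still works when extracts are not produced daily. The function name `extracts_to_keep` is hypothetical.

```python
from datetime import date


def extracts_to_keep(extract_dates):
    """Return the extract dates to retain, sorted ascending.

    Keeps the earliest extract of every calendar month plus the most
    recent extract overall. Everything else is a candidate for deletion.
    """
    dates = sorted(set(extract_dates))
    first_in_month = {}
    for d in dates:
        # setdefault only stores the first (earliest) date seen per month
        first_in_month.setdefault((d.year, d.month), d)
    keep = set(first_in_month.values())
    if dates:
        keep.add(dates[-1])  # always keep the latest extract
    return sorted(keep)
```

Any date not returned by the function would be safe to remove under this policy; how dates are derived from file names is left open here.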
We could also split the snapshot and the edit history into separate extracts. Could keep history of snapshots as above, and also have the full, most recent, edit history available for download (any slice of the history can easily be extracted from that single file).
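As a rough illustration of taking a slice from a single full edit history file, here is a sketch that filters rows by edit date. It assumes the timestamp is in the second column and starts with an ISO date (`YYYY-MM-DD`), as in PostgreSQL CSV output; the column position, the header flag and the function name `slice_edit_history` are assumptions, not taken from the extract script.

```python
import csv
from datetime import date


def slice_edit_history(in_path, out_path, start, end,
                       timestamp_col=1, has_header=False):
    """Copy rows whose edit date is within [start, end] (inclusive).

    Returns the number of data rows written to out_path.
    """
    kept = 0
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        if has_header:
            header = next(reader, None)
            if header is not None:
                writer.writerow(header)
        for row in reader:
            if not row:
                continue  # skip blank lines
            # Only the leading YYYY-MM-DD part of the timestamp is used
            edit_date = date.fromisoformat(row[timestamp_col][:10])
            if start <= edit_date <= end:
                writer.writerow(row)
                kept += 1
    return kept
```

For example, `slice_edit_history("edit_history.csv", "jan_2021.csv", date(2021, 1, 1), date(2021, 1, 31))` would write only the January 2021 edits.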
ok great
@tomalrussell I just noticed a slightly suspicious thing related to this - looking at the bulk extracts from the previous server, the size of the zip file stopped growing around June 2020 when it reached ~2.1GB. It seems unlikely that there was so little change since then, so I'm wondering whether we're running into some bug in the Python extract script, similar to this: https://bugs.python.org/issue24658 (although this was on OS X only). Will investigate, but any ideas/suggestions welcome.
Couple of questions to check:
7676128,2021-01-31 07:52:17+00,2407428,"{""date_lower"": 2013, ""date_upper"": 2012, ""facade_year"": 2013}","{""date_lower"": null, ""date_upper"": null, ""facade_year"": null}",AV64
, from a couple of weeks earlier, but more recent than June 2020. Were there any other entries between 31 Jan and 15 Feb? Adjacency / configuration: Mid-Terrace
- do they come up if you run the extract script again?
Quick sense-check would also be to count lines in the edit_history - the one from 15 Feb gives:
wc -l edit_history.csv
7111113 edit_history.csv
The script uses postgres COPY TO CSV to write the initial extract CSV, so Python's not the problem there. We do use Python to write files into the zip - could fairly simply switch to a shell call to zip? But I'd try and check if there's actually a problem first.
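One way to check whether there is actually a problem is to inspect an existing zip directly: compare the stored sizes and count lines in the edit history without unpacking it. The sketch below uses only Python's standard `zipfile` module; the member name `edit_history.csv` at the root of the archive and the function name `summarise_extract` are assumptions. Note that 2.1GB is close to 2**31 bytes (2,147,483,648), so a value stuck near that figure would hint at a 32-bit size limit somewhere in the pipeline.

```python
import zipfile


def summarise_extract(zip_path, member="edit_history.csv"):
    """Report sizes and line count for one member of an extract zip."""
    with zipfile.ZipFile(zip_path) as zf:
        info = zf.getinfo(member)  # raises KeyError if the member is missing
        with zf.open(member) as f:
            # Counts physical lines, like `wc -l`, so quoted CSV fields
            # containing newlines are counted more than once
            line_count = sum(1 for _ in f)
    return {
        "uncompressed_bytes": info.file_size,
        "compressed_bytes": info.compress_size,
        "lines": line_count,
        "near_2gib": abs(info.file_size - 2**31) < 64 * 1024 * 1024,
    }
```

Running this over the backed-up extracts in date order should show whether the line count keeps growing after June 2020 even if the zip size does not.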
Extracting is not yet automated on the new server so the lack of more recent extracts is actually expected (we have the backed up earlier extracts, though). I will look at the contents of the extracts we have to see whether there is any problem at all.
I have now set the bulk extract frequency to weekly; it will run every Monday morning. The backups and extracts are now scheduled on the new server. Leaving this open for any potential work on the changes to edit history exporting.
I attached a 256GB data disk to the live server to prevent it from running out of disk space for now. Longer term, I think we should switch to hosting the data downloads on a cloud service like Azure Blob Storage, and only store the software (and potentially map tile cache) on the server VM.
@polly64 @mz8i @tomalrussell @popcorndoublefeature
Thanks for the discussion. Just to add, there is also the opportunity to upload specific Colouring Cities datasets to other scientific platforms to offer them persistently with a DOI. We did this for Colouring Dresden so that the dataset can be cited:
https://zenodo.org/records/10653065 Crowd-sourced collected building attributes of the Colouring Dresden project (from 06 March 2023 to 01 October 2023)
This could also help to avoid losing old datasets while reducing server disk usage.
@traveller195 that sounds really interesting, let's talk about it at the next technical meeting @oriolgavalda
Describe the solution Develop a mechanism for limiting the number of data extract files stored on the server and available for download. With the current rate of growth, the production server will run out of disk space in less than 3 months.
Describe any alternatives
Additional context Initially the extracts were generated once a week, but have been generated every day since September 2020.