bupdb.sh wouldn't run, and it required a PostgreSQL backend. The dump_json and restore_json commands work with Django and its idea of the database without being as bare-metal. See #219 for more.
Django will also have direct access to the S3 bucket should it be needed for backup/restore.
dump_json writes file names that include a date, while import_json expects the file name to lack a date. This inconsistency should be fixed: dump_json should not include a date in the file names, but rather should create a directory named by date and dump all of the files into that directory. A sketch of that idea follows.
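A minimal sketch of the directory-per-date idea (dump_json is our custom command, so the helper below is only an assumption about how it could build its output path):

# Sketch only: assumes dump_json builds its own output directory.
import datetime
import errno
import os

def make_dump_dir(base='backups'):
    """Create (if needed) and return a directory named by today's date."""
    dump_dir = os.path.join(base, datetime.date.today().isoformat())  # e.g. backups/2014-01-08
    try:
        os.makedirs(dump_dir)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
    return dump_dir

# dump_json would then write date-free file names into make_dump_dir(),
# and import_json could be pointed at any one of those directories.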
dump_json does not dump the files for notes, but import_json expects those files in a subdirectory. I'm unsure how these files were gathered in the past, but it appears to have been done out of band.
The only place note_file is referenced, outside of the backups, is where it is commented with "No longer keep a local copy." https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/apps/notes/models.py#L149-L153
It seems like this field might be deprecated by now. It is used in about a quarter of the notes on beta, and the last time a note was uploaded to beta using that field was nearly 5 months ago.
karmaworld=# SELECT count(*) FROM notes_note WHERE note_file != '';
count
-------
246
(1 row)
karmaworld=# SELECT count(*) FROM notes_note;
count
-------
917
(1 row)
karmaworld=# SELECT max(uploaded_at) FROM notes_note WHERE note_file != '';
max
-------------------------------
2013-08-01 01:33:17.498786+00
(1 row)
There is no longer local file storage; it's all FilePicker uploads.
There is a question as to whether we want to reach out to S3 and grab all those files (ideally using something akin to rsync to prevent redundant downloads). The database URLs are all FilePicker links which (I'm guessing) are tied to a very specific account. If someone were using a different S3 and/or a different FilePicker API key, then all the files would have to be uploaded again and all URLs would have to be updated.
I was just going to raise this issue. We ought to use AWS Glacier for backups. It's so cheap and reliable. In all likelihood, anyone who forks karma notes will do likewise.
Let's discuss.
Using Glacier to back up S3 is trivial, and I think that is a good idea.
I'll look into using Glacier to back up our database information. Possibly we might want to create database dumps (as JSON or otherwise) which are pushed to S3. Then Glacier can trivially back up our database, too. -Bryan
Introspection 101.
introspect_models(karmaworld.apps)
Found app models module in karmaworld.apps.courses.models
Found DB model Course
Found DB model Department
Found DB model Professor
Found DB model ProfessorAffiliation
Found DB model ProfessorTaught
Found DB model School
Found app models module in karmaworld.apps.document_upload.models
Found DB model Document
Found DB model Note
Found DB model RawDocument
Found app models module in karmaworld.apps.licenses.models
Found DB model License
Found app models module in karmaworld.apps.notes.models
Found app models module in karmaworld.apps.users.models
Found DB model KarmaUser
No more hard coding which models should be dumped.
Next step: introspect each model for fields to dump and fields to skip.
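For reference, a rough guess at what the introspect_models() helper could look like. introspect_models is custom code for this ticket, so the function below is an assumption about its shape, not the actual implementation:

# Sketch of package-walking model discovery; not the real introspect_models().
import importlib
import inspect
import pkgutil

from django.db import models

def introspect_models(package):
    """Import every *.models module under a package and return its Model classes."""
    found = []
    prefix = package.__name__ + '.'
    for _, name, _ in pkgutil.walk_packages(package.__path__, prefix):
        if not name.endswith('.models'):
            continue
        module = importlib.import_module(name)
        print('Found app models module in %s' % name)
        for attr in vars(module).values():
            if (inspect.isclass(attr) and issubclass(attr, models.Model)
                    and attr is not models.Model and not attr._meta.abstract):
                print('Found DB model %s' % attr.__name__)
                found.append(attr)
    return found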
That's fun. DETAIL: Failed on request of size 19584437.
Going to have to limit how I pull data from the database with LIMIT and OFFSET. Luckily Django implements these through the usual array-slice syntax on querysets.
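For example (just a sketch), slicing a queryset compiles down to LIMIT/OFFSET, so the uptake can be chunked like this:

# Chunked reads via queryset slicing; Django turns [offset:offset + CHUNK]
# into SQL LIMIT/OFFSET under the hood.
from karmaworld.apps.notes.models import Note

CHUNK = 100
offset = 0
while True:
    chunk = list(Note.objects.order_by('pk')[offset:offset + CHUNK])
    if not chunk:
        break
    # ... serialize/dump this chunk ...
    offset += CHUNK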
oh hey look, Model.serializable_value. Should make things easier if it works. Big if.
Serialization dumps the primary key id. This would be fine for a blank database. However, sometimes backups are partial restores (or, in the case of #68, used to migrate data from one system to another). In these cases, it is possible to get primary key collisions between data already in the database and data in the files being restored. Additionally, foreign keys are going to assume the primary keys of the original database.
I was originally thinking we could simply drop the PKs from the restored data and generate new PKs from the auto-incrementing index. The problem is that all FKs in the restored data would become invalid.
A quick chat with @AndrewMagliozzi suggests incrementing the PK/FK integers that are restored. Andrew suggested we increment each primary key of the restore by the max index presently in the database. This could work, but it will be difficult to likewise update the foreign keys, and then we'd have to update the auto-incrementing sequence so it knows the new highest value to use for the next new row.
I'm thinking we can increase all integer keys (both PK and FK) by some large number like a million in the restored files, which should prevent collisions with existing data. It would also be a long time until the auto-incrementing index collides with the very large indices.
Andrew wanted to get your opinion, @charlesconnell , to see if you've got any other ideas.
This kind of thing is solved in sharding situations by adding a field to all tables for the originating system. It is sort of like adding a namespace to all indexes, no matter what they are. Usually each system generates some kind of unique hash (e.g. UUID1 https://en.wikipedia.org/wiki/UUID#Version_1_.28MAC_address.29) which becomes the namespace for all its indices. It can import and export namespaces of other systems, but generally prevents collision. The namespace can also be used in cases where the database goes down, comes up later, and has "forked" between the backup and the data added prior to a restore of the backup. Each time the database goes down and comes up, the system's hash would be new.
Instead of adding "one million" arbitrarily to each PK/FK, perhaps adding the date will work better as a distinct, large number: (YYYYMMDD0000 + key), e.g. (201401080000 + 4059) = 201401084059. Won't collide for a very long time, allows distinct restores for up to 10,000 (0000 thru 9999) records, and can be performed once each day.
Presently we have hundreds of records in the largest table, so allowing for 10,000 should be good enough for some time.
floor(time.time()/3600) gives hourly resolution with only 6 digits (SSSSSS) instead of 8 (YYYYMMDD).
Ack, this incrementing thing won't work well either.
For example, License.name is really what makes a License unique (which means we should add that constraint if it's not there already). If the License already exists (e.g. 'cc-by-nc-3.0'), either we'd add a second instance of the same thing with a huge PK id, or we'd have to FK-link dependencies (like Note, which FK references License) by name and not by id.
That might not be a bad idea, but it means we need to write out what is distinct/unique about the various models and encode that in each model somehow (such as in the Meta class). That way, the license dumped would not be {'license_id': 1} but something like {'license_id': 'cc-by-nc-3.0'}. Then restoring the data could search against what is unique. This is inefficient, but it would work.
Alternatively, but similarly, the restore could be done in a fresh database. Then a script could be created to identify common traits (e.g. License.name 'cc-by-nc-3.0' in both databases), link them together, and then import pre-existing data intelligently with the correct FK information.
Woop. "Natural keys" in Django's serialization junk is exactly the naming thing. https://docs.djangoproject.com/en/1.5/topics/serialization/#natural-keys "This isn’t a particularly natural way to refer to an author. It requires that you know the primary key value for the author; it also requires that this primary key value is stable and predictable.
However, if we add natural key handling to Person, the fixture becomes much more humane."
It's been done already. Great minds think alike!
Gotta write get_by_natural_key() and natural_key() into each model.
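For License, the pair of hooks would look roughly like this (the field definitions below are guesses at the real model; only the natural key pattern itself is the point):

# Sketch of natural key support for License; assumes License.name is unique.
from django.db import models

class LicenseManager(models.Manager):
    def get_by_natural_key(self, name):
        return self.get(name=name)

class License(models.Model):
    name = models.CharField(max_length=160, unique=True)
    html = models.TextField()

    objects = LicenseManager()

    def natural_key(self):
        return (self.name,)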
Sweet
this will probably be useful for import. http://stackoverflow.com/questions/5940294/import-data-into-django-model-with-existing-data
Also handy, Django already provides a way to dump data from all models. https://docs.djangoproject.com/en/1.5/ref/django-admin/#django-admin-dumpdata
Guess all my introspection magic might not have been worthwhile, but it'll be nice to clean up some of this backup stuff in favor of Django built-ins.
Unfortunately I'm seeing a bunch of places where we need unique constraints and some places where an existing unique constraint no longer makes sense.
The unique constraints go along with the natural_key stuff, since the natural_key should be what is unique about a model.
Database integrity: makes or breaks the world. Each course name should be unique with respect to its department. Unfortunately, there are 4 courses (out of 94) which each have two rows defined, breaking this potential constraint. Gotta clean up my database.
                     name                      | department_id | num
-----------------------------------------------+---------------+-----
 Statistical Learning Theory and Applications  |             3 |   2
 Introduction to Algorithms                    |             4 |   2
 Computational Cognitive Science               |             3 |   2
 Elements of Software Construction             |             4 |   2
Thankfully Beta is clean. There is no department (so it will always be null), but each course name on Beta occurs just one time. So this constraint will fit in there just fine.
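For the record, the duplicates above can also be found from the ORM side with something like this (a sketch; the table itself presumably came from a raw GROUP BY/HAVING query):

# Sketch: find (name, department) pairs with more than one Course row.
from django.db.models import Count

from karmaworld.apps.courses.models import Course

dupes = (Course.objects
         .values('name', 'department_id')
         .annotate(num=Count('pk'))
         .filter(num__gt=1))

for row in dupes:
    print(row)  # e.g. {'name': ..., 'department_id': ..., 'num': 2}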
My database is clean, I ran the migrations with all the unique mods, and restarted the whole VM server using the models retrofitted with crazy amounts of natural_key support. Took forever to type it all up, but it looks like I didn't make any horrible typos because the server runs.
Now to test if dumping uses the natural_keys appropriately. Cross fingers!
[{"pk": 1, "model": "courses.professortaught", "fields": {"course": ["Introduction to Psychology", ["Brain and Cognitive Sciences", [121415]]], "professor": ["", null]}}]
So this is ugly, but it's super nested and there are no foreign key numbers.
ProfessorTaught is a relation that uniquely defines a (course, professor) tuple, both of which are foreign to the relation. In the above JSON, Course is uniquely defined by (name, department). It just so happens Department is a foreign key defined by (name, school), and School is a foreign key defined by (USDE_id). This can be seen nested inside the Course description. Professor is defined by (name, email), and in this case happens to be pointing at the empty professor found to be teaching 45 of the 90 courses.
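The nesting falls out of each natural_key() calling into the next model's natural_key(). Roughly (field names and types here are guesses at the real models; only the chaining is the point):

# Sketch of the chained natural keys that produce the nested JSON above.
from django.db import models

class School(models.Model):
    usde_id = models.IntegerField(unique=True)

    def natural_key(self):
        return (self.usde_id,)

class Department(models.Model):
    name = models.CharField(max_length=255)
    school = models.ForeignKey(School)

    def natural_key(self):
        return (self.name, self.school.natural_key())

class Course(models.Model):
    name = models.CharField(max_length=255)
    department = models.ForeignKey(Department, blank=True, null=True)

    def natural_key(self):
        return (self.name, self.department.natural_key())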
YAY!
This was with my own dump code that dumps only a single entry from each model. Now I should try to dump using Django's dumpdata thingy, then create a blank database and see if I can restore it on my VM.
This is all nice for foreign keys, but the primary key id is still in each entry. In my reading, it seemed like the primary key will be whatever it needs to be, but it bothers me that it is sitting there in the JSON and that the dumpdata command has no particular flags about PKs. The --natural seems to discuss FKs only.
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py dumpdata --format=json --natural
Killed
great.
Small scale works. Apparently it dumps to STDOUT. I was wondering what file it arbitrarily picked, but I guess that makes sense.
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py dumpdata licenses --format=json --natural
[{"pk": 1, "model": "licenses.license", "fields": {"html": "<a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc/4.0/\"><img alt=\"Creative Commons License\" style=\"border-width:0\" src=\"http://i.creativecommons.org/l/by-nc/4.0/88x31.png\" /></a>", "name": "cc-by-nc-3.0"}}]
Looks like notes are the problem. Too much data for my little VM. I might have to make it bigger.
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py dumpdata --format=json --exclude=notes --natural | gzip > dbdump.json.gz
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ ls -lh
...
-rw-rw-r-- 1 vagrant vagrant 336K Jan 9 05:58 dbdump.json.gz
...
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py dumpdata notes --format=json --natural | gzip > dbdumpnotes.json.gz
Traceback (most recent call last):
...
django.db.utils.DatabaseError: out of memory
DETAIL: Failed on request of size 16221250.
This is going to be a problem for any system. I just spiked my VM up to 5 GiB and still getting memory errors when dumping notes.
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py dumpdata notes --format=json --natural | gzip > dbdumpnotes.json.gz
Traceback (most recent call last):
...
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 832, in <lambda>
result = iter((lambda: cursor.fetchmany(GET_ITERATOR_CHUNK_SIZE)),
File "/var/www/karmaworld/venv/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
MemoryError
I've been reading up on a ticket re: Django's dumpdata consuming too much memory: https://code.djangoproject.com/ticket/5423
The ticket was closed 20 months ago, and many commenters noted an improvement in PostgreSQL performance. From the patch tags, it looks like this patch was pulled into Django 1.5 (we're running Django 1.4.x): https://github.com/django/django/commit/3b5083bee5e96539dec599106aece9889e70ce05
I'm not sure if updating to Django 1.5 would break a bunch of things, but it'd fix at least one thing. From the docs, the current Django dev version, which appears to be 1.6.x, is considerably different from what we're currently running.
weird. I updated to Django 1.5 and ran the command. It ran for a while, then gave a weird error, but managed to dump 223 MiB worth of gzipped JSON (317 MiB of JSON after decompressing).
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py dumpdata --format=json --natural | gzip > dbdumpnotes.json.gz
CommandError: Unable to serialize database:
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ ls -lh
...
-rw-rw-r-- 1 vagrant vagrant 223M Jan 12 08:43 dbdumpnotes.json.gz
...
The error is very cryptic in the sense that it doesn't say a darned thing :(
Figured I'd try doing a restore anyway to see how much got dumped, but ran into a problem referenced in the ticket link above.
How to restore dumps: https://docs.djangoproject.com/en/1.5/ref/django-admin/#loaddata-fixture-fixture
Well, this was about as successful as the dump. Non-error error.
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py loaddata dbdumpnotes.json
DeserializationError: Problem installing fixture 'dbdumpnotes.json':
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$
Unlike the file dump, which actually dumped quite a lot, nothing at all got imported. No schools, no courses, no notes.
No error.
There is a --traceback switch for debugging dumpdata.
http://stackoverflow.com/questions/6680283/troubleshooting-error-unable-to-serialize-database-when-performing-dumpdata
Same 317 MiB JSON dump, so probably same error as before. Error is MemoryError. Well darn. I'm running 3.4 GiB (not the 5 GiB I had thought, but still).
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py dumpdata --format=json --natural --traceback > db.json
Traceback (most recent call last):
...
File "/var/www/karmaworld/venv/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
MemoryError
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ head -2 /proc/meminfo
MemTotal: 3616252 kB
MemFree: 3272980 kB
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ ls -lh
-rw-rw-r-- 1 vagrant vagrant 317M Jan 13 03:02 db.json
Might need to write some custom serializer for handling notes if I can't figure this out :(
this can't hurt, but I don't see how it would help (it's a prettifier for JSON serialization): https://djangosnippets.org/snippets/2916/
Then again, the above is labeled as helping with UTF-8, and the OOM error occurs during UTF-8 handling according to the traceback. Worth a shot.
Quickly tested XML and YAML to make sure it wasn't somehow the serializer output choice causing the OOM. XML and YAML both hit MemoryError before completing.
The UTF-8 fix had some promise because it'd encode UTF-8 characters in a couple of bytes instead of 6-byte ASCII \uXXXX escapes. But alas, still a MemoryError in exactly the same place. In fact, the error was in codecs.utf_8_decode(input, errors, True), which makes me wonder if it was still trying to convert UTF-8 to ASCII.
Might still be some memory problems in Django ORM from this: https://code.djangoproject.com/ticket/16614
Attempts to more directly dump data via python manage.py shell have failed:
from django.core import serializers
from karmaworld.apps.notes.models import Note
with open('dbnotes.json', 'w') as out:
    serializers.serialize("json", Note.objects.iterator(), stream=out)
Traceback (most recent call last):
File "<console>", line 2, in <module>
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/serializers/__init__.py", line 99, in serialize
s.serialize(queryset, **options)
...
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 58, in execute
six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 54, in execute
return self.cursor.execute(query, args)
DatabaseError: out of memory for query result
This is different. This out of memory is coming from the database itself.
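One possible workaround (a sketch, not something that was actually run here) would be to never ask the database for all of the notes at once, slicing by primary key instead:

# Sketch: serialize notes in small PK batches so no single query result is huge.
from django.core import serializers

from karmaworld.apps.notes.models import Note

CHUNK = 25
pks = list(Note.objects.values_list('pk', flat=True).order_by('pk'))
for i, start in enumerate(range(0, len(pks), CHUNK)):
    batch = Note.objects.filter(pk__in=pks[start:start + CHUNK])
    # Each serialize() call produces a complete JSON list, so write one file
    # per batch rather than concatenating them into one invalid JSON document.
    with open('dbnotes.%04d.json' % i, 'w') as out:
        serializers.serialize('json', batch, stream=out, use_natural_keys=True)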
How bad is the database? Should we really still be keeping text and html in there? (probably not)
karmaworld=> SELECT max(length(html)) FROM notes_note;
max
----------
20465234
(1 row)
karmaworld=> SELECT max(length(text)) FROM notes_note;
max
-------
34141
(1 row)
The largest text field is 33 KiB. The largest html field is 19.5 MiB. How much of this data is there in the database anyway?
karmaworld=> SELECT sum(length(text)) FROM notes_note;
sum
---------
3112097
(1 row)
karmaworld=> SELECT sum(length(html)) FROM notes_note;
sum
------------
1482215481
(1 row)
In my VM, which has barely begun to scratch the surface of the OCW notes, there are 2.96 MiB of text and 1.38 GiB of HTML.
I think it's safe to say the problem is not the backup code or Django, but rather our abuse of the relational database for storing textual document data.
much words very mongo wow
So everything looks much better now.
Let's start testing [dump|load]data.
Write some code to push the dumped data onto S3.
Glacier!
Also we should automate via beat/celery/whatever.
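A rough sketch of the automation piece, assuming celery is already wired up and using boto for the S3 push. The bucket name, key layout, and schedule below are placeholders rather than anything configured in the repo:

# Sketch: nightly dumpdata pushed to S3 (Glacier can then archive the bucket
# via a lifecycle rule). Bucket name and schedule are placeholders.
import datetime
import gzip
import os
import tempfile

import boto
from celery.schedules import crontab
from celery.task import periodic_task
from django.core.management import call_command

@periodic_task(run_every=crontab(hour=3, minute=0))
def backup_database_to_s3():
    fd, path = tempfile.mkstemp(suffix='.json.gz')
    os.close(fd)
    try:
        with gzip.open(path, 'wb') as out:
            # contenttypes excluded to avoid clashes when restoring elsewhere
            call_command('dumpdata', format='json', use_natural_keys=True,
                         exclude=['contenttypes'], stdout=out)
        key_name = 'backups/%s.dumpdata.json.gz' % datetime.date.today().isoformat()
        bucket = boto.connect_s3().get_bucket('karmaworld-backups')  # placeholder bucket
        bucket.new_key(key_name).set_contents_from_filename(path)
    finally:
        os.remove(path)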
Tried to dumpdata from Production, but ran into a NoneType error. Managed to dump 4.4 MiB of JSON. I'm not sure what model was yielding None to break the dump.
(venv)djkarma@ip-10-28-106-115:/var/www/karmaworld$ python manage.py dumpdata --format=json --natural > ~/20140122.dumpdata.json
CommandError: Unable to serialize database: 'NoneType' object has no attribute 'natural_key'
-rw-rw-r-- 1 djkarma djkarma 4.4M Jan 23 06:33 20140122.dumpdata.json
Ran the command again with --traceback.
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/serializers/python.py", line 57, in handle_fk_field
value = related.natural_key()
File "/var/www/karmaworld/karmaworld/apps/courses/models.py", line 243, in natural_key
return (self.name, self.department.natural_key())
AttributeError: 'NoneType' object has no attribute 'natural_key'
Looks like a Course has a null department, which makes sense because courses added with the form still add a School but not a Department. I guess that natural key for Course is still premature :(
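If we wanted to keep the natural key while tolerating the missing department, one option (sketch only; field definitions are guesses) would be to make the key null-aware:

# Sketch: a Course.natural_key() that doesn't explode on a null department.
from django.db import models

class Course(models.Model):
    name = models.CharField(max_length=255)
    department = models.ForeignKey('courses.Department', blank=True, null=True)

    def natural_key(self):
        if self.department_id is None:
            # Courses added through the form have a School but no Department yet.
            return (self.name, None)
        return (self.name, self.department.natural_key())

get_by_natural_key() on the manager would need matching None handling, of course.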
While testing #307, I needed some data in the models. The courses app had everything I needed, plus schools which I already had. So I dumped it from prod and restored it on my VM. Instead of being clever about not adding schools that were already present, it collided:
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py loaddata courses.datadump.json
IntegrityError: Problem installing fixture 'courses.datadump.json': Could not load courses.School(pk=1): duplicate key value violates unique constraint "courses_school_usde_id_uniq"
DETAIL: Key (usde_id)=(121150) already exists.
There does not appear to be a way for loaddata to ignore existing keys. Barf. Not nice for merging data fo sho. https://docs.djangoproject.com/en/dev/ref/django-admin/#django-admin-loaddata
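loaddata itself won't skip rows, but a manual pass through the deserializer could. A sketch, untested; the usde_id lookup assumes that's the School field name shown in the error above, and PK disagreements between the two databases would still need handling:

# Sketch: load a fixture but skip School rows that already exist by usde_id.
from django.core import serializers

from karmaworld.apps.courses.models import School

with open('courses.datadump.json') as fixture:
    for deserialized in serializers.deserialize('json', fixture):
        obj = deserialized.object
        if isinstance(obj, School) and \
                School.objects.filter(usde_id=obj.usde_id).exists():
            continue  # already have this school; note PKs may still disagree
        deserialized.save()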
Cleared the VM database to remove collisions; now I'm getting this error. I'd guess it's from trying to deserialize a natural_key, but that is supposed to be smart. The backtrace isn't very helpful in determining the actual problem.
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py loaddata courses.datadump.json --traceback
Traceback (most recent call last):
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/management/base.py", line 222, in run_from_argv
self.execute(*args, **options.__dict__)
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/management/base.py", line 255, in execute
output = self.handle(*args, **options)
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/management/commands/loaddata.py", line 187, in handle
for obj in objects:
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/serializers/json.py", line 75, in Deserializer
raise DeserializationError(e)
DeserializationError: Problem installing fixture 'courses.datadump.json': int() argument must be a string or a number, not 'list'
Oh the problem is likely due to Course using school as a FK, which OCW import ignores. Related to #294. If that isn't the problem, it will be a problem.
This transition from Course/School to Course/Department/School should have been made much more rigid :(
This ticket was relevant for live restores wherein we'd kick on the service possibly before the restore had completed, to minimize downtime.
At this point, it seems like we're relying on postgres out of Heroku rather than Django's South interface. We can't do much to minimize downtime in booting/restoring a Heroku app, and even if we could, I don't think this ticket would apply to that situation.
I vote this ticket is no longer relevant.
Agreed.
Re-visit BUP / S3 backups. Evaluation will be to see how backups are created and how we can simplify them (perhaps including .json dumps). Other ideas are on the table.