Signbank / Global-signbank

An online sign dictionary and sign database management system for research purposes. Originally developed by Steve Cassidy. This repo is a fork for the Dutch version, previously called 'NGT-Signbank'.
http://signbank.cls.ru.nl
BSD 3-Clause "New" or "Revised" License

Move to MySQL #673

Open Woseseltops opened 4 years ago

Woseseltops commented 4 years ago

In this issue I want to investigate what needs to happen for us to move to MySQL, so we can get rid of #441. This is what I could come up with.

Phase 1: preparation

  1. Request the database at C&CZ

Phase 2: migration

  1. Turn off Signbank
  2. Dump the data
  3. Change the settings to use the new database, and link it correctly
  4. Run migrate so the MySQL database has the correct structure
  5. Load the data
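For concreteness, steps 2-5 could look roughly like this on the command line (a sketch assuming the project's develop.py management script and a dump.json file name; not the final commands):

python develop.py dumpdata --indent 2 > dump.json   # 2. dump the current SQLite contents as a fixture
# 3. point DATABASES in the settings at the new MySQL database
python develop.py migrate                           # 4. create the table structure in MySQL
python develop.py loaddata dump.json                # 5. load the dumped data into MySQL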

Expected problems:

@susanodd and @vanlummelhuizen , I really need your help here :)

susanodd commented 4 years ago

Thumbs up!

Woseseltops commented 4 years ago

Can you think of anything that will go wrong @susanodd ?

susanodd commented 4 years ago

Recall we've had problems getting the test database to work. There were some essential tables that needed to remain non-empty, and others that connect objects in different tables. Perhaps first try exporting, to see what the SQL creation commands that will refill the tables look like? There was also a lot of Django-specific stuff that had accumulated; recall that you discovered this when we first started testing and deleted gigabytes of junk. The migrations have very old "initialization" stuff in them (from when the choice lists were first created). Some of that is actually still in the Django part. Later we made a table for field choices, but before that, this lived inside the Python migrations that create things. Recall that when we make migrations now, we delete tons of stuff from the automatically generated ones.

It seems important to make sure that a bunch of obsolete Django stuff isn't carried over into a fresh database, and that the migrations are pruned down. We only want the newest database structure, not one derived from its whole evolution.

Woseseltops commented 4 years ago

Recall we've had problems getting the test database to work. There were some essential tables that needed to remain non-empty, and others that connect objects in different tables.

Ah right, good point @susanodd, there is a test database, which is a lightweight version of the full database! So I should request two databases.

Recall that when we make migrations now, we delete tons of stuff from the automatically generated ones. And that the migrations are pruned down. We only want the newest database structure, not one derived from its whole evolution.

Ah yes, doing database migrations is not super clean these days. In the past we have at some point 'deleted' previous migrations and started fresh. I think you are right that this is a good time for such a restart, to prevent problems. Something like this: https://simpleisbetterthancomplex.com/tutorial/2016/07/26/how-to-reset-migrations.html

Woseseltops commented 4 years ago

Fun fact: I just made a test dump of the database (output is JSON), and it's almost 1 GB!

susanodd commented 4 years ago

Gosh!

I just read the 'how to reset migrations' article. I'm now wondering about the Django Guardian stuff?

susanodd commented 4 years ago

@Woseseltops, what kind of output are you generating for the export? You wrote JSON. Is it possible to just use a database viewer program and export the tables as SQL insert commands, and the SQL tables as CREATE commands? There are a lot of tables. But if they are exported as SQL commands, then the data should retain its type. And if all the tables are exported from a single database, then all the inter-table (object) references should be correct.

Perhaps we need to write some sort of "diff" comparison to compare the contents of the old and new databases. Or just use Python to get all the objects from each model, write out dicts of everything, and compare those.

Woseseltops commented 4 years ago

Thanks a lot for thinking along @susanodd !

Is it possible to just use a database viewer program and export the tables as SQL insert commands, and the SQL tables as CREATE commands?

Technically yes, that is the raw, low-level way of doing it. What I am now experimenting with is a Django tool for things like this: develop.py dumpdata and develop.py loaddata. The advantage of this would be that everything that goes wrong for everybody when migrating is already taken care of, like translation issues between SQLite and MySQL. Can you think of an advantage of the low-level way? I was thinking speed, but the exporting part took only a few minutes.
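(For reference, dumpdata has a few flags that are often recommended for exactly this kind of SQLite-to-MySQL move, such as excluding the tables Django repopulates itself — a general Django example, not necessarily the exact command used here:

python develop.py dumpdata --natural-foreign --exclude contenttypes --exclude auth.permission --indent 2 > dump.json )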

But if they are exported as SQL commands, then the data should retain its type. And if all the tables are exported from a single database, then all the inter-table (object) references should be correct.

I think all of this is also covered by the 5-step workflow I describe above?

Perhaps we need to write some sort of "diff" comparison to compare the contents of the old and new databases. Or just use Python to get all the objects from each model, write out dicts of everything, and compare those.

As some sort of test you mean, to check if we did not lose data?

Woseseltops commented 4 years ago

Update: I tried not listening to @susanodd, linking the new MySQL database and doing a develop.py migrate, but it results in a classic circular Signbank problem: it tries to create the database structure, but for that to work it already needs the structure to be in place. I'm going to try raw SQL next.

Woseseltops commented 4 years ago

This morning I tried the other approach: setting up the database structure first by making a schema export from SQLite and then running that SQL in the new MySQL database... but I ran into various issues related to the differences between SQLite and MySQL. Some of these I could fix (MySQL does not seem to like double quotes, and calls AUTOINCREMENT AUTO_INCREMENT), but there were also a lot of things that I did not know how to fix.

I guess I go back to fixing the circular errors then.

Woseseltops commented 4 years ago

Okay, fixing the circular errors was not that bad. In all but one case, the problematic code was already in a try ... except OperationalError... which just did not work here because the error now counted as a django.db.utils.ProgrammingError instead of an OperationalError. I've reopened the issue where we tried to fix the circular errors so we can extend the solution: #564
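In other words, the guard now has to catch both exception types. A minimal sketch of the pattern (the query and import path are made-up examples, not the actual Signbank code):

from django.db.utils import OperationalError, ProgrammingError
from signbank.dictionary.models import FieldChoice  # assumed import path

try:
    # hypothetical startup-time query that fails while the tables do not exist yet
    choices = list(FieldChoice.objects.all())
except (OperationalError, ProgrammingError):
    # SQLite raises OperationalError here, but MySQL raises ProgrammingError
    choices = []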

After that, the database wanted to be created! I'm now trying develop.py loaddata. The first thing I ran into was that during the creation of the tables, Django already puts some data in there, which then creates a conflict when importing other data. Fortunately I had already read here: https://www.shubhamdipt.com/blog/django-transfer-data-from-sqlite-to-another-database/ that you can simply fix this with:

# Remove the content types that Django created automatically during migrate,
# so they do not clash with the ones in the fixture
from django.contrib.contenttypes.models import ContentType
ContentType.objects.all().delete()

Now I get: Could not load sites.Site(pk=1): (1062, "Duplicate entry 'example.com'... would this also be data created by Django that can be removed?

susanodd commented 4 years ago

It sounds like Django data, since Site is a Django concept. It's weird that there's an example.com in there.

susanodd commented 4 years ago

It sounds like a chicken-or-egg problem.

susanodd commented 4 years ago

After first creating the new database, before importing the old stuff, can you export the new database and check what's in it? The linked article's database may not be as extensive as Signbank. There may be additional stuff that needs to be deleted.

Woseseltops commented 4 years ago

Your thinking was correct! When creating a new database structure with migrate, Django already puts some content types and an example site in the database. If you remove these again manually, the problem goes away. This is what I did to remove the example site:

# Remove the example site that Django created automatically during migrate
from django.contrib.sites.models import Site
Site.objects.all().delete()

The next problem was that there apparently are some duplicates in the dump JSON that MySQL does not like. I'm going to write a small script to identify them automatically.
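A minimal sketch of what such a script could look like (an illustration assuming the dump file is called dump.json, not the actual script):

import json
from collections import Counter

with open('dump.json') as f:
    objects = json.load(f)

# count how often each (model, primary key) combination occurs in the fixture
counts = Counter((obj['model'], obj['pk']) for obj in objects)
for (model, pk), n in counts.items():
    if n > 1:
        print(model, pk, 'occurs', n, 'times')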

vanlummelhuizen commented 4 years ago

@Woseseltops Sharing my experience: the live instance of Mediate uses the same MySQL server. I also have a development instance on another server with SQLite on the scratch disk (symlinked from writable/database). Here is the thing: this development instance seems faster than the live instance.

Woseseltops commented 4 years ago

Hmm... that is indeed good to know. Although I wouldn't have expected it, I can retroactively come up with a reason: MySQL adds some overhead in the form of a separate process that needs to send data over sockets, while SQLite always reads the data directly from disk. The advantage of MySQL is that it is optimized for larger datasets (I think), so there must be some dataset size threshold where MySQL's speed gains outweigh its overhead. Do you think this could be true @vanlummelhuizen ?

vanlummelhuizen commented 4 years ago

@Woseseltops Honestly, I don't know. For Mediate there are over half a million records for some models, far more than any Signbank model (I think).

MySQL may be more stable than SQLite, so I am not against using it. The speed of that specific MySQL server may not be higher than SQLite on local disk, though. Perhaps a local MySQL server in a container setup could help.

Of course the number of queries and indexing may help speed. But I think we should discuss that in #504.

Woseseltops commented 4 years ago

Problems that need to be fixed in the datadump:

For now I fixed this by removing the problem objects, but we will need to investigate what these objects are before we do the migration for real.

Currently stuck at this error message:

django.db.utils.IntegrityError: Problem installing fixture '/var/www/signbank/live/writable/database/dump_cleaned.json': Could not load dictionary.Keyword(pk=2127): (1062, "Duplicate entry 'OK' for key 'text'")

As far as I can tell, there is only one 'OK' keyword.

Woseseltops commented 4 years ago


It took a loooot of patience, but now Django agrees to load the data in the fixture into the MySQL database. More problems I ran into:

Woseseltops commented 4 years ago

So far I have simply removed all problematic database items. I have now given it a bit more thought:

1 Translation duplicates

These are multiple translation objects that link the same gloss to the same keyword for the same language. I have investigated, and this is really garbage that can be removed:

(screenshot of the duplicate translation objects)

2 The user profiles

I will create a separate script that can import existing user profile data into newly created user profiles.
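Roughly along these lines (an illustrative sketch only; the import path, file name and profile fields are assumptions):

import json
from django.contrib.auth.models import User
from signbank.dictionary.models import UserProfile  # assumed location of the profile model

# userprofiles.json: hypothetical file holding the old userprofile records from the dump
with open('userprofiles.json') as f:
    old_profiles = json.load(f)

for entry in old_profiles:
    user = User.objects.get(pk=entry['fields']['user'])       # dumpdata stores the user as a pk
    profile, _ = UserProfile.objects.get_or_create(user=user)
    # copy over whatever profile fields exist in the dump, for example:
    # profile.expiry_date = entry['fields'].get('expiry_date')
    profile.save()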

3 The PK=0 gloss

This is https://signbank.science.ru.nl/dictionary/gloss/0/ JAN-WILLEM-VAN-MANSVELT. I guess we can remove this one and recreate it, hoping that it will get a valid primary key this time? We also need to rewire all morphology, translations, frequency data and videos that refer to the pk=0 gloss:

{'fields': {'index': 0, 'language': 2, 'gloss': 0, 'translation': 3233}, 'model': 'dictionary.translation', 'pk': 5777}
{'fields': {'parent_gloss': 0, 'role': '2', 'morpheme': 3817}, 'model': 'dictionary.morphologydefinition', 'pk': 542}
{'fields': {'parent_gloss': 0, 'role': '3', 'morpheme': 1028}, 'model': 'dictionary.morphologydefinition', 'pk': 543}
{'fields': {'text': 'JAN-WILLEM-VAN-MANSVELT', 'gloss': 0, 'language': 2}, 'model': 'dictionary.annotationidglosstranslation', 'pk': 1634}
{'fields': {'text': 'JAN-WILLEM-VAN-MANSVELT', 'gloss': 0, 'language': 1}, 'model': 'dictionary.annotationidglosstranslation', 'pk': 6115}
{'fields': {'gloss': 0, 'document': 1195, 'speaker': 26, 'frequency': 3}, 'model': 'dictionary.glossfrequency', 'pk': 18892}
{'fields': {'version': 0, 'gloss': 0, 'videofile': 'glossvideo/NGT/JA/JAN-WILLEM-VAN-MANSVELT-0.mp4'}, 'model': 'video.glossvideo', 'pk': 1483}
vanlummelhuizen commented 4 years ago

The 'OK' problem mentioned above was caused by MySQL comparing strings case-insensitively by default. I fixed it by making this specific column case sensitive: ALTER TABLE dictionary_keyword CHANGE `text` `text` VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_bin;
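The underlying behaviour is easy to see directly in MySQL (just an illustration):

SELECT _utf8'OK' = _utf8'ok' COLLATE utf8_general_ci;  -- 1: equal under the default case-insensitive collation, hence the 'duplicate'
SELECT _utf8'OK' = _utf8'ok' COLLATE utf8_bin;         -- 0: distinct once the comparison is binary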

@Woseseltops Did you know that in MySQL utf8 is not really UTF-8? MySQL's utf8 can hold at most 3 bytes per character, while real UTF-8 characters can take up to 4 bytes. utf8mb4 is MySQL's real UTF-8. See for example: https://www.eversql.com/mysql-utf8-vs-utf8mb4-whats-the-difference-between-utf8-and-utf8mb4/

Woseseltops commented 4 years ago

No I didn't know that @vanlummelhuizen ! I've looked it up: UTF-8 uses 1 byte for ASCII characters and 2-3 bytes for most other characters, but I'm not so sure about the Chinese characters. Sounds like it would be a good idea to use real UTF-8 by default. Apparently it's not a simple thing to do, however :( . https://dba.stackexchange.com/questions/8239/how-to-easily-convert-utf8-tables-to-utf8mb4-in-mysql-5-5

Woseseltops commented 3 years ago

Okay, the script to install user profiles is working. Next up: creating a command that converts all tables and columns to real UTF-8.

Woseseltops commented 3 years ago

Note to self: useful overview of all columns: select table_name,column_name,character_set_name from information_schema.columns;

susanodd commented 3 years ago

There is a lot of stuff with weird encodings in the GlossVideo table. When moving to MySQL, can something be done about this?

Filenames with special UTF-8 characters, as well as filenames with percent encoding, cause problems when uploading a new video. The code silently "does nothing" with the paths it can't handle. It's not possible to get rid of stuff in GlossVideo, so it keeps finding the "first" object in the list and then can't do anything with its path. (#674)

(This is causing a problem with Farsi, also because concatenation works backwards.)

Woseseltops commented 3 years ago

Look at this MySQL-monster:

ALTER TABLE auth_user_user_permissions CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_gloss_morphemePart CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE feedback_generalfeedback CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_keyword CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_lemmaidglosstranslation CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE django_summernote_attachment CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_blendmorphology CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_corpus CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE feedback_signfeedback CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE tagging_taggeditem CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE pages_page CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE video_glossvideo CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE attachments_attachment CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_gloss CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_language CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE auth_user CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_annotationidglosstranslation CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_relation CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE video_video CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_glossfrequency CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE feedback_missingsignfeedback CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE tagging_tag CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_gloss_dialect CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE django_session CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_morpheme CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE django_admin_log CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE auth_permission CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE guardian_userobjectpermission CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_gloss_creator CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE auth_group CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_dataset_owners CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_gloss_signlanguage CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_signlanguage CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_speaker CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_userprofile CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE django_site CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE pages_page_group_required CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE feedback_interpreterfeedback CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE reversion_version CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_simultaneousmorphologydefinition CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE pages_pagevideo CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE guardian_groupobjectpermission CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_morphologydefinition CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_userprofile_selected_datasets CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_othermedia CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_deletedglossormedia CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_dataset CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE auth_user_groups CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE django_content_type CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE auth_group_permissions CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE django_migrations CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_document CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE reversion_revision CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_dataset_exclude_choices CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_glossrevision CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_definition CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_fieldchoice CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_lemmaidgloss CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE video_glossvideohistory CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_translation CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_dialect CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_relationtoforeignsign CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_handshape CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE dictionary_dataset_translation_languages CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

According to the StackOverflow answer I'm using, all columns should also be converted individually, making the command even larger. However, if I use select table_name,column_name,character_set_name,collation_name,column_type from information_schema.columns; everything already looks perfect, so I think this can be skipped.
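Incidentally, a statement list like the one above can also be generated rather than written by hand (a sketch, assuming the schema is called signbank):

SELECT CONCAT('ALTER TABLE ', table_name,
              ' CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;')
FROM information_schema.tables
WHERE table_schema = 'signbank';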

susanodd commented 3 years ago

OMG!

Woseseltops commented 3 years ago

Okay the database is UTF-8 ready. Next week, I will:

Woseseltops commented 3 years ago

Recreate the gloss for Jan-Willem Mansveld. See if a newly created gloss after that does not get pk 0. Rewire all related objects.

Done, all seems okay. We now have https://signbank.science.ru.nl/dictionary/gloss/38194/

Woseseltops commented 3 years ago

I've added a database switch to the settings, so I can quickly switch between MySQL and SQLite (with the latter being the real, up-to-date database).
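Such a switch could look roughly like this in the settings (a sketch with assumed names, paths and credentials, not the actual configuration):

USE_MYSQL = False  # flip to True to run against the new MySQL database

if USE_MYSQL:
    DATABASES = {'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'signbank',        # assumed database name
        'USER': 'signbank',
        'PASSWORD': '...',
        'HOST': 'localhost',
    }}
else:
    DATABASES = {'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': '/var/www/signbank/live/writable/database/signbank.db',  # assumed filename
    }}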

Unfortunately, when I switch to MySQL, Signbank does not even want to start. I get this error:

ImportError: No module named 'MySQLdb.constants'

Which is weird, because my MySQLdb is definitely there. I know this because I can use MySQL-related functionality (like loading data into it), and this test also strongly suggests it:

(sb-env) [wstoop@applejack:/var/www/signbank/live]$ python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
>>>

Any idea what's going on here, @vanlummelhuizen and @susanodd ?

vanlummelhuizen commented 3 years ago

@Woseseltops I have tried some stuff and googled for it, but unfortunately I have not found a possible solution.

Woseseltops commented 3 years ago

Turns out there are multiple Python MySQL packages for different Python versions, which have different names when installing with pip versus importing; Django seems to support them all (I think). An attempt to take away my own confusion:

| pip install | import | note |
| --- | --- | --- |
| MySQL-python | MySQLdb | Python 2 only |
| mysqlclient | MySQLdb | fork of the previous |
| PyMySQL | pymysql | pure Python, has a function to 'act as MySQLdb' (?) |
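For the record, the PyMySQL 'act as MySQLdb' trick is a single call, typically placed at the top of manage.py or the settings file (general PyMySQL usage, not Signbank-specific):

import pymysql
pymysql.install_as_MySQLdb()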
Woseseltops commented 3 years ago

Today I tried to install PyMySQL instead of the MySQLdb package, using its option 'install_as_MySQLdb'. Like with the other package, it works when I try it separately (in the Django shell), but when I run the actual application, I get:

AttributeError: module 'pymysql' has no attribute 'install_as_MySQLdb'

A similar kind of error, as if it didn't load the module correctly or something. I'm now going to test the hypothesis that it's related to uWSGI/Emperor.

Woseseltops commented 3 years ago

Ha! It's Emperor-related! If I run the uwsgi process by hand, I get a new error: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock'! Progress!

ocrasborn commented 2 years ago

What's the status of this issue, @Woseseltops ?

susanodd commented 1 year ago

Can this be closed?

vanlummelhuizen commented 1 year ago

@susanodd I think using MySQL is not yet possible. @Woseseltops correct?

vanlummelhuizen commented 1 year ago

@Woseseltops Any progress here?