acoustid / mbdata

MusicBrainz SQLAlchemy Models
MIT License

musicbrainz.release table restoration takes too long probably because of triggers #64

Closed roughnemezis closed 1 year ago

roughnemezis commented 1 year ago

I first used the mbslave command to restore the MusicBrainz DB from a dump on my computer. Since it took ages (it didn't finish in a day), I dug into the code and narrowed it down to restoring the release table from a downloaded db dump:

from mbdata.replication import Config

config = Config(['mbslave.conf'])
db = config.connect_db()
filename = "mbdump/release"
schema, table = "musicbrainz", "release"

cursor = db.cursor()
with open(filename, 'r') as f:
    cursor.copy_expert('COPY {}.{} FROM STDIN'.format(schema, table), f)
db.commit()

Still very slow: after 4 hours the data was not inserted.

If I disable the triggers on this table:

cursor = db.cursor()
cursor.execute("ALTER TABLE musicbrainz.release DISABLE TRIGGER ALL")
with open(filename, 'r') as f:
    cursor.copy_expert('COPY {}.{} FROM STDIN'.format(schema, table), f)
cursor.execute("ALTER TABLE musicbrainz.release ENABLE TRIGGER ALL")
db.commit()

the restoration takes a little more than a minute.

Doing so before restoring each table might speed up the restoration process by a factor of a few hundred. There might be a reason not to do it that is beyond the reach of my thinking :-|

Is it doable?
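For what it's worth, the idea above can be sketched as a small helper that wraps each COPY in DISABLE/ENABLE TRIGGER statements. This is only a sketch with hypothetical helper names; the cursor is assumed to be any psycopg2-style cursor (anything with execute and copy_expert):

```python
def trigger_toggle_sql(schema, table, enable):
    """Build the ALTER TABLE statement that flips all triggers on a table."""
    action = "ENABLE" if enable else "DISABLE"
    return "ALTER TABLE {}.{} {} TRIGGER ALL".format(schema, table, action)

def restore_table(cursor, schema, table, dump_path):
    """COPY one mbdump file into its table with triggers switched off."""
    cursor.execute(trigger_toggle_sql(schema, table, enable=False))
    with open(dump_path) as f:
        cursor.copy_expert("COPY {}.{} FROM STDIN".format(schema, table), f)
    cursor.execute(trigger_toggle_sql(schema, table, enable=True))
```

Note that DISABLE TRIGGER ALL requires superuser or table-owner privileges, so this only works with a suitably privileged connection.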

PS: in the postgresql log file there's also a mention of WAL writing occurring too frequently, so an ALTER TABLE ... SET UNLOGGED on each table before restoring might be a good idea, but I have no idea whether it affects performance dramatically.
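The UNLOGGED idea could be sketched the same way (helper name is hypothetical). One caveat worth knowing: switching back with ALTER TABLE ... SET LOGGED rewrites the whole table into the WAL, so the net saving may be smaller than expected:

```python
def logged_toggle_sql(schema, table, logged):
    """Build the ALTER TABLE statement that flips WAL logging for a table.

    Caution: SET LOGGED rewrites the entire table into the WAL, and an
    UNLOGGED table is truncated after a crash, so this trades safety for
    import speed.
    """
    mode = "LOGGED" if logged else "UNLOGGED"
    return "ALTER TABLE {}.{} SET {}".format(schema, table, mode)
```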

lalinsky commented 1 year ago

How did you set up the DB? You are not supposed to have any triggers, indexes, or primary keys created when importing data.
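(A quick way to verify that a table really has no triggers before a bulk import is to ask the PostgreSQL catalogs. The query below is my own sketch, not part of mbdata; pg_trigger, pg_class, and pg_namespace are standard catalogs, and tgisinternal filters out the internal triggers that back foreign-key constraints:)

```python
# Count user-visible triggers on a given table via the system catalogs.
TRIGGER_COUNT_SQL = """
SELECT count(*)
FROM pg_trigger t
JOIN pg_class c ON c.oid = t.tgrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = %s AND c.relname = %s AND NOT t.tgisinternal
"""

def has_triggers(cursor, schema, table):
    """Return True if the table has any non-internal triggers."""
    cursor.execute(TRIGGER_COUNT_SQL, (schema, table))
    return cursor.fetchone()[0] > 0
```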

roughnemezis commented 1 year ago

wow you're fast...

mbslave init --create_user --create-database --empty

with my db conf like so:

[database]
host=127.0.0.1
port=5432
name=musicbrainz
user=musicbrainz
password=musicbrainz
admin_user=postgres
admin_password=postgres

then I looped over the files of the dump using copy_expert
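That loop might look roughly like this (a sketch on my part, assuming a psycopg2-style connection and the mbdump/&lt;table&gt; file layout from the snippet earlier in the thread, where each file is named after its table):

```python
import os

def table_for_dump_file(path, schema="musicbrainz"):
    """Map an mbdump file path to its qualified table name."""
    return "{}.{}".format(schema, os.path.basename(path))

def restore_all(db, dump_dir="mbdump"):
    """COPY every file in the dump directory into its table."""
    cursor = db.cursor()
    for name in sorted(os.listdir(dump_dir)):
        path = os.path.join(dump_dir, name)
        with open(path) as f:
            cursor.copy_expert(
                "COPY {} FROM STDIN".format(table_for_dump_file(path)), f)
    db.commit()
```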

I did so because my first attempt to clone the db using mbslave init --create_user --create-database (without the --empty option) didn't finish in an afternoon

what did I do wrong?

roughnemezis commented 1 year ago

I looked at the mbslave_init_main function in the code, and it appears that the triggers are created after importing the data. My apologies.

My first attempt (mbslave init --create_user --create-database) might have taken a very long time because of my poor network; that's why I tried to first download the archive and then import it.

Could it be handy to have an option in mbslave init to manually download the db dump and then restore the database from the local file?

Anyway, I'm closing this issue

lalinsky commented 1 year ago

@roughnemezis Yeah, that would be a good addition. Previously, everything was done manually, so it was clearer what was happening. But it definitely makes sense to download the files first and then let mbslave init complete the DB setup.

orkenstein commented 1 year ago

> I first used the mbslave command to restore the MusicBrainz DB from a dump on my computer. Since it took ages (it didn't finish in a day), I dug into the code and narrowed it down to restoring the release table from a downloaded db dump

Could you please provide more details on how to actually replicate the MusicBrainz Postgres DB from a dump? The dump doesn't look like a proper set of Postgres pg_dump files.