Import Discogs' data dumps into Postgres efficiently.
Uses Hexpat for parsing XML and Postgres' COPY to store the result.
Inspired by discogs-xml2db.
I wanted Discogs' data in a Postgres database, but I had some issues with the existing solution.
The project uses stack, so installation should be as easy as:
stack install
and adding stack's binary path to your PATH (if you haven't already).
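If stack's binary directory isn't on your PATH yet, a line like the following in your shell profile does it. This assumes stack's default install location; `stack path --local-bin` prints the actual one on your machine:

```shell
# Add stack's install directory to PATH; ~/.local/bin is stack's
# usual default, but verify with `stack path --local-bin`.
export PATH="$HOME/.local/bin:$PATH"
```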
Supposing we downloaded the 20150810 dump and didn't decompress the files, we can run, from the repo's directory:
$ createdb discogs
$ psql discogs < sql/tables.sql
$ discogs2pg -g -d 20150810 -c dbname=discogs
$ # wait an hour or two ...
$ psql discogs < sql/indexes.sql
discogs2pg has two modes of operation: by file and by date.
You can import a single file with discogs2pg -c <blah> some_releases.xml,
or import the releases, artists, labels and masters dumps for a date DATE
from the current directory with discogs2pg -c <blah> -d DATE.
Also, the -g | --gzip option lets you import the compressed files directly.
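Putting those flags together, the two modes look like this (some_releases.xml is a placeholder filename, and the connection string is just an example):

```shell
# By file: load a single XML dump into the database
discogs2pg -c dbname=discogs some_releases.xml

# By date: load the releases, artists, labels and masters dumps
# for 20150810 from the current directory, gzipped, via --gzip
discogs2pg --gzip -c dbname=discogs -d 20150810
```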
If you want it to be even faster, and your computer can handle it, you can pass the optional --aggressive
flag to make it open all files in parallel. Keep in mind that since we are using COPY, it opens a
connection per table: that's 15 connections at once.
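Since aggressive mode opens one COPY connection per table, it's worth checking that Postgres allows at least that many concurrent connections; the stock default of 100 is plenty. This is plain psql, not a discogs2pg feature:

```shell
# Show the server's connection limit; it must comfortably exceed
# the 15 connections discogs2pg opens in aggressive mode.
psql discogs -c 'SHOW max_connections;'
```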
I don't have a Windows machine to build and test the binary, so I'd love it if someone could look into that.
I also have a couple of improvements in mind.
I'd be glad to hear suggestions and take pull requests, of course.