DDMAL / linkedmusic-datalake

To create mapping strategies for various music databases into our data lake
https://virtuoso.staging.simssa.ca

Converting a json file of MusicBrainz data with nested lists/dictionaries to a finite CSV file #48

Closed Yueqiao12Zhang closed 5 months ago

Yueqiao12Zhang commented 5 months ago

Convert a sample of MusicBrainz data (exported in JSON format) into a CSV file. MusicBrainz JSON may contain arbitrary nesting of objects and arrays, as in the example below.

"artist-credit": [
            {
                "name": "Pigface",
                "joinphrase": "",
                "artist": {
                    "type-id": "e431f5f6-b5d2-343d-8b36-72607fffb74b",
                    "genres": [
                        {
                            "disambiguation": "",
                            "count": 1,
                            "id": "89255676-1f14-4dd8-bbad-fca839d6aff4",
                            "name": "electronic"
                        },
                        {
                            "disambiguation": "",
                            "id": "ffbc9907-c9be-4ace-876b-b7fd5b9d51f9",
                            "count": 4,
                            "name": "industrial rock"
                        }
                    ],
                    "tags": [
                        {
                            "name": "american",
                            "count": 1
                        },
                        {
                            "name": "electronic",
                            "count": 1
                        },
                        {
                            "name": "industrial rock",
                            "count": 4
                        },
                        {
                            "count": 2,
                            "name": "supergroup"
                        }
                    ],
                    "aliases": [],
                    "type": "Group",
                    "id": "11137c88-a9a2-4ffa-a97d-fb058c6d6ce2",
                    "disambiguation": "industrial supergroup",
                    "sort-name": "Pigface",
                    "name": "Pigface"
                }
            }
        ],

This is one attribute with multiple values from an instance of a "recording". Because it is a nested list/dictionary, it can be arbitrarily long and deep, so it is impossible to give every single value its own column in the CSV file. How do we interpret this?

Originally posted by @Yueqiao12Zhang in https://github.com/DDMAL/linkedmusic-datalake/issues/45#issuecomment-2127246898

Yueqiao12Zhang commented 5 months ago

The reason behind this PR is: OpenRefine has bugs reading the JSON Lines format file downloaded directly from MusicBrainz. If we want to use the reconciliation function from OpenRefine, using a CSV or a TSV format is the most efficient. Therefore, I want to first convert the JSON Lines format to a flattened CSV that can be reconciled using OpenRefine.
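As a rough illustration of what "flattened CSV" could mean here, the sketch below reads JSON Lines records and turns every nested value into a column named by its dotted path (list positions become numeric path segments). This is only one possible flattening, not the script used in this repository; the function names `flatten` and `jsonl_to_csv` and the sample records are assumptions made for the example.

```python
import csv
import io
import json


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into one dict keyed by dotted paths."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}.{index}" if prefix else str(index)))
    else:
        # Scalars (strings, numbers, booleans, None) end the recursion.
        flat[prefix] = value
    return flat


def jsonl_to_csv(lines, out):
    """Write an iterable of JSON Lines strings to `out` as CSV; return the header."""
    rows = [flatten(json.loads(line)) for line in lines if line.strip()]
    # The header is the union of all paths, in first-seen order.
    header = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(out, fieldnames=header, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return header


# Hypothetical two-record sample, loosely modelled on the example above.
sample = [
    '{"id": "r1", "artist-credit": [{"name": "Pigface", "artist": {"type": "Group"}}]}',
    '{"id": "r2", "artist-credit": []}',
]
buffer = io.StringIO()
columns = jsonl_to_csv(sample, buffer)
print(buffer.getvalue())
```

With this scheme the number of columns grows with the longest list in the data, which is the problem described at the top of this issue; empty lists and empty objects produce no column at all.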

dchiller commented 5 months ago

I have maybe a silly question:

If we want the data to end up in a CSV format for reconciliation in OpenRefine, why are we starting with the JSON dump from MusicBrainz and not from the Postgres dump?

Yueqiao12Zhang commented 5 months ago

> I have maybe a silly question:
>
> If we want the data to end up in a CSV format for reconciliation in OpenRefine, why are we starting with the JSON dump from MusicBrainz and not from the Postgres dump?

To import these Postgres dumps, we need a compatible version of the MusicBrainz server software. I think the JSON format is easier for me to work with since I am more familiar with it.

dchiller commented 5 months ago

Ok, I see.

Yueqiao12Zhang commented 5 months ago

New approach: if a key has several elements, we write each additional element to a new row of the CSV.
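One way to read this approach is sketched below: list indexes are dropped from the column name, every value found under the same path is collected into that column, and the n-th value of each column goes on the n-th row for the record. This is my interpretation for illustration only, not necessarily the code that was merged; `collect` and `record_to_rows` are hypothetical names, and the record is a cut-down version of the artist in the example above.

```python
import csv
import io
from itertools import zip_longest


def collect(value, path, columns):
    """Gather every scalar under `value` into columns[path], ignoring list indexes."""
    if isinstance(value, dict):
        for key, child in value.items():
            collect(child, f"{path}.{key}" if path else key, columns)
    elif isinstance(value, list):
        for child in value:
            collect(child, path, columns)
    else:
        columns.setdefault(path, []).append(value)


def record_to_rows(record):
    """Return (header, rows); extra list elements spill onto additional rows."""
    columns = {}
    collect(record, "", columns)
    header = list(columns)
    rows = [
        dict(zip(header, values))
        for values in zip_longest(*columns.values(), fillvalue="")
    ]
    return header, rows


record = {
    "name": "Pigface",
    "type": "Group",
    "genres": [
        {"name": "electronic", "count": 1},
        {"name": "industrial rock", "count": 4},
    ],
}
header, rows = record_to_rows(record)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The column count now depends only on the distinct key paths, not on list lengths. A caveat of this sketch: if one list element lacks a key that its siblings have, the values in that column shift up a row and no longer line up with their siblings, so a real implementation would need to pad missing keys.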

Yueqiao12Zhang commented 5 months ago

In another issue, @ahankinson mentioned my procedures for reconciling CSVs in OpenRefine. I uploaded detailed history JSON files that are exported directly from OpenRefine. They contain all the steps I went over to reconcile.

I will also describe my steps.

For every column containing an "id", I replaced the value with its [https://musicbrainz.org/{entity_type}/{id}] reference, which links directly to the MusicBrainz web page. For other columns such as "names", "title", "genre names", etc., I reconciled the values with the Wikidata reconciliation service. If there is no perfect match, I go to the original page in MusicBrainz and check whether it has a Wikidata link, since some entities have a different name or cannot be found by the reconciliation service; if there is no link, I leave that cell unreconciled. After all the reconciliation is done, I add a column beside each reconciled column containing the reconciled Wikidata URLs.

The data is then ready to export. After exporting, we write the mapper file for the RDF conversion: copy the header to the relationsmapping{database name}.json, turn it into JSON with a single dictionary in which each column of the header is a key, and fill in the values using Wikidata Property links, Wikidata Instance links, Schema.org links, or MusicBrainz documentation links (in that order of preference, best to worst).

Then we run the csv2rdf_single_subject.p, which produces an out_rdf.ttl that is ready to be imported into Virtuoso. Go to Conductor > Linked Data > Quad Store Upload, select the out_rdf.ttl file, give it a name, check "create graph explicitly", and upload. We can check whether the upload succeeded in the Linked Data > Graphs > Graphs section. If the graph is there, go to Linked Data > SPARQL, enter the name we gave the graph in the Default Graph IRI field, and run SPARQL queries.
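For readers who want to see two of these steps outside OpenRefine, here is a small Python sketch of the id-to-URL replacement and of generating an empty mapper dictionary from a CSV header. The steps described above were done in OpenRefine and by hand, so this is only an illustrative alternative. The column names (`artist_id`, `genre_id`, and so on), the `ID_COLUMNS` table, and the function names are assumptions; the URL pattern and the two ids come from this thread.

```python
import json


def musicbrainz_url(entity_type, mbid):
    """Build the MusicBrainz page URL following the {entity_type}/{id} pattern."""
    return f"https://musicbrainz.org/{entity_type}/{mbid}"


# Hypothetical column names and their entity types; adjust to the real header.
ID_COLUMNS = {"artist_id": "artist", "genre_id": "genre"}


def link_ids(row):
    """Return a copy of `row` with non-empty id cells replaced by URLs."""
    linked = dict(row)
    for column, entity_type in ID_COLUMNS.items():
        if linked.get(column):
            linked[column] = musicbrainz_url(entity_type, linked[column])
    return linked


def mapping_skeleton(header):
    """One dictionary with each column as a key; the values are filled in by hand."""
    return json.dumps({column: "" for column in header}, indent=4)


row = {
    "artist_id": "11137c88-a9a2-4ffa-a97d-fb058c6d6ce2",
    "artist_name": "Pigface",
    "genre_id": "89255676-1f14-4dd8-bbad-fca839d6aff4",
    "genre_name": "electronic",
}
linked_row = link_ids(row)
print(linked_row["artist_id"])
print(mapping_skeleton(row.keys()))
```

The skeleton deliberately leaves every value empty, because choosing the right Wikidata, Schema.org, or MusicBrainz link for each column is a manual decision.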

dchiller commented 5 months ago

> In another issue, @ahankinson mentioned my procedures for reconciling CSVs in OpenRefine. I uploaded detailed history JSON files that are exported directly from OpenRefine. They contain all the steps I went over to reconcile.

I am sorry, but I am going to be a stickler about this for a bit: this is an issue about taking a JSON file of MusicBrainz data and turning it into a CSV file. This is an unrelated discussion, so it does not belong in this issue.