DDMAL / linkedmusic-datalake


Import MusicBrainz data and reconcile it using OpenRefine. #45

Open Yueqiao12Zhang opened 1 month ago

Yueqiao12Zhang commented 1 month ago

Can you put a hold on these tasks? I don't know what this ER schema is. You can explain this to me on Friday.

Can you work on importing some MusicBrainz data and then reconciling it using OpenRefine? Make an issue for that.

Originally posted by @fujinaga in https://github.com/DDMAL/linkedmusic-datalake/issues/43#issuecomment-2113071989

Yueqiao12Zhang commented 1 month ago

Week 5/21: Download raw data from MusicBrainz, convert it to a Python-readable format and then to CSV, reconcile it using OpenRefine, and apply the results to the original script.
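The conversion step could look roughly like the sketch below. It assumes the raw data is a text file with one JSON object per line, and the function name `jsonl_to_csv` and the default `fields` tuple are hypothetical choices made only for illustration. Only top-level scalar fields are kept; nested attributes are ignored here.

```python
import csv
import io
import json


def jsonl_to_csv(lines, out, fields=("id", "name", "type")):
    """Write one CSV row per JSON object, keeping only the given top-level fields.

    `lines` is any iterable of strings (for example an open file), assumed to
    hold one JSON object per line. Returns the number of rows written.
    """
    writer = csv.DictWriter(out, fieldnames=list(fields), extrasaction="ignore")
    writer.writeheader()
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Missing fields become empty cells.
        writer.writerow({field: record.get(field, "") for field in fields})
        count += 1
    return count


# Usage with an in-memory sample instead of a real dump file.
sample = io.StringIO(
    '{"id": "11137c88-a9a2-4ffa-a97d-fb058c6d6ce2", "name": "Pigface", "type": "Group"}\n'
)
buffer = io.StringIO()
jsonl_to_csv(sample, buffer)
print(buffer.getvalue())
```

The resulting CSV can then be loaded into OpenRefine as a regular project.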

Yueqiao12Zhang commented 1 month ago

"artist-credit": [
            {
                "name": "Pigface",
                "joinphrase": "",
                "artist": {
                    "type-id": "e431f5f6-b5d2-343d-8b36-72607fffb74b",
                    "genres": [
                        {
                            "disambiguation": "",
                            "count": 1,
                            "id": "89255676-1f14-4dd8-bbad-fca839d6aff4",
                            "name": "electronic"
                        },
                        {
                            "disambiguation": "",
                            "id": "ffbc9907-c9be-4ace-876b-b7fd5b9d51f9",
                            "count": 4,
                            "name": "industrial rock"
                        }
                    ],
                    "tags": [
                        {
                            "name": "american",
                            "count": 1
                        },
                        {
                            "name": "electronic",
                            "count": 1
                        },
                        {
                            "name": "industrial rock",
                            "count": 4
                        },
                        {
                            "count": 2,
                            "name": "supergroup"
                        }
                    ],
                    "aliases": [],
                    "type": "Group",
                    "id": "11137c88-a9a2-4ffa-a97d-fb058c6d6ce2",
                    "disambiguation": "industrial supergroup",
                    "sort-name": "Pigface",
                    "name": "Pigface"
                }
            }
        ],

This is one attribute with multiple values, taken from an instance of a "recording". Since it is a nested list/dictionary, it can be infinitely long. It is impossible to give every single value its own column in the CSV file. How do we interpret this?

Yueqiao12Zhang commented 1 month ago

My idea is to use an algorithm that recursively keeps track of only the "id" and "name" values and gives them columns in the CSV.
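A minimal sketch of that idea, assuming the record has already been parsed into Python dicts and lists. The function name `collect_id_name` and the dotted-path labels are hypothetical, not something taken from the existing script.

```python
def collect_id_name(node, path=""):
    """Yield (path, id, name) for every dict under `node` that has a "name" key.

    `path` is a dotted label built from the dict keys on the way down; list
    items share the path of the list that contains them. Dicts without an
    "id" (such as tags) yield None for the id.
    """
    if isinstance(node, dict):
        if "name" in node:
            yield path, node.get("id"), node["name"]
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else key
            yield from collect_id_name(value, child_path)
    elif isinstance(node, list):
        for item in node:
            yield from collect_id_name(item, path)


# Usage on a shortened version of the record above.
record = {
    "artist-credit": [
        {
            "name": "Pigface",
            "artist": {
                "id": "11137c88-a9a2-4ffa-a97d-fb058c6d6ce2",
                "name": "Pigface",
                "genres": [
                    {"id": "89255676-1f14-4dd8-bbad-fca839d6aff4", "name": "electronic"}
                ],
                "tags": [{"name": "american", "count": 1}],
            },
        }
    ]
}
for row in collect_id_name(record):
    print(row)
```

Every other key ("count", "disambiguation", "type-id", and so on) is dropped, so this only works if those values are not needed later for reconciliation.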

Yueqiao12Zhang commented 1 month ago

For example: the artist with id 001 has the following attribute in the JSON file:

"genres": [
    {
        "id": "1",
        "name": "rnb"
    },
    {
        "id": "2",
        "name": "jazz"
    },
    {
        "id": "3",
        "name": "soul"
    }
]

Goes to the CSV file:

artist_id,genre_id,,,genre_name,,
001,1,2,3,"rnb","jazz","soul"
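For reference, the layout above can be produced with the sketch below. The function name `write_wide_genres` and the input shape (a dict mapping artist id to its list of genres) are assumptions for illustration. The number of columns depends on the longest genre list in the data, so shorter lists have to be padded with empty cells.

```python
import csv
import io


def write_wide_genres(artists, out):
    """Write one row per artist with all genre ids, then all genre names.

    `artists` maps an artist id to a list of {"id": ..., "name": ...} dicts.
    The column count is set by the artist with the most genres.
    """
    width = max([len(genres) for genres in artists.values()] + [1])
    header_pad = [""] * (width - 1)
    writer = csv.writer(out)
    writer.writerow(["artist_id", "genre_id", *header_pad, "genre_name", *header_pad])
    for artist_id, genres in artists.items():
        fill = [""] * (width - len(genres))  # pad short lists to the full width
        ids = [genre["id"] for genre in genres] + fill
        names = [genre["name"] for genre in genres] + fill
        writer.writerow([artist_id, *ids, *names])


# Usage with the example artist plus a second artist that has a single genre.
artists = {
    "001": [
        {"id": "1", "name": "rnb"},
        {"id": "2", "name": "jazz"},
        {"id": "3", "name": "soul"},
    ],
    "002": [{"id": "2", "name": "jazz"}],
}
buffer = io.StringIO()
write_wide_genres(artists, buffer)
print(buffer.getvalue())
```

Python's `csv.writer` only quotes values when needed, so the names come out unquoted here. Adding one artist with a longer genre list changes the header and every row.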

fujinaga commented 1 month ago

First, make a separate issue about this. Then look for standard solutions for this. Document possible solutions, e.g., RDF: Containers and Collections. Implement one or more solutions then experiment with potential implications when querying. I would start with just one of the standard solutions. Make sure to document other possible choices here and why you chose one of them, so that if the chosen solution is not ideal when querying, we can come back here and try other solutions. BTW, these lists could be long but never infinite.
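As a starting point for the RDF container option, here is a rough sketch that writes one artist's genres as an `rdf:Bag` in N-Triples using only the standard library. The `https://example.org/` base URI, the `genres` predicate, and the function name `genres_as_bag` are placeholders and not part of any agreed schema.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BASE = "https://example.org/"  # placeholder namespace, not a real one


def genres_as_bag(artist_id, genre_ids):
    """Return N-Triples lines linking an artist to an rdf:Bag of genres.

    Members are attached with the container membership properties
    rdf:_1, rdf:_2, ... in the order given.
    """
    artist = f"<{BASE}artist/{artist_id}>"
    bag = f"_:genres_{artist_id}"  # blank node standing for the container
    triples = [
        f"{artist} <{BASE}genres> {bag} .",
        f"{bag} <{RDF}type> <{RDF}Bag> .",
    ]
    for index, genre_id in enumerate(genre_ids, start=1):
        triples.append(f"{bag} <{RDF}_{index}> <{BASE}genre/{genre_id}> .")
    return triples


# Usage with the example artist 001.
for line in genres_as_bag("001", ["1", "2", "3"]):
    print(line)
```

`rdf:Seq` has the same shape and is meant for cases where order matters. An RDF collection would instead chain the members with `rdf:first` and `rdf:rest`, ending in `rdf:nil`, which makes the list closed but tends to be more awkward to query.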
