Open Yueqiao12Zhang opened 1 month ago
Week 5/21: Download raw data from MusicBrainz, convert it to a Python-readable format, then to CSV, reconcile it using OpenRefine, and apply the result to the original script.
"artist-credit": [
  {
    "name": "Pigface",
    "joinphrase": "",
    "artist": {
      "type-id": "e431f5f6-b5d2-343d-8b36-72607fffb74b",
      "genres": [
        {
          "disambiguation": "",
          "count": 1,
          "id": "89255676-1f14-4dd8-bbad-fca839d6aff4",
          "name": "electronic"
        },
        {
          "disambiguation": "",
          "id": "ffbc9907-c9be-4ace-876b-b7fd5b9d51f9",
          "count": 4,
          "name": "industrial rock"
        }
      ],
      "tags": [
        {
          "name": "american",
          "count": 1
        },
        {
          "name": "electronic",
          "count": 1
        },
        {
          "name": "industrial rock",
          "count": 4
        },
        {
          "count": 2,
          "name": "supergroup"
        }
      ],
      "aliases": [],
      "type": "Group",
      "id": "11137c88-a9a2-4ffa-a97d-fb058c6d6ce2",
      "disambiguation": "industrial supergroup",
      "sort-name": "Pigface",
      "name": "Pigface"
    }
  }
],
This is one attribute with multiple values from an instance of a "recording". Since it is a nested list/dictionary, it can be infinitely long. It is impossible to give every single value its own column in the CSV file. How do we interpret this?
My idea is to use an algorithm that recursively keeps track of only the "id" and "name" values and gives them columns in the CSV.
For example: the artist with id 001 has the following attribute in the JSON file:
"genres": [
  {"id": "1", "name": "rnb"},
  {"id": "2", "name": "jazz"},
  {"id": "3", "name": "soul"}
]
Goes to the CSV file:
artist_id,genre_id,,,genre_name,,
001,1,2,3,"rnb","jazz","soul"
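The recursive idea above could be sketched roughly as follows. This is only a minimal illustration, not the project's script: the function names `collect_id_name` and `flatten_id_name`, the underscore-joined column names, and the numeric suffixes (`genres_id_1`, `genres_id_2`, ...) are all assumptions made for the example.

```python
import csv
import io


def collect_id_name(node, prefix=""):
    """Recursively gather (column, value) pairs for every "id" and "name" key.

    Hypothetical helper: column names are built by joining the parent keys
    with underscores, which is an assumption, not an established convention.
    """
    pairs = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}_{key}" if prefix else key
            if key in ("id", "name") and not isinstance(value, (dict, list)):
                pairs.append((path, value))
            else:
                # Other keys are kept only if they contain nested id/name values.
                pairs.extend(collect_id_name(value, path))
    elif isinstance(node, list):
        for item in node:
            pairs.extend(collect_id_name(item, prefix))
    return pairs


def flatten_id_name(record):
    """Turn one nested record into a flat row, numbering repeated columns."""
    grouped = {}
    for column, value in collect_id_name(record):
        grouped.setdefault(column, []).append(value)
    row = {}
    for column, values in grouped.items():
        if len(values) == 1:
            row[column] = values[0]
        else:
            for i, value in enumerate(values, start=1):
                row[f"{column}_{i}"] = value
    return row


# The artist from the example above.
artist = {
    "id": "001",
    "genres": [
        {"id": "1", "name": "rnb"},
        {"id": "2", "name": "jazz"},
        {"id": "3", "name": "soul"},
    ],
}

row = flatten_id_name(artist)

# Write the single row to an in-memory CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
wide_csv = buffer.getvalue()
```

One consequence of this layout is that the number of columns depends on the record with the most genres, so rows for different artists would need to be padded to a common header before being written to the same file.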
First, make a separate issue about this. Then look for standard solutions for this. Document possible solutions, e.g., RDF: Containers and Collections. Implement one or more solutions then experiment with potential implications when querying. I would start with just one of the standard solutions. Make sure to document other possible choices here and why you chose one of them, so that if the chosen solution is not ideal when querying, we can come back here and try other solutions. BTW, these lists could be long but never infinite.
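As one candidate to document alongside the RDF options, a common alternative to widening the table is a long layout with one row per artist-genre pair. The sketch below is only an illustration of that option under assumed names (`genre_rows` and the column names `artist_id`, `genre_id`, `genre_name` are hypothetical); it is not a statement of which solution should be chosen.

```python
import csv
import io

# Same hypothetical artist as in the example above.
artist = {
    "id": "001",
    "genres": [
        {"id": "1", "name": "rnb"},
        {"id": "2", "name": "jazz"},
        {"id": "3", "name": "soul"},
    ],
}


def genre_rows(artist):
    """Yield one flat row per (artist, genre) pair."""
    for genre in artist.get("genres", []):
        yield {
            "artist_id": artist["id"],
            "genre_id": genre["id"],
            "genre_name": genre["name"],
        }


# Write the rows to an in-memory CSV with a fixed three-column header.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["artist_id", "genre_id", "genre_name"])
writer.writeheader()
writer.writerows(genre_rows(artist))
long_csv = buffer.getvalue()
```

With this layout the header stays the same no matter how many genres an artist has; the trade-off is that the artist id repeats on every row, so queries have to group rows back together.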
Can you put a hold on these tasks? I don't know what this ER schema is. You can explain this to me on Friday.
Can you work on importing some MusicBrainz data then reconcile it using OpenRefine? Make an issue for that.
Originally posted by @fujinaga in https://github.com/DDMAL/linkedmusic-datalake/issues/43#issuecomment-2113071989