thesession data processing

Yueqiao12Zhang commented 5 months ago

Two approaches to processing The Session raw CSVs:

1. Fetch all CSVs locally, reconcile them one by one, rename each CSV's id column and move it to the first position, then use the csv2rdf process to merge them all into one RDF file.
2. Fetch all CSVs locally, use Van's code to join them by tune_id, reconcile the single merged file directly, then use the csv2rdf process to convert it to RDF.

Comparison: The joining process in option 2 takes a lot of effort. Van's code expands all the CSVs horizontally, making one row per tune, and merges all the CSVs by tune_id. Although this leaves far fewer rows, it makes the header extremely long, so with option 2 it is almost impossible to reconcile using OpenRefine. Since The Session provides no reconciliation and there are thousands of rows, we have to reconcile the columns one by one. In option 1, we reconcile the raw CSVs directly. Since the data is still vertical, each CSV has only a small number of columns, which makes it easy to reconcile. Then we rename each id to {entity_type}_id and go to csv2rdf directly, which can merge all the CSVs into RDF in one operation. I think this is much more convenient.
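The renaming step in option 1 could look roughly like the sketch below. It assumes the raw id column is literally named `id` and that the entity type is passed in by hand; the function name `move_id_first` and the sample row are hypothetical, not taken from the actual scripts.

```python
import csv
import io


def move_id_first(csv_text: str, entity_type: str, id_column: str = "id") -> str:
    """Rename `id_column` to `<entity_type>_id` and make it the first column.

    Assumption: the raw id column is called "id"; adjust `id_column` otherwise.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    if id_column not in fieldnames:
        raise KeyError(f"column {id_column!r} not found in header {fieldnames}")

    # All remaining columns keep their original relative order.
    others = [name for name in fieldnames if name != id_column]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([f"{entity_type}_id"] + others)
    for row in reader:
        writer.writerow([row[id_column]] + [row[name] for name in others])
    return out.getvalue()


# Hypothetical sample data, only to show the column reordering.
sample = "name,id,type\nThe Kesh,55,jig\n"
result = move_id_first(sample, "tune")
print(result)
```

Working on text in memory keeps the sketch self-contained; for the real files the same function can be wrapped around `open(...)` calls.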

Yueqiao12Zhang commented 5 months ago

The entities that are not reconciled well:

Events: event (7258/7293 not reconciled), venue
Tunes: name
Aliases: I don't think aliases should be reconciled, since what matters is their string values.
Recordings: the recording artists are identified by name rather than by id, which makes them difficult to reconcile with Wikidata. I'm wondering how can I get the The Session link to the artists. The recording column is also difficult to reconcile.
Sessions: names

Yueqiao12Zhang commented 5 months ago

In the events and sessions CSVs, there are three address columns: Country, Area, and Town. Country and Area reconcile well, but as the address gets more specific, the Town column produces many duplicate matches, since many towns share the same name within a single country. Because of this, I don't think the reconciliation procedure can be made fully automatic. The reconciled data needs human inspection.
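One way to narrow down what needs human inspection is to flag town names that appear under more than one area of the same country before reconciling. The sketch below is only a heuristic pre-filter: the lowercase keys `country`, `area`, and `town` and the sample rows are assumptions for illustration, and a town can still be ambiguous in Wikidata even if it appears only once in the CSV.

```python
from collections import defaultdict


def ambiguous_towns(rows):
    """Return {(country, town): sorted list of areas} for every town name
    that occurs in more than one area of the same country."""
    areas = defaultdict(set)
    for row in rows:
        areas[(row["country"], row["town"])].add(row["area"])
    return {key: sorted(vals) for key, vals in areas.items() if len(vals) > 1}


# Hypothetical rows; in practice these would come from csv.DictReader.
rows = [
    {"country": "USA", "area": "Illinois", "town": "Springfield"},
    {"country": "USA", "area": "Massachusetts", "town": "Springfield"},
    {"country": "Ireland", "area": "Clare", "town": "Ennis"},
]
flagged = ambiguous_towns(rows)
print(flagged)
```

Rows whose town is flagged this way could be skipped during automatic matching and reviewed by hand.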

dchiller commented 5 months ago

> I'm wondering how can I get the The Session link to the artists.

You mean something like this? https://thesession.org/recordings/artists/319

Yueqiao12Zhang commented 5 months ago

> I'm wondering how can I get the The Session link to the artists.
>
> You mean something like this? https://thesession.org/recordings/artists/319

Yes, but they only have the name for the artist in the CSV.

Yueqiao12Zhang commented 5 months ago

For sessions and events, since there are many towns/areas with the same name in different countries, we should inspect the reconciliation data carefully. My procedure maximizes automation, but some of the data will still need inspection. In these CSVs, the names and addresses, and the venue in events, have a very low reconciliation rate against Wikidata. Should I still reconcile them?

fujinaga commented 5 months ago

No need to reconcile the location if there are any ambiguities. Do make a note of this in the documentation of the importing process for each database.

Yueqiao12Zhang commented 5 months ago

Generating the RDF for Virtuoso is the last step for this issue.

fujinaga commented 4 months ago

Whenever OpenRefine cannot automatically reconcile an item (i.e., cannot assign a URI), you can leave the item either as a string literal, in which case do add a language tag (e.g., @en), or as data (e.g., a number or a date), in which case do add the data type (e.g., http://www.w3.org/2001/XMLSchema#date).

DDMAL / linkedmusic-datalake

thesession data processing #68