DDMAL / linkedmusic-datalake

To create mapping strategies for various music databases into our data lake
https://virtuoso.staging.simssa.ca

How to convert a reconciled CSV file into an RDF graph file in Turtle format? #40

Closed candlecao closed 5 months ago

candlecao commented 5 months ago

"Although uploading CSV files to Virtuoso is straightforward, the process requires some customization before the data can be queried using SPARQL with custom URIs, which are also regarded as defined ontology’s vocabulary. This necessitates further investigation.

Alternatively, we can convert CSV files into Turtle format using a Python library such as RDFLib, or with the assistance of ChatGPT, and then upload the Turtle files to Virtuoso. This raises the question: Is it still necessary to assert the ontology's vocabulary explicitly?"

This revision aims to clarify the process and the considerations involved in working with CSV files and Virtuoso, as well as the role of ontology vocabulary in this context.
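
As a minimal sketch of the RDFLib route (the file name, column names, and example namespace below are hypothetical placeholders, not the project's actual mapping), the conversion could look like this:

```python
import csv
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

# Hypothetical namespace standing in for the project's custom vocabulary.
EX = Namespace("https://example.org/vocab/")

g = Graph()
g.bind("ex", EX)

# Assumes a reconciled CSV named "artists.csv" with "id" and "name"
# columns -- placeholders for whatever the mapper file actually specifies.
with open("artists.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        subject = URIRef(f"https://musicbrainz.org/artist/{row['id']}")
        g.add((subject, RDFS.label, Literal(row["name"])))

# Serialize the graph to Turtle for upload to Virtuoso.
g.serialize(destination="artists.ttl", format="turtle")
```

Whether predicates minted under a custom namespace like EX also need explicit ontology declarations in the graph is exactly the open question above.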

dchiller commented 5 months ago

Hi @candlecao! Is this a CantusDB-related issue? Just looking at it right now, it seems like there might be a better repository for this!

annamorphism commented 5 months ago

> Hi @candlecao! Is this a CantusDB-related issue? Just looking at it right now, it seems like there might be a better repository for this!

Perhaps linkedmusic-datalake?

candlecao commented 5 months ago

Yes. It's for LinkedMusic.

Yueqiao12Zhang commented 5 months ago

Week of May 10th:

Yueqiao12Zhang commented 5 months ago

Day 1: Read the OpenRefine documentation and generate a correct output CSV file; start working on the script.

Yueqiao12Zhang commented 5 months ago

Day 1: There is progress on the script; it works the way I expected. The test output CSV file generated using OpenRefine needs to be reviewed. One small mistake: the pull request from last week was not merged. I realized this after I pushed the new changes, so the unreviewed changes were merged together with the reviewed ones.

Yueqiao12Zhang commented 5 months ago

Day 2: The Python script is optimized and now works properly with the OpenRefine output CSV file. There is some uncertainty in the mapper due to how the predicates are defined, which needs further consideration. To make the script compatible with other CSV files, the CSV must be correctly reconciled, and the URIs of the predicate relations must be specified in the mapper file. I am also looking for other ways of reconciling the CSV, since the reconciliation service does not always work properly against databases other than Wikidata.

Day 3: Fix any other potential problems in the Python code; read up on how the MusicBrainz database can be exported to a CSV file that can be correctly processed.

Yueqiao12Zhang commented 5 months ago

The reason for converting to CSV: CSV is much simpler to reconcile using OpenRefine. If the reconciliation process is automated, we can try reconciling the JSON files directly.

ahankinson commented 5 months ago

How are you reconciling things in OpenRefine? Against what services? Turtle format and OpenRefine are very different, and I don't really have a good sense of what your actual reconciliation process looks like once the data is in OpenRefine.

Yueqiao12Zhang commented 5 months ago

For example: in MusicBrainz data, there are entities and the IDs of those entities. An ID can be inserted directly into https://musicbrainz.org/{entity_type}/{id}, which is the MusicBrainz web page for that instance. Other literals, such as strings and names, are reconciled using the Wikidata reconciliation API against their types; for example, "single" for titles of recordings, "human" for musicians, etc.
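
In code, that ID-to-URL step is plain string interpolation; a small sketch, using the artist ID quoted later in this thread:

```python
# Build a MusicBrainz page URL from an entity type and an ID.
def musicbrainz_url(entity_type: str, mbid: str) -> str:
    return f"https://musicbrainz.org/{entity_type}/{mbid}"

print(musicbrainz_url("artist", "1f9df192-a621-4f54-8850-2c5373b7eac9"))
# https://musicbrainz.org/artist/1f9df192-a621-4f54-8850-2c5373b7eac9
```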

ahankinson commented 5 months ago

> in MusicBrainz data, there are entities and the IDs of those entities. An ID can be inserted directly into https://musicbrainz.org/{entity_type}/{id}, which is the MusicBrainz web page for that instance.

Yes, I know that. But I'm not sure where OpenRefine fits in that process. Are you simply using OpenRefine to append strings together?

It took me several weeks to reconcile two moderately large datasets, so I'm surprised that you were able to reconcile them similarly in just a few hours.

Yueqiao12Zhang commented 5 months ago

For all MusicBrainz IDs, I just prepend the base URL using OpenRefine's "edit column" operation. For the Wikidata services, I used the "slow" reconciliation function; it takes a fairly long time.

The test files I used are short clips extracted from the large data dumps, each at most 100 records long.

In the new reconcile.py that I wrote on another branch, the MusicBrainz prepending process is automated, and the script also checks that each URL is valid. This reconcile.py also reconciles the literals against the Wikidata reconciliation API, but it is not concise.
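
Purely as an illustration of the idea (not the actual reconcile.py, which is linked in the next comment), such a script might look like this; the reconciliation endpoint and helper names are assumptions, following the W3C Reconciliation Service API:

```python
import json

import requests

# Assumed public Wikidata reconciliation endpoint (the one OpenRefine ships with).
WIKIDATA_RECON_API = "https://wikidata.reconci.link/en/api"

def check_musicbrainz_url(url: str) -> bool:
    """Verify that a constructed MusicBrainz URL resolves.
    (As noted below, this is link checking, not reconciliation.)"""
    resp = requests.head(url, allow_redirects=True, timeout=10)
    return resp.ok

def reconcile_literal(name: str, type_qid: str) -> str | None:
    """Ask the reconciliation service for the best match of a literal
    against a Wikidata type, e.g. type_qid="Q5" for humans."""
    queries = {"q0": {"query": name, "type": type_qid}}
    resp = requests.get(
        WIKIDATA_RECON_API,
        params={"queries": json.dumps(queries)},
        timeout=30,
    )
    resp.raise_for_status()
    candidates = resp.json()["q0"]["result"]
    return candidates[0]["id"] if candidates else None

# Example usage with placeholder values:
# check_musicbrainz_url("https://musicbrainz.org/artist/1f9df192-a621-4f54-8850-2c5373b7eac9")
# reconcile_literal("Ludwig van Beethoven", "Q5")
```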

ahankinson commented 5 months ago

Looking here: https://github.com/DDMAL/linkedmusic-datalake/blob/e97283784d8468597238de6c4d5e3c5a5f8476b3/musicbrainz/csv/reconcile.py

I'm concerned that there's a divergence of understanding of what "reconcile" means in the context of OpenRefine.

Reconciliation in OpenRefine means matching objects across datasets using a Reconciliation API service.

https://openrefine.org/docs/technical-reference/reconciliation-api

This means identifying that this person:

https://rism.online/people/11035

this person:

https://www.wikidata.org/wiki/Q255

and this person:

https://musicbrainz.org/artist/1f9df192-a621-4f54-8850-2c5373b7eac9

are the same person.

You can do dataset normalization in OpenRefine, which means doing things like string appending to clean up your data and make it consistent. But you don't really need OpenRefine to do that.

It looks like you're doing an HTTP request to verify with MusicBrainz that the page exists. That's not reconciliation either; that's just checking for broken links...

I'm also not entirely sure what you're doing with Wikidata API calls. Are you just using the API to look up the Q number? You can get the data just by structuring your URL like this:

https://www.wikidata.org/wiki/Special:EntityData/Q255.json

The documentation here has more details: https://www.wikidata.org/wiki/Wikidata:Data_access#Linked_Data_Interface_(URI)
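
For instance, fetching that JSON in Python needs nothing beyond an HTTP request (a small illustrative sketch):

```python
import requests

# Fetch the linked-data JSON for a known Q number (Q255 = Ludwig van Beethoven).
resp = requests.get(
    "https://www.wikidata.org/wiki/Special:EntityData/Q255.json", timeout=30
)
resp.raise_for_status()
entity = resp.json()["entities"]["Q255"]
print(entity["labels"]["en"]["value"])  # "Ludwig van Beethoven"
```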

Maybe I'm missing something, and I don't actually know what you've been tasked to do, but at the very least we need clarity that you're not actually reconciling things in OpenRefine, so that shouldn't be the word that's being used.

Yueqiao12Zhang commented 5 months ago

For now, I will just reconcile everything in OpenRefine and put reconcile.py on hold. The main task now is to convert the reconciled CSV to Turtle and upload it to Virtuoso.
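
For the upload step, one possible route is the SPARQL 1.1 Graph Store HTTP Protocol, which Virtuoso implements; the endpoint path, graph URI, and credentials below are placeholders, so this is only a sketch of the idea:

```python
import requests
from requests.auth import HTTPDigestAuth

# Placeholders: Virtuoso's default authenticated Graph Store endpoint,
# a made-up graph URI, and dummy credentials.
ENDPOINT = "http://localhost:8890/sparql-graph-crud-auth"
GRAPH_URI = "https://example.org/graphs/musicbrainz"

# PUT the Turtle file produced earlier into the named graph.
with open("artists.ttl", "rb") as f:
    resp = requests.put(
        ENDPOINT,
        params={"graph": GRAPH_URI},
        data=f,
        headers={"Content-Type": "text/turtle"},
        auth=HTTPDigestAuth("dba", "dba"),  # dummy credentials
    )
resp.raise_for_status()
```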

Yueqiao12Zhang commented 5 months ago

I will look into this information.

ahankinson commented 5 months ago

I should also add that reconciliation can also mean settling on a single form of the data: if you have a dataset that contains "Ludwig Van Beethoven", "Van Beethoven, Ludwig", and "Beethoven, Ludwig Van", you can use an external service to choose one form of the name across your entire dataset and match them all to that external service.

But I suspect that is less of a problem for your data.

Yueqiao12Zhang commented 5 months ago

Yes. In the databases I've worked with, the names are all the same, since they all refer to the same page on the database's website.

candlecao commented 3 months ago

In addition, we have summarized the different approaches to CSV2RDF: see "Different ways as to CSV2RDF".