culturecreates / artsdata-orion

Collection of data sources loaded into Artsdata by Culture Creates

Template workflow for Ontorefine #3

Closed saumier closed 8 months ago

saumier commented 10 months ago

The goal is to create an example workflow that uses Ontorefine to transform data from a CSV file to Turtle format. The workflow should have a manual trigger and be stored in the main branch of "artsdata-planet-unmanaged".

Steps (sketched in the example below):

  1. Checkout the main branch with the local CSV file and JSON mapping file
  2. Start the docker image for Ontorefine
  3. Use ontorefine-cli with the transform command, passing in the CSV filename and the mapping filename.
  4. Save the output Turtle file to the GitHub repo.
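
A rough sketch of what these steps could look like as a script (the file names, docker image, and CLI arguments below are assumptions for illustration, not the actual workflow definition):

```ruby
# Hypothetical orchestration of the four steps above; file names, the docker
# image, and the ontorefine-cli arguments are assumptions, not the real workflow.
CSV_FILE     = "events.csv"    # local CSV checked out from main (assumed name)
MAPPING_FILE = "mapping.json"  # OntoRefine JSON mapping file (assumed name)
OUTPUT_FILE  = "events.ttl"    # Turtle output saved back to the repo (assumed name)

# 1. The CSV and mapping files come from the checked-out main branch.
[CSV_FILE, MAPPING_FILE].each { |f| abort("missing #{f}") unless File.exist?(f) }

# 2. Start the docker image for Ontorefine (image name is an assumption).
system("docker run -d --name ontorefine ontotext/refine") || abort("could not start Ontorefine")

# 3. Run the transform command with the CSV and mapping filenames
#    (exact flags and argument order are illustrative only).
system("ontorefine-cli transform #{CSV_FILE} #{MAPPING_FILE} > #{OUTPUT_FILE}") || abort("transform failed")

# 4. Commit the output Turtle file back to the GitHub repo.
system("git add #{OUTPUT_FILE} && git commit -m 'Add transformed Turtle'")
```
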
saumier commented 9 months ago

@sahalali This is the issue that I mentioned this morning (your Wednesday evening). Any help is welcome. I am using OntoRefine for ETL from Ville de Laval to Artsdata, and I have the mapping.json. But until I get this workflow to work I need to run it manually.

sahalali commented 9 months ago

How can I help you with this? Can I take up this issue and prepare a script to run it?

saumier commented 9 months ago

@sahalali You can advance this by working on https://github.com/culturecreates/footlight-aggregator/issues/72 and https://github.com/culturecreates/footlight-aggregator/issues/65

sahalali commented 8 months ago

@saumier The file is not stored in GitHub for now. The RDF file is stored in S3.

I've updated the OntoRefine config and pushed the latest events from tout-culture to Artsdata. @saumier Can you please review it and let me know if any changes are required?

I think I need to add support for uploading places, persons and organizations next, and add more details like performer, supported, etc. to events.

saumier commented 8 months ago

@sahalali This is looking much better. To fix:

Next I would work on Place, since those are really important for minting Artsdata URIs, especially the property "containedInPlace". You might also want to look at how I uploaded regions using schema:additionalProperty [ a schema:LocationFeatureSpecification ].
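
For illustration, a minimal RDF.rb sketch of that shape, assuming the linkeddata gem and hypothetical URIs (the real data comes from the OntoRefine mapping):

```ruby
require "linkeddata"  # RDF.rb plus vocabularies and the Turtle writer

schema  = RDF::Vocab::SCHEMA
graph   = RDF::Graph.new
place   = RDF::URI("http://example.org/place/venue-1")  # hypothetical Place URI
region  = RDF::URI("http://example.org/place/laval")    # hypothetical region URI
feature = RDF::Node.new

# A Place linked to its region via containedInPlace (important for minting Artsdata URIs).
graph << [place, RDF.type, schema.Place]
graph << [place, schema.containedInPlace, region]

# The pattern used for regions: schema:additionalProperty [ a schema:LocationFeatureSpecification ].
graph << [region, RDF.type, schema.Place]
graph << [region, schema.additionalProperty, feature]
graph << [feature, RDF.type, schema.LocationFeatureSpecification]

puts graph.dump(:ttl, prefixes: { schema: schema.to_uri })
```
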

Once Places are uploaded I can delete my graphs that I upload with the spreadsheet plugin.

saumier commented 8 months ago

@sahalali Longitude and Latitude are good, and they match the current properties in the table of the model documentation. In the doc, "geo" is recommended and will slowly become more used, but at the moment most places don't use it yet. postalCode is also good.
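
As an illustration, the two forms side by side (sample values; the placement of postalCode on a PostalAddress node is an assumption, not something specified above):

```ruby
require "linkeddata"

schema = RDF::Vocab::SCHEMA
graph  = RDF::Graph.new
place  = RDF::URI("http://example.org/place/venue-1")  # hypothetical URI

# Current form from the model documentation: latitude/longitude directly on the Place.
graph << [place, schema.latitude,  RDF::Literal.new(45.60)]   # sample coordinates
graph << [place, schema.longitude, RDF::Literal.new(-73.71)]

# postalCode, shown here on a schema:PostalAddress node (placement is an assumption).
address = RDF::Node.new
graph << [place, schema.address, address]
graph << [address, RDF.type, schema.PostalAddress]
graph << [address, schema.postalCode, RDF::Literal.new("H7V 1A1")]  # sample value

# The "geo" form recommended in the doc: a nested schema:GeoCoordinates node.
geo = RDF::Node.new
graph << [place, schema.geo, geo]
graph << [geo, RDF.type, schema.GeoCoordinates]
graph << [geo, schema.latitude,  RDF::Literal.new(45.60)]
graph << [geo, schema.longitude, RDF::Literal.new(-73.71)]
```
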

Let me know when you are done with places, so I can delete the graphs that I manually upload each week. Don't forget to assign this issue back to me if you have questions or if you want me to review. Thx.

sahalali commented 8 months ago

@saumier I've added a transform configuration for Places. However, while trying to export CMS Places to Artsdata, the Artsdata API throws an exception: unknown RDF format: {:base_uri=>"https://footlight-aggregator-pipeline-files.s3.ca-central-1.amazonaws.com/entities.ttl", :content_type=>"binary/octet-stream", :file_name=>"https://footlight-aggregator-pipeline-files.s3.ca-central-1.amazonaws.com/entities.ttl"}. This may be resolved by requiring the 'linkeddata' gem.

As part of debugging, I tried uploading the file transformed using OntoRefine into a local GraphDB, and the same file can be uploaded locally without issue.

Can you please help me to find the root cause of this issue?

saumier commented 8 months ago

@sahalali The root cause is the unknown file format. This error is triggered by the Databus because it cannot figure out the Content-Type of the file to download.

The Content-Type header it receives is "binary/octet-stream", which is incorrect because the file is a text file, not binary. In this case the code still tries to guess the format. The Databus code uses RDF::Graph.load, which has several internal ways to guess the format, including looking at the file extension. This works with .json files but is not fail-safe. It can also try to look at the first lines of the file, but I don't know the full algorithm inside RDF::Graph.load, nor why it works when you send events and not when you send places.
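
For reference, a small sketch of the two calls, assuming RDF.rb via the linkeddata gem; passing an explicit format skips the guessing entirely:

```ruby
require "linkeddata"

url = "https://footlight-aggregator-pipeline-files.s3.ca-central-1.amazonaws.com/entities.ttl"

# Relies on RDF.rb's guessing (headers, file extension, a look at the content);
# this is the call that fails when the guess comes up empty.
graph = RDF::Graph.load(url)

# Telling the reader the format explicitly sidesteps the guessing.
graph = RDF::Graph.load(url, format: :ttl)
```
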

I suggest you either switch to .json and serialize with JSON-LD, or else set the correct Content-Type when you upload the Turtle files to S3.
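
For the second option, a sketch using the aws-sdk-s3 gem (the actual workflow may upload differently; the bucket and key are taken from the URL above):

```ruby
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "ca-central-1")

# Upload the Turtle file with an explicit Content-Type so the Databus
# (and RDF::Graph.load) does not have to guess the format.
s3.put_object(
  bucket:       "footlight-aggregator-pipeline-files",
  key:          "entities.ttl",
  body:         File.read("entities.ttl"),
  content_type: "text/turtle"
)
```
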

To investigate you can look at the Content-Type metadata in the S3 console. S3 is also trying to guess it by looking at the file extension.

Unrelated: I noticed that you are overwriting the file on S3 each time. Instead you should version the file (using a date stamp) and write different files to S3. The difference between creating a file and replacing a file may be related to why it sometimes works and not other times.
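
A tiny illustration of a date-stamped object key (the naming scheme itself is just a suggestion):

```ruby
require "date"

# Write a new, versioned object each run instead of replacing entities.ttl.
key = "entities-#{Date.today.iso8601}.ttl"  # e.g. "entities-2024-05-01.ttl"
```
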

saumier commented 8 months ago

@sahalali I tested my theory by changing the Content-Type metadata in the S3 bucket to text/turtle, and I was able to paste your URL https://footlight-aggregator-pipeline-files.s3.ca-central-1.amazonaws.com/entities.ttl into the Nebula viewer here and click "view" to load and display the data. So now I think it will work with the Artsdata Databus as well.

sahalali commented 8 months ago

Updated the workflow

  1. to correct Content-Type.
  2. to avoid rewriting the transformed file.
sahalali commented 8 months ago

@saumier Can we delete the file uploaded to S3 as the final step of the workflow execution? Otherwise, can you please suggest how often we should clear the bucket?

saumier commented 8 months ago

I propose keeping the files for 1 year. This means deleting files that are over 1 year old. Let's start with this.

The other approach, which is a bit more sophisticated, is as follows:

The reason I like to keep each version of the data is to be able to trace issues. Another reason is to do statistics. The Databus stores the version of each artifact and displays it in the annotation of each triple. This is the general approach of the Databus, which uses versioned artifacts like in software, but with data.

sahalali commented 8 months ago

Added a new lifecycle configuration to the S3 bucket to clean it up by permanently removing files older than 365 days.
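
For reference, roughly what such a rule looks like if set through the aws-sdk-s3 gem (the rule id and prefix below are placeholders; the actual configuration may have been created in the S3 console):

```ruby
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "ca-central-1")

# Permanently expire objects once they are older than 365 days.
s3.put_bucket_lifecycle_configuration(
  bucket: "footlight-aggregator-pipeline-files",
  lifecycle_configuration: {
    rules: [
      {
        id:         "expire-old-transform-files",  # placeholder rule name
        status:     "Enabled",
        filter:     { prefix: "" },                # apply to the whole bucket
        expiration: { days: 365 }
      }
    ]
  }
)
```
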

saumier commented 8 months ago

@sahalali Thx for the bucket config. I am closing this issue now. Please continue with the Signé Laval workflow to get the CMS data into Footlight: issue https://github.com/culturecreates/artsdata-planet-footlight/issues/10