culturecreates / footlight-aggregator

A tool to inject entities from Artsdata to footlight

Export all Events to Artsdata on a schedule #65

Closed saumier closed 10 months ago

saumier commented 1 year ago

Export all events including linked places, people, orgs to Artsdata in RDF

saumier commented 1 year ago

@sahalali Notes from our design discussion on exporting to Artsdata.

Sequence

  1. Actor 1 calls the CMS API to generate RDF
  2. CMS generates RDF locally and returns a blob in the request response (using the same pattern as the CSV download)
  3. Actor 1 saves the data to S3
  4. Actor 1 calls the Artsdata Databus and passes it the DownloadURL from S3
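
The sequence above can be sketched as a small orchestration function. This is only an illustration of the ordering, not the actual workflow: the function name `run_export` and its three callables are hypothetical, and the real HTTP, S3, and Databus calls are left to whoever supplies them.

```python
from typing import Callable


def run_export(
    fetch_rdf: Callable[[], bytes],
    upload: Callable[[bytes], str],
    notify_databus: Callable[[str], None],
) -> str:
    """Run the steps in order and return the download URL.

    All three callables are hypothetical stand-ins:
      fetch_rdf      -> steps 1-2, asks the CMS for the RDF blob
      upload         -> step 3, stores the blob (e.g. on S3), returns its URL
      notify_databus -> step 4, hands the URL to the Artsdata Databus
    """
    rdf = fetch_rdf()
    if not rdf:
        # Stop early so an empty dump is never published.
        raise ValueError("CMS returned an empty RDF export")
    download_url = upload(rdf)
    notify_databus(download_url)
    return download_url


if __name__ == "__main__":
    # Dry run with fake callables; no network access is needed.
    sent = []
    url = run_export(
        lambda: b"<a> <b> <c> .",
        lambda blob: "https://example.org/dump.ttl",
        sent.append,
    )
    print(url, sent)
```

Passing the steps in as callables keeps the credentials and scheduling outside this function, which matches the idea that only Actor 1 needs to know about them.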

To generate JSON-LD, let's try looping over the published events to generate JSON like the CMS Open API, then add a single @context to convert it to JSON-LD. Some transformations may be needed, but in an ideal world there are none. We should probably not nest place, organization, person, and taxonomy inside events. Instead, the JSON can be flat and include all classes; otherwise the data will repeat. For example, we should avoid every event at the same place carrying the same nested place data.
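
A minimal sketch of the flat structure, assuming each event and place already carries an "id" holding a URI. The context, the `flatten_events` helper, and the sample URIs are all hypothetical; only places are handled here, but organizations, people, and taxonomies would follow the same pattern.

```python
import json

# Hypothetical single context: plain keys map to schema.org,
# and "id"/"type" are aliases for the JSON-LD keywords.
CONTEXT = {"@vocab": "http://schema.org/", "id": "@id", "type": "@type"}


def flatten_events(events):
    """Return a JSON-LD document whose @graph lists each entity once."""
    nodes = {}
    for event in events:
        flat = dict(event)  # shallow copy so the input is not modified
        place = flat.get("location")
        if isinstance(place, dict):
            # Keep the first copy of the place, then reference it by id.
            nodes.setdefault(place["id"], place)
            flat["location"] = {"id": place["id"]}
        nodes[flat["id"]] = flat
    return {"@context": CONTEXT, "@graph": list(nodes.values())}


if __name__ == "__main__":
    hall = {"id": "http://example.org/place/1", "type": "Place", "name": "Hall"}
    events = [
        {"id": "http://example.org/event/1", "type": "Event", "location": hall},
        {"id": "http://example.org/event/2", "type": "Event", "location": hall},
    ]
    print(json.dumps(flatten_events(events), indent=2))
```

Two events at the same place produce three nodes in the graph rather than two events that each repeat the place.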

Actor 1 should be a workflow on GitHub in a repo called "artsdata-planet-footlight", and calls to the Artsdata Databus will use the credentials stored in that repo.

This general approach keeps the CMS responsibilities focused on generating CSV and RDF. The CMS does not need to know how to call the Artsdata Databus with credentials, nor how to schedule tasks. None of the extra controls that we had in Capacitor to manage this flow are needed in the CMS backend. GitHub Actions can be used to monitor task success/failure and to manage the archive of data dump versions.

Let me know if you have any other concerns at your convenience.

saumier commented 1 year ago

@sahalali I created the Github repo https://github.com/culturecreates/artsdata-planet-footlight

I added your dump file in a directory called "dump". I added a basic Ruby program to frame the data and save the result to the "output" directory. Finally, I passed the data through SHACL from Artsdata and saved a text report, ending in .txt, in the "output" directory.

The first thing I noticed is that the Event location is missing.

I would like you to walk me through your code so I can make comments directly in the code.

There are also many terms that are 'not' schema.org terms, such as our taxonomies and custom additional types, that we should be prefixing with something like http://kg.footlight.io/ instead of http://schema.org.
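
As a rough illustration of the prefixing idea, the sketch below uses a context with schema.org as the default vocabulary and a separate prefix for Footlight terms. The prefix name "footlight", the term "eventDiscipline", and the `expand_term` helper are hypothetical, and the helper is a simplified stand-in for JSON-LD IRI expansion, not a full implementation.

```python
# Hypothetical context: unprefixed terms fall back to schema.org,
# custom terms are written with the "footlight:" prefix.
CONTEXT = {
    "@vocab": "http://schema.org/",
    "footlight": "http://kg.footlight.io/",
}


def expand_term(term, context=CONTEXT):
    """Expand a term roughly the way a JSON-LD processor would."""
    prefix, sep, suffix = term.partition(":")
    if sep:
        if prefix in context:
            return context[prefix] + suffix  # compact IRI such as footlight:x
        return term  # already an absolute IRI such as http://...
    return context["@vocab"] + term  # plain term uses the default vocabulary


if __name__ == "__main__":
    print(expand_term("name"))
    print(expand_term("footlight:eventDiscipline"))
```

With this split, anything that is not a schema.org term ends up under http://kg.footlight.io/ instead of being silently placed in the schema.org namespace.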

Take a look at the output directory and try to improve the results.

If you can send me a dump of the plain json before applying the @context, then we can iterate more quickly on improving the @context.

sahalali commented 1 year ago

json-ld.zip

@saumier The zipped folder contains the plain JSON file with 25 events, the current JSON-LD file, and a file containing the context and frame.

sahalali commented 1 year ago

@saumier Can you please look into it and help me improve the JSON-LD?

saumier commented 1 year ago

@sahalali I created a basic Ruby practice repo that starts with a minimal JSON-LD Context and JSON-LD Frame, converts the 25 events, and then validates the result with a minimal SHACL.

Take a look and we can start adding more properties gradually one by one.

The next property we should add is "url". Take a stab and I can comment on this specific case before going any further.

https://github.com/culturecreates/practice-rdf-ruby

saumier commented 11 months ago

@sahalali I am trying to export RDF from Footlight CMS but I get a 504 Undocumented | Error: Gateway Time-out.

'https://api.cms.footlight.io/entities/export?file-format=ttl&entity=Event' \
  -H 'calendar-id: 6308ef4a7f771f00431d939a' \

I propose starting simple with the basic properties I am currently uploading manually to Artsdata which are:

Can you get the export to work with those properties?

saumier commented 11 months ago

@sahalali I also noticed that the system was frozen during the download.

saumier commented 11 months ago

@sahalali The properties need to be fixed as follows:

  - schema:additionalType --> must point to a URI
  - schema:name --> OK
  - schema:location --> must point to a URI that has a type "Place" or "VirtualLocation"
  - schema:address --> must point to a URI that has type "PostalAddress"
  - schema:sameAs --> must point to a URI
  - schema:startDate --> partially OK but schema:startDateTime should not exist
  - schema:endDate --> partially OK but schema:endDateTime should not exist
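
These rules can be checked before sending anything to Artsdata. The sketch below is not the Artsdata SHACL, just a plain Python approximation of the list above; it assumes flat JSON where each node has an "id" and a single string "type", and where references are either a URI string or an {"id": ...} object. The helper names are hypothetical.

```python
from urllib.parse import urlparse


def is_uri(value):
    """True for http(s) strings that have a host part."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def ref_id(value):
    """Return the URI a value points to, or None if it is not a URI."""
    if isinstance(value, dict):
        value = value.get("id")
    return value if is_uri(value) else None


def check_event(event, nodes):
    """Return a list of rule violations; nodes maps id -> node."""
    problems = []
    for prop in ("additionalType", "sameAs"):
        if prop in event and ref_id(event[prop]) is None:
            problems.append(f"{prop} must point to a URI")
    typed = {
        "location": {"Place", "VirtualLocation"},
        "address": {"PostalAddress"},
    }
    for prop, allowed in typed.items():
        if prop not in event:
            continue
        target = nodes.get(ref_id(event[prop]), {})
        if target.get("type") not in allowed:
            kinds = " or ".join(sorted(allowed))
            problems.append(f"{prop} must point to a URI of type {kinds}")
    for prop in ("startDateTime", "endDateTime"):
        if prop in event:
            problems.append(f"{prop} should not exist")
    return problems
```

An event whose location is a bare string such as a venue name would be reported, as would any event that still carries startDateTime or endDateTime.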

saumier commented 11 months ago

@sahalali I realize that this is quite hard because we are trying to "backwards" engineer the @context and the JSON has diverged quite a bit from a schema.org type of JSON-LD. Another approach, maybe more developer friendly, is to use OntoRefine to map the JSON to RDF. Let me know what you think. The Github workflow would do a GET from the Open API and then convert it to RDF and send it to Artsdata. I can help you with the OntoRefine mapping. I think this will be faster as well.

sahalali commented 10 months ago

@saumier I like the idea of using OntoRefine. Can you point me to the Artsdata API for sending data to Artsdata? I will prepare a mapping file.

saumier commented 10 months ago

@sahalali Please check the workflows that @dev-aravind has created for Scenes Francophones. You can use the same variables: https://github.com/culturecreates/artsdata-planet-scenesfrancophones/issues/7

sahalali commented 10 months ago

@saumier Can you please add the secret "PUBLISHER_URI_GREGORY" as an organization-level secret?

saumier commented 10 months ago

@sahalali I am closing this issue from the Footlight CMS project. It is a duplicate of https://github.com/culturecreates/artsdata-orion/issues/3 and https://github.com/culturecreates/artsdata-planet-footlight/issues/10