Crawl of IPAA members - Githubissues

saumier commented 10 months ago

ETL individual people and organizations and commit the JSON-LD files to the /output directory in this repo.

saumier commented 10 months ago

@dev-aravind Please use the same approach as in scenesfrancophones.ca but schedule the ETL to run every 6 months. There are no events, so the pages don't need to be crawled as frequently. Don't worry about duplicate code at the moment since this is on a tight deadline. After we have done a few different websites, we will analyze the general needs and see if we can create a Ruby gem and GitHub Action that we can reuse everywhere. We will come back later and refactor these repos.

dev-aravind commented 10 months ago

@saumier I've added workflows to commit the json-ld files.

saumier commented 10 months ago

@dev-aravind Looking good :-) Please complete the pipeline all the way to Artsdata using the artifact names ipaa-artists and ipaa-organizers. You can tweak the workflows from the scenesfrancophones repo. Please also add a .rvmrc so that whenever we cd into this project it sets the correct Ruby version and loads the gemset. The .rvmrc can contain the following: rvm use 3.1.2@artsdata-planet-ipaa --create. I have a general rule that each repo should have its own gemset. In this case we are using the same Ruby version (3.1.2) but a new gemset to keep the gems independent.

dev-aravind commented 10 months ago

@saumier I updated the pipeline to include the push to Artsdata. However, the artsdata-planet-ipaa repository couldn't access the PUBLISHER_URI_GREGORY secret, so the push to Artsdata failed during my test runs. Can you look into this?

saumier commented 10 months ago

@dev-aravind I have made the PUBLISHER_URI_GREGORY secret available to the artsdata-planet-ipaa repository.

saumier commented 10 months ago

@dev-aravind This website has a lot of good JSON-LD but lacks URIs for people and organizations. When an entity has no URI, the RDF graph represents it with what is called a "blank node". This makes the data difficult to use because a blank node cannot be referenced externally (a blank node identifier is scoped only to one serialization of a particular RDF graph).

To remedy this situation we need to generate URIs for each of the blank nodes in the graph, and we need to add the webpage provenance to each entity before sending to Artsdata.

Here is your challenge:

For each subject on each web page, add a prov:wasDerivedFrom triple that points to that web page.
Replace all blank nodes with URIs generated by the SPARQL function UUID().

Once again, this can be done in Ruby code or using SPARQL.
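
For the Ruby route, here is a minimal sketch of the blank-node replacement done directly on the parsed JSON-LD, before it is loaded as RDF. It is an illustration under assumptions, not the code that ended up in this repo: the names assign_uuid_ids, new_uuid_urn and WRAPPER_KEYS are hypothetical, the pages are assumed to use the literal "@id" keyword (no alias such as "id"), and only the most common JSON-LD shapes are handled. It uses SecureRandom.uuid to build urn:uuid: URIs, the same scheme that SPARQL's UUID() produces.

```ruby
require 'json'
require 'securerandom'

# Keys that mark a wrapper object rather than a node; these must not get an "@id".
WRAPPER_KEYS = %w[@graph @list @set].freeze

# Builds a fresh URI in the urn:uuid: scheme.
def new_uuid_urn
  "urn:uuid:#{SecureRandom.uuid}"
end

# Returns a new structure in which every JSON-LD node object has a URI.
# labels remembers the URI chosen for each "_:" label so repeated references agree.
def assign_uuid_ids(value, labels = Hash.new { |hash, label| hash[label] = new_uuid_urn })
  case value
  when Array
    value.map { |item| assign_uuid_ids(item, labels) }
  when Hash
    return value if value.key?('@value') # a literal value, not a node

    # Recurse into every property except the context definition.
    node = value.each_with_object({}) do |(key, child), out|
      out[key] = key == '@context' ? child : assign_uuid_ids(child, labels)
    end
    return node if WRAPPER_KEYS.any? { |key| node.key?(key) }

    id = node['@id']
    if id.nil?
      node['@id'] = new_uuid_urn   # implicit blank node
    elsif id.to_s.start_with?('_:')
      node['@id'] = labels[id]     # explicitly labelled blank node
    end
    node
  else
    value
  end
end

# Made-up example page content (not real IPAA data).
page = JSON.parse(<<~JSON)
  {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Example Artist",
    "memberOf": { "@type": "Organization", "name": "Example Group" }
  }
JSON

puts JSON.pretty_generate(assign_uuid_ids(page))
```

Both the Person and the nested Organization receive their own urn:uuid: identifier, so neither becomes a blank node when the JSON-LD is converted to RDF. Because the identifiers are random, running the sketch twice on the same page yields different URIs.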

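If you use SPARQL, you can generate the query with a custom Ruby function such as my_sparql_template(entity), and main.rb will look something like this:

# Load the JSON-LD from one member web page into its own graph
webpage_graph = RDF::Graph.load(entity)
# Run the SPARQL generated for this page against that graph
webpage_graph.query(my_sparql_template(entity))
# Merge the page graph into the combined graph
graph << webpage_graph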
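
The body of my_sparql_template is not shown in this thread. As a hedged sketch of what such a function could return for the provenance step, the version below builds a SPARQL 1.1 Update string that links every subject to the page URL. The function body, the PROV_PREFIX constant and the example.org URL are assumptions for illustration. Executing the string also requires a SPARQL engine, which is not shown here.

```ruby
PROV_PREFIX = 'http://www.w3.org/ns/prov#'

# Hypothetical body for my_sparql_template: returns a SPARQL 1.1 Update string
# that links every subject in the graph to the web page it was loaded from.
def my_sparql_template(entity)
  # Reject characters that cannot appear inside <...> in SPARQL.
  raise ArgumentError, "not a usable IRI: #{entity.inspect}" if entity.match?(/[<>"{}|\\^`\s]/)

  <<~SPARQL
    PREFIX prov: <#{PROV_PREFIX}>
    INSERT { ?s prov:wasDerivedFrom <#{entity}> }
    WHERE  { ?s ?p ?o }
  SPARQL
end

# Made-up page URL, used only to show the generated text.
puts my_sparql_template('https://example.org/members/example-artist')
```

Generating the query per page keeps the page URL out of the shared SPARQL text, so the same template serves every crawled URL.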

Also, the entities are of type Organization, not organizer. Please change organizer to organization everywhere you wrote it in this repo. Unlike the previous project with scenesfrancophones, these web pages all describe members, not the entity that organized an event. An entity doing the organizing can be a Person or an Organization. In the case of IPAA, one organization may be a music group of 2 people giving the performance (role of performer), while another organization hires the music group (role of organizer). It is unfortunate that the words 'organizer' and 'organization' are so similar in English, because they mean different things.

saumier commented 10 months ago

@dev-aravind I created the tests and SPARQL for replacing all blank nodes. It was quite advanced, so I went ahead and added it to main.rb. You can cd into tests and run ruby replace_blank_nodes_test.rb to see the tests pass. Have a look at the tests and ask me any questions you have.

I used the SPARQL function UUID(), which generates URIs in the urn: scheme. For some reason these URIs are not clickable on Nebula, so I will look into that on the Nebula side.

The next step is to add the prov:wasDerivedFrom triples for each web page that we crawl. Try to create a test for that, following the pattern of my tests.

dev-aravind commented 10 months ago

@saumier I raised a PR for you to review with all the changes that you requested. I had some concerns about the data:

1: When adding the wasDerivedFrom triple, entities of type CreativeWork ended up with multiple wasDerivedFrom values (see comment).

2: The "Indigenous Performing Arts Alliance" organization was duplicated because it appears on every web page.

saumier commented 10 months ago

@dev-aravind I requested some changes to your PR https://github.com/culturecreates/artsdata-planet-ipaa/pull/4.

saumier commented 10 months ago

@dev-aravind I modified your test. Now you can try to modify the SPARQL to make the test pass ;-). You may want to split the test in 2 tests.

dev-aravind commented 10 months ago

@saumier I updated the PR to add the wasDerivedFrom triple only to top-level nodes. Please take a look and let me know if you see any issues.
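
To illustrate the idea of restricting provenance to top-level nodes, here is a small sketch over plain [subject, predicate, object] arrays rather than a real RDF graph. It assumes that "top level" means a subject that never appears as the object of another triple. The thread does not spell out the exact rule used in the PR, and the function names and example.org identifiers are hypothetical.

```ruby
require 'set'

PROV_WAS_DERIVED_FROM = 'http://www.w3.org/ns/prov#wasDerivedFrom'

# triples is an array of [subject, predicate, object] strings.
# A subject counts as top level when no triple uses it as an object.
def top_level_subjects(triples)
  objects = triples.map { |_subject, _predicate, object| object }.to_set
  triples.map(&:first).uniq.reject { |subject| objects.include?(subject) }
end

# Returns the triples plus one wasDerivedFrom triple per top-level subject.
def add_provenance(triples, webpage)
  triples + top_level_subjects(triples).map { |subject| [subject, PROV_WAS_DERIVED_FROM, webpage] }
end

# Made-up data: a person whose image is a nested node.
triples = [
  ['https://example.org/id/person-1', 'http://schema.org/name', 'Example Artist'],
  ['https://example.org/id/person-1', 'http://schema.org/image', 'https://example.org/id/image-1'],
  ['https://example.org/id/image-1', 'http://schema.org/caption', 'Portrait']
]

add_provenance(triples, 'https://example.org/members/example-artist').each do |triple|
  puts triple.join(' ')
end
```

In this example only person-1 gets a wasDerivedFrom triple, because image-1 is referenced by person-1 and is therefore treated as nested.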

saumier commented 10 months ago

@dev-aravind Looks good. I merged the PR and closed this issue.

culturecreates / artsdata-planet-ipaa

Crawl of IPAA members #1