culturecreates / artsdata-orion

Collection of data sources loaded into Artsdata by Culture Creates

Analyze LaVitrine site for crawling #68

Closed saumier closed 2 months ago

saumier commented 2 months ago

Analyze the website lavitrine.com to see whether we can crawl their events using our current code in Orion, or whether development is needed in our current Ruby code or workflows. You can try loading it to see. If development is needed, please describe it and give a rough work estimate.

Change requested: Use workflow artifacts to save the dump after fetching the data, instead of committing it to the repo. The artifact should be kept for 8 days before being automatically removed. If this approach is successful, we can use it everywhere in place of committing the file to the repo, and add a parameter that sets how long the artifact is kept in GitHub, with a default of 8 days (1 week plus 1 day).
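A minimal sketch of what the requested change could look like, assuming a reusable workflow and the standard `actions/upload-artifact` action. The workflow name, the `retention-days` input name, the artifact name, and the `output/dump.jsonld` path are all hypothetical placeholders, not names taken from Orion.

```yaml
# Sketch only: names and paths below are illustrative assumptions.
name: Fetch and save dump (sketch)

on:
  workflow_call:
    inputs:
      retention-days:
        description: "Number of days to keep the dump artifact"
        type: number
        default: 8 # 1 week plus 1 day

jobs:
  fetch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Placeholder for the existing fetch step that produces the dump.
      - name: Fetch data
        run: |
          mkdir -p output
          echo "{}" > output/dump.jsonld

      # Save the dump as a workflow artifact instead of committing it.
      - name: Save dump as artifact
        uses: actions/upload-artifact@v4
        with:
          name: lavitrine-dump
          path: output/dump.jsonld
          retention-days: ${{ inputs.retention-days }}
```

With this shape, a calling workflow that omits `retention-days` would get the 8-day default, and a later job could retrieve the dump with `actions/download-artifact`.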

Here is a complete event listing link

This is a large site with about 7000 events. Also, there are different sections for Events, Exhibitions, Festivals and series.

We also want to crawl Artists, Organizations, and Places.

Specifically, is any Ruby or workflow development needed to get the JSON-LD into Artsdata, without worrying about the quality of the JSON-LD (it is OK to ignore SHACL violations)?

Can we:

dev-aravind commented 2 months ago

@saumier

Rough estimate for changes needed:

Write a script to get all the ShadowRoot information from the website for the listing page - 3h
Update Orion to support getting entity URLs using the ShadowRoot method - 3h
Crawling the data - 2h
Use workflow artifacts to save the dump - 3h
Testing - 1h

Total - 12h

saumier commented 2 months ago

@dev-aravind Did you include in your estimate the change requested: Use workflow artifacts to save the dump?

dev-aravind commented 2 months ago

@saumier The estimate is now updated.