culturecreates / artsdata-planet-gtq

Import pipeline from Grand Théâtre de Québec to Artsdata
Port GrandTheatreQuebec Huginn to Ruby #1

Open saumier opened 1 year ago

saumier commented 1 year ago

GrandTheatreQuebec already has a Planet. This issue is about removing the crawling that still happens on Huginn. The Huginn workflow has an extra step when crawling each event page: it scrapes the HTML for the event's keywords. The keywords are missing from the page's JSON-LD, so the workflow adds them to the JSON-LD and then maps them to the GrandTheatreQuebec event type SKOS.
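As a rough idea of what the "add keywords to JSON-LD" step could look like in Ruby, here is a minimal sketch. It assumes the keywords have already been scraped from the page into an array of strings (the actual selector is not shown in this issue), and that they are stored under the schema.org `keywords` property. The method name `add_keywords` and the sample event are hypothetical, not taken from the Huginn workflow.

```ruby
require "json"

# Hypothetical helper: returns a copy of the event JSON-LD hash with the
# scraped keywords merged into its "keywords" property.
# `keywords` is assumed to be an array of strings already scraped from the page.
def add_keywords(event, keywords)
  # Trim whitespace, drop blanks and duplicates.
  cleaned = keywords.map(&:strip).reject(&:empty?).uniq
  return event if cleaned.empty?

  # Keep any keywords already present (a single string or an array).
  existing = Array(event["keywords"])
  event.merge("keywords" => (existing + cleaned).uniq)
end

if __FILE__ == $PROGRAM_NAME
  # Illustrative event only; not real data from the site.
  event = { "@type" => "Event", "name" => "Exemple" }
  puts JSON.pretty_generate(add_keywords(event, [" Danse ", "Musique", "Danse"]))
end
```

The mapping from these keywords to the event type SKOS would still happen afterwards, as it does in the current workflow.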

If needed, I can give you access to Huginn.

So I propose working in steps (each step can be loaded into Artsdata for review)

saumier commented 2 months ago

@dev-aravind Only a couple of Huginn scenarios left to migrate ;-)

dev-aravind commented 1 day ago

@saumier will add the Huginn crawling details here.

saumier commented 1 day ago

@dev-aravind Here is the agent from Huginn. Instead of a CSS class, it uses "xpath": "//article[@class=\"show\"]//a" to get the list of @href values for the events, and prefixes each one with the site's base URL.

{
  "expected_update_period_in_days": "100",
  "url": [
    "https://grandtheatre.qc.ca/programmation/"
  ],
  "type": "html",
  "mode": "all",
  "extract": {
    "url": {
      "xpath": "//article[@class=\"show\"]//a",
      "value": "concat(\"https://grandtheatre.qc.ca\",@href)"
    }
  },
  "template": {
    "graph_name": "{{graph_name}}"
  }
}
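
For the port, the same extraction could be sketched in Ruby roughly as below. This is only an illustration under some assumptions: it uses REXML, which needs well-formed markup, so a real crawl of the live page would more likely go through an HTML parser such as Nokogiri with the same XPath. The method name `event_urls` and the sample markup and paths are made up for the example.

```ruby
require "rexml/document"

BASE_URL = "https://grandtheatre.qc.ca"

# Hypothetical helper mirroring the Huginn agent's "extract" section:
# select every <a> inside <article class="show"> and prefix its href
# with the base URL, like the agent's concat(...) expression.
def event_urls(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//article[@class="show"]//a')
              .map { |a| a.attributes["href"] }
              .compact
              .map { |href| BASE_URL + href }
end

if __FILE__ == $PROGRAM_NAME
  # Made-up, well-formed sample; the real listing page will differ.
  sample = <<~HTML
    <div>
      <article class="show"><h2><a href="/programmation/exemple-a/">A</a></h2></article>
      <article class="show"><a href="/programmation/exemple-b/">B</a></article>
      <article class="news"><a href="/nouvelles/exemple/">N</a></article>
    </div>
  HTML
  puts event_urls(sample)
end
```

Like the agent, this keeps every matching link, so if an event card contains more than one link the list may need de-duplicating before crawling.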