culturecreates / artsdata-orion

Collection of data sources loaded into Artsdata by Culture Creates

capitol-nb-ca workflow failing #94

Open saumier opened 4 days ago

saumier commented 4 days ago

The workflow for capitol-nb-ca is failing after being updated. The website has improved its JSON-LD. Gregory removed the custom location from the workflow and set the mode to "fetch-push".

https://github.com/culturecreates/artsdata-orion/actions/workflows/capitolnb-events.yml

dev-aravind commented 3 days ago

@saumier This is happening because the October events are now missing from the capitol.nb.ca website, and the Artsdata Pipeline is configured to stop looking for event URLs when it doesn't find any new ones on a page.

This can be solved in 2 ways:

  1. We just crawl the first page.
  2. We can use their pagination strategy, but stepping through it one page at a time leads to redundant requests to their website. The website displays 12 events per page, and the page number is the index of the first event shown rather than a page count: the first page shows events 0–11, and the next page starts at event 12. This means that calling pages in sequential order (1, 2, 3, ...) does not work as expected; only certain values (0, 12, 24, etc.) display unique sets of events. Stepping through those values will fetch all events on their website.
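
To illustrate why sequential page numbers are redundant, here is a minimal Python sketch. It assumes the page number behaves as an event offset, so that page 1 would list events 1–12; that behaviour is inferred from the description above and has not been verified against the site. The function names and the total of 30 events are hypothetical.

```python
PAGE_SIZE = 12  # events per page, as reported above


def events_shown(start, total_events, page_size=PAGE_SIZE):
    """Indices of the events one page would list.

    Assumption: `start` is the index of the first event on the page.
    """
    return list(range(start, min(start + page_size, total_events)))


def new_events_per_request(starts, total_events):
    """For each requested start value, count events not seen on earlier pages."""
    seen = set()
    counts = []
    for start in starts:
        fresh = [e for e in events_shown(start, total_events) if e not in seen]
        seen.update(fresh)
        counts.append(len(fresh))
    return counts


# Hypothetical site with 30 events.
sequential = new_events_per_request([0, 1, 2], total_events=30)
stepped = new_events_per_request([0, 12, 24], total_events=30)
```

Under that assumption, each sequential request after the first adds only one new event, while stepping by 12 returns a full page of new events each time.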

Let me know what you think.

saumier commented 3 days ago

I think crawling the first page was ok in the beginning when we wanted to load data and see what it contained. But now we need a better solution that gets all their pages of events.

Please consider adding a parameter like "offset" to increment the pagination by more than one. In this case the offset would be 12 and the API calls would be https://capitol.nb.ca/en/tickets-events?start=1 followed by https://capitol.nb.ca/en/tickets-events?start=12 and then https://capitol.nb.ca/en/tickets-events?start=24
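
A rough sketch of how such an "offset" parameter could work, written in Python for illustration only. It is not the Artsdata Pipeline's actual code: `page_urls`, `crawl`, `fetch_event_urls`, and `max_pages` are hypothetical names, and the first request is assumed to use start=0, following the 0, 12, 24 sequence described earlier in this thread. The loop keeps the existing rule of stopping when a page yields no new event URLs.

```python
from typing import Callable, Iterator, List


def page_urls(base_url: str, offset: int, param: str = "start",
              first: int = 0) -> Iterator[str]:
    """Yield listing URLs whose page parameter grows by `offset` each time."""
    position = first
    while True:
        yield f"{base_url}?{param}={position}"
        position += offset


def crawl(base_url: str, offset: int,
          fetch_event_urls: Callable[[str], List[str]],
          max_pages: int = 50) -> List[str]:
    """Collect event URLs page by page until a page adds nothing new.

    `fetch_event_urls` is a caller-supplied (hypothetical) function that
    downloads one listing page and returns the event URLs found on it.
    `max_pages` is a safety cap so the loop always terminates.
    """
    collected: List[str] = []
    seen = set()
    for count, url in enumerate(page_urls(base_url, offset)):
        if count >= max_pages:
            break
        new = [u for u in fetch_event_urls(url) if u not in seen]
        if not new:
            break  # same stop rule as today: no new event URLs on this page
        seen.update(new)
        collected.extend(new)
    return collected
```

With an offset of 12, the generator produces `?start=0`, `?start=12`, `?start=24`, and so on. An offset of 1 reproduces the current one-by-one behaviour, so existing workflows would not need to change.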