culturecreates / artsdata-planet-ipaa

Indigenous Performing Arts Association - workflows for loading data into Artsdata
The Unlicense

Use Sitemap for selecting IPAA web pages #6

Closed saumier closed 8 months ago

saumier commented 9 months ago

The IPAA website has a sitemap at https://ipaa.ca/open-data.xml. This sitemap should list the pages that IPAA wants crawled and omit the pages of any person or organization that does not want to publish open data. This is very important because Artsdata must respect IPAA's wishes regarding permissions.
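
To illustrate the idea of treating the sitemap as the list of crawlable pages, here is a minimal Python sketch (the project's own scraper is a Ruby file, so this is not its implementation). It assumes the file follows the standard sitemaps.org `urlset` format, which has not been verified here; the function name `urls_from_sitemap` and the sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol (assumed to apply to the IPAA file).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the page URLs listed in <url><loc> entries of a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample content; the real entries in open-data.xml may differ.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://ipaa.ca/example-person/</loc></url>
  <url><loc>https://ipaa.ca/example-organization/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE):
        print(url)
```

Only the URLs returned by such a function would be passed to the scraper, so a page left out of the sitemap is never fetched.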

dev-aravind commented 8 months ago

@saumier The people and organizations can be obtained from the sitemap. However, scraping yields far more allies than the sitemap does (61 vs. 23). Should I proceed with the sitemap method?
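
The gap between the two sources can be made explicit by checking each scraped URL against the sitemap. The following is a minimal Python sketch under stated assumptions: the function name `split_by_permission` and the sample URLs are hypothetical, and ignoring a trailing slash when comparing URLs is an assumption about how the site's links might vary.

```python
def split_by_permission(scraped_urls, sitemap_urls):
    """Split scraped URLs into those listed in the sitemap and those that are not.

    URLs are compared with any trailing slash removed (an assumption).
    """
    allowed_set = {url.rstrip("/") for url in sitemap_urls}
    allowed, excluded = [], []
    for url in scraped_urls:
        if url.rstrip("/") in allowed_set:
            allowed.append(url)
        else:
            excluded.append(url)
    return allowed, excluded


if __name__ == "__main__":
    # Hypothetical URLs for illustration only.
    scraped = [
        "https://ipaa.ca/example-ally-a/",
        "https://ipaa.ca/example-ally-b/",
        "https://ipaa.ca/example-ally-c",
    ]
    listed = ["https://ipaa.ca/example-ally-a", "https://ipaa.ca/example-ally-c/"]
    allowed, excluded = split_by_permission(scraped, listed)
    print(len(allowed), "listed in the sitemap;", len(excluded), "found only by scraping")
```

The `excluded` list is the set of pages that scraping would pick up without the sitemap confirming that they may be published.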

saumier commented 8 months ago

@dev-aravind Let's proceed with using sitemap.xml for people and organizations.

dev-aravind commented 8 months ago

@saumier I've added a new Ruby file and a GitHub workflow to scrape using the sitemap here

saumier commented 8 months ago

@dev-aravind I am moving this issue to the artsdata-planet-ipaa repo.

saumier commented 8 months ago

@dev-aravind You can add the second SPARQL now (see my request for changes).

dev-aravind commented 8 months ago

@saumier The second SPARQL is now up in the PR.

saumier commented 8 months ago

Closing this for now. I will circle back to remove the duplicate orgs a bit later.