Use Sitemap for selecting IPAA web pages

saumier commented 9 months ago

The IPAA website has a sitemap at https://ipaa.ca/open-data.xml. This sitemap should include the pages that IPAA wants crawled, and exclude the pages that IPAA does not want crawled because the person or organization does not want to publish open data. This is very important because Artsdata must respect the wishes of IPAA regarding permissions.

[x] Check that the sitemap can be used to get a majority of webpages from the following sections:
    People: https://ipaa.ca/indigenous-artists/
    Organizations: https://ipaa.ca/indigenous-organizations/
    Allies (mixed types): https://ipaa.ca/allies-non-voting/
[x] If YES, then use the sitemap instead of crawling all entities. If NO, then carefully document the reasons in the comments below and skip the sitemap.
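
The check described above could be sketched roughly as follows in Ruby. This is only an illustration, not the scraper used in the repo: it assumes the sitemap follows the standard sitemaps.org `urlset` layout and that entity pages live under the section URLs listed above. The sample XML, the `example-*` page URLs, and the helper names `sitemap_urls` and `group_by_section` are hypothetical. A real run would fetch https://ipaa.ca/open-data.xml instead of using the inline sample.

```ruby
require "rexml/document"

# Hypothetical sample standing in for the real sitemap contents.
SAMPLE_SITEMAP = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://ipaa.ca/indigenous-artists/example-artist/</loc></url>
    <url><loc>https://ipaa.ca/indigenous-organizations/example-org/</loc></url>
    <url><loc>https://ipaa.ca/allies-non-voting/example-ally/</loc></url>
    <url><loc>https://ipaa.ca/about/</loc></url>
  </urlset>
XML

# Section index pages taken from the checklist; treating them as URL
# prefixes for entity pages is an assumption.
SECTIONS = {
  people: "https://ipaa.ca/indigenous-artists/",
  organizations: "https://ipaa.ca/indigenous-organizations/",
  allies: "https://ipaa.ca/allies-non-voting/"
}.freeze

# Collect the text of every <loc> element found under each <url> entry.
def sitemap_urls(xml)
  doc = REXML::Document.new(xml)
  urls = []
  doc.root.each_element do |url_el|
    url_el.each_element do |child|
      urls << child.text.strip if child.name == "loc"
    end
  end
  urls
end

# Keep only URLs below a section prefix, skipping the index page itself.
def group_by_section(urls, sections = SECTIONS)
  sections.transform_values do |prefix|
    urls.select { |u| u.start_with?(prefix) && u != prefix }
  end
end

grouped = group_by_section(sitemap_urls(SAMPLE_SITEMAP))
grouped.each { |section, urls| puts "#{section}: #{urls.length} page(s)" }
```

Pages outside the three sections (such as the `about` page in the sample) are dropped, so only URLs that IPAA has listed in the sitemap and that belong to a known section would be passed on to the crawler.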

dev-aravind commented 8 months ago

@saumier The people and organizations can be obtained from the sitemap. However, scraping yields many more allies than the sitemap does (61 vs. 23). Should I proceed with the sitemap method?
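
A comparison like the one above could be computed with a small set difference. The sketch below is hypothetical: the `coverage` helper and the placeholder `ally-N` URLs are made up for illustration, only the counts 61 and 23 come from this thread, and it assumes every sitemap URL also appears in the scraped list.

```ruby
require "set"

# Compare the URLs found in the sitemap with the URLs found by scraping.
def coverage(sitemap_urls, scraped_urls)
  sitemap = sitemap_urls.to_set
  scraped = scraped_urls.to_set
  shared = sitemap & scraped
  {
    in_both: shared.size,
    only_scraped: (scraped - sitemap).size,
    only_sitemap: (sitemap - scraped).size,
    # Share of scraped pages that the sitemap also lists.
    ratio: scraped.empty? ? 0.0 : shared.size.to_f / scraped.size
  }
end

# Placeholder URLs; only the counts (61 scraped, 23 in sitemap) are real.
scraped = (1..61).map { |i| "https://ipaa.ca/allies-non-voting/ally-#{i}/" }
in_sitemap = scraped.first(23)

result = coverage(in_sitemap, scraped)
puts "in both: #{result[:in_both]}, only scraped: #{result[:only_scraped]}, " \
     "ratio: #{result[:ratio].round(2)}"
```

Under these assumptions, 38 of the 61 scraped allies pages would be missing from the sitemap, which is the gap raised in the question above.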

saumier commented 8 months ago

@dev-aravind Let's proceed with using sitemap.xml for people and organizations.

dev-aravind commented 8 months ago

@saumier I've added a new Ruby file and a GitHub workflow to scrape using the sitemap here

saumier commented 8 months ago

@dev-aravind I am moving this issue to the artsdata-planet-ipaa repo.

saumier commented 8 months ago

@dev-aravind You can add the second SPARQL now (see my request for changes).

dev-aravind commented 8 months ago

@saumier The second SPARQL is now up in the PR.

saumier commented 8 months ago

Closing this for now. I will circle back to remove the duplicate orgs a bit later.

culturecreates / artsdata-planet-ipaa

Use Sitemap for selecting IPAA web pages #6