saumier closed this 8 months ago
@saumier The people and organizations can be obtained from the sitemap. However, scraping finds far more allies than the sitemap lists (61 vs. 23). Should I proceed with the sitemap method?
@dev-aravind Let's proceed with using sitemap.xml for people and organizations.
@saumier I've added a new Ruby file and a GitHub workflow to scrape using the sitemap here
@dev-aravind I am moving this issue to the artsdata-planet-ipaa repo.
@dev-aravind You can add the second SPARQL now (see my request for changes).
@saumier The second SPARQL is now up in the PR.
Closing this for now. I will circle back to remove the duplicate orgs a bit later.
The IPAA website has a sitemap https://ipaa.ca/open-data.xml. This sitemap should include pages that IPAA wants to be crawled, and exclude pages that IPAA does not want to be crawled because the person or organization does not want to publish open data. This is very important because Artsdata must respect the wishes of IPAA regarding permissions.
[x] Check that the sitemap can be used to get a majority of webpages from the following sections:
    - People: https://ipaa.ca/indigenous-artists/
    - Organizations: https://ipaa.ca/indigenous-organizations/
    - Allies (mixed types): https://ipaa.ca/allies-non-voting/
[x] If YES, then use sitemap instead of crawling all entities. If NO, then carefully document reasons in the comments below and skip sitemap.
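
For reference, the coverage check in the list above could be sketched roughly as follows. This is not the script from the PR, only a minimal illustration. It assumes that profile pages live under the three section paths, and the helper names (`sitemap_urls`, `group_by_section`, `covers_majority?`) are hypothetical. It takes the sitemap XML as a string, so fetching https://ipaa.ca/open-data.xml is left to the caller.

```ruby
require "rexml/document"
require "uri"

# Section path prefixes taken from the checklist above.
# Assumption: individual profile pages sit under these paths.
SECTIONS = {
  people: "/indigenous-artists/",
  organizations: "/indigenous-organizations/",
  allies: "/allies-non-voting/"
}.freeze

# Return the text of every <loc> element in a sitemap XML string.
def sitemap_urls(xml)
  urls = []
  REXML::Document.new(xml).each_recursive do |element|
    urls << element.text.to_s.strip if element.name == "loc"
  end
  urls
end

# Bucket URLs by the section whose path prefix they start with.
# URLs outside the three sections are ignored.
def group_by_section(urls)
  groups = SECTIONS.keys.to_h { |key| [key, []] }
  urls.each do |url|
    path = URI.parse(url).path.to_s
    section, _prefix = SECTIONS.find { |_key, prefix| path.start_with?(prefix) }
    groups[section] << url if section
  end
  groups
end

# True when the sitemap lists more than half of the pages found by scraping.
def covers_majority?(sitemap_count, scraped_count)
  sitemap_count * 2 > scraped_count
end
```

With the allies counts reported in the comments (23 in the sitemap vs. 61 from scraping), `covers_majority?(23, 61)` returns `false`, which matches the decision to use the sitemap for people and organizations only.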