culturecreates / artsdata-orion

Collection of data sources loaded into Artsdata by Culture Creates

CSRF - ETL scenesfrancophones.ca #11

Closed: saumier closed this issue 9 months ago

saumier commented 11 months ago

This project is to scrape the JSON-LD from the following lists on the website scenesfrancophones.ca:

  1. https://scenesfrancophones.ca/diffuseurs (about 131 organizations)
  2. https://scenesfrancophones.ca/artistes (about 181 artists)
  3. https://scenesfrancophones.ca/spectacles (about 56 events)

The data should be saved as JSON-LD, in individual files (one for each listing), in a GitHub repository called artsdata-planet-scenesfrancophones-ca.
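A minimal sketch of the "one file per listing" layout, using only the Python standard library. The function name `save_listing`, the `.jsonld` extension and the file names derived from the listing names (`diffuseurs`, `artistes`, `spectacles`) are assumptions for illustration, not something this issue specifies. The sketch writes the extracted items as a top-level JSON array, which is one valid shape for a JSON-LD document.

```python
import json
from pathlib import Path


def save_listing(name, items, out_dir):
    """Write the items of one listing to <out_dir>/<name>.jsonld and return the path.

    `name` is a hypothetical file stem such as "diffuseurs"; `items` is a
    list of JSON-LD objects (plain dicts) already extracted from the pages.
    """
    out_path = Path(out_dir) / f"{name}.jsonld"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps French accents readable in the saved file.
        json.dump(items, handle, ensure_ascii=False, indent=2)
        handle.write("\n")
    return out_path
```

Calling `save_listing("artistes", items, "output")` once per listing would then produce three files, one each for organizations, artists and events.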

### Tasks
- [ ] https://github.com/culturecreates/artsdata-planet-scenesfrancophones/issues/6
- [ ] https://github.com/culturecreates/artsdata-planet-scenesfrancophones/issues/7
- [ ] https://github.com/culturecreates/artsdata-planet-scenesfrancophones/issues/8

saumier commented 11 months ago

@dev-aravind Hi. I understand that you are looking at Python. The first step is to get the list of webpages: this step will produce a list of URLs, one for each webpage.
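As a rough sketch of this first step, the snippet below collects links from the HTML of a listing page using only the Python standard library. It assumes that detail pages live under the listing path (for example a hypothetical `/diffuseurs/some-slug`); that URL pattern has not been checked against the real site, and the names `LinkCollector` and `detail_urls` are made up for illustration. Fetching the HTML and handling any pagination are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def detail_urls(listing_html, listing_url):
    """Return de-duplicated URLs found below the listing path, in page order.

    Assumption: detail pages sit under the listing URL, e.g. /diffuseurs/<slug>.
    """
    collector = LinkCollector(listing_url)
    collector.feed(listing_html)
    collector.close()
    prefix = listing_url.rstrip("/") + "/"
    seen = []
    for url in collector.links:
        # Drop query strings and fragments so the same page is not listed twice.
        clean = urlparse(url)._replace(query="", fragment="").geturl()
        if clean.startswith(prefix) and clean not in seen:
            seen.append(clean)
    return seen
```

The result is a plain list of absolute URLs that can be handed to the second step.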

The second step is to get the structured data from each webpage (sometimes called 'linked data' or 'RDF'). This second step should use a library that already knows how to extract structured data because there are too many different considerations for us to code ourselves. I cannot stress this enough.

Some of the considerations are messy HTML with invalid code and missing tags. Another consideration is that the JSON-LD on the webpage may be in several different places on the page.
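To illustrate just one of these considerations: a single page can carry more than one `<script type="application/ld+json">` block, and a block may hold either one object or an array. The standard-library sketch below gathers every such block from an HTML string. It is not the library-based extraction recommended above and it ignores other structured data formats such as Microdata and RDFa; the names `JsonLdBlocks` and `jsonld_items` are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdBlocks(HTMLParser):
    """Gather the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def jsonld_items(html):
    """Return all JSON-LD objects found in the page as one flat list."""
    parser = JsonLdBlocks()
    parser.feed(html)
    parser.close()
    items = []
    for text in parser.blocks:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        # A block may contain a single object or an array of objects.
        items.extend(data if isinstance(data, list) else [data])
    return items
```

Even this small case needs several branches, which is part of the argument for relying on an existing extraction library rather than hand-written parsing.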