Closed by saumier 9 months ago
@dev-aravind Hi. The first step is to get the list of webpages. I understand you are looking at Python. This step produces a list of webpage URLs.
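As a minimal sketch of this first step, here is one way to collect listing URLs from a fetched listing page using only the standard library. The sample HTML and the `/listing/...` paths are made up for illustration; the real page structure on scenesfrancophones.ca will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against the page's base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin turns relative links into absolute URLs
                self.urls.append(urljoin(self.base_url, href))

# Hypothetical listing-page fragment; in practice this comes from an HTTP fetch
sample = '<ul><li><a href="/listing/1">One</a></li><li><a href="/listing/2">Two</a></li></ul>'
collector = LinkCollector("https://scenesfrancophones.ca/")
collector.feed(sample)
print(collector.urls)
```

In a real run you would fetch each listing page (e.g. with `urllib.request` or `requests`), feed its HTML to the collector, and filter the collected URLs down to the listing detail pages you care about.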
The second step is to extract the structured data from each webpage (sometimes called 'linked data' or 'RDF'). This step should use a library that already knows how to extract structured data, because there are too many edge cases for us to code ourselves. I cannot stress this enough.
Some of those considerations: messy HTML with invalid code and missing tags, and JSON-LD that may appear in several different places on the page (for example, in multiple script blocks).
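To make the "several different places" point concrete, here is a stdlib-only sketch of the happy path: pulling every `<script type="application/ld+json">` block out of a page. It assumes well-formed HTML and valid JSON, which is exactly what you cannot assume in the wild; that gap is why a dedicated extraction library is recommended above.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Parse the contents of every <script type="application/ld+json"> block."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            # Naive: assumes each block is one well-formed JSON document
            self.blocks.append(json.loads(data))

# Hypothetical page with two JSON-LD blocks (a common real-world layout)
html_doc = """
<html><head>
<script type="application/ld+json">{"@type": "Event", "name": "Concert"}</script>
<script type="application/ld+json">{"@type": "Place", "name": "Salle A"}</script>
</head><body></body></html>
"""
extractor = JsonLdExtractor()
extractor.feed(html_doc)
print([b["@type"] for b in extractor.blocks])
```

A library built for this (extruct is one commonly used in Python) additionally copes with broken markup, HTML entities inside the JSON, and structured data expressed as RDFa or Microdata rather than JSON-LD.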
This project is to scrape the JSON-LD from the following lists on the website scenesfrancophones.ca:
The data should be saved as individual files (one per listing) in JSON-LD format in a GitHub repository called artsdata-planet-scenesfrancophones-ca.
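For the one-file-per-listing layout, a small helper along these lines would do; the filename scheme (URL path slug plus a `.jsonld` extension) and the output directory name are my assumptions, not a project convention.

```python
import json
from pathlib import Path
from urllib.parse import urlparse

def save_listing(url, jsonld, out_dir):
    """Write one listing's JSON-LD to its own file, named after the URL path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Turn e.g. "/listing/1" into "listing-1"; assumed naming scheme
    slug = urlparse(url).path.strip("/").replace("/", "-") or "index"
    path = out_dir / f"{slug}.jsonld"
    path.write_text(json.dumps(jsonld, ensure_ascii=False, indent=2),
                    encoding="utf-8")
    return path

p = save_listing("https://scenesfrancophones.ca/listing/1",
                 {"@type": "Event", "name": "Concert"}, "out")
print(p.name)
```

The resulting directory of `.jsonld` files can then be committed to the repository as-is, which keeps each listing's diff history independent.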