At the SSH SIG meeting of 20 June, it emerged that four recent (or still running) projects, led by Reggie, Kody, Flavio, and Olga, involve web scraping of several kinds, and that yet another project (FIRST, led by Laura) uses pre-scraped data.
Nor is this a completely new phenomenon: ten years ago we scraped a newspaper dataset from the KB (the Dutch royal library) and used it in many projects. It seems an especially SSH-y topic, but it was also relevant for deep learning when that field was new (e.g. we scraped car images for project Sherlock).
Given all this, it may make sense to devote some words to how to do this task well. One could treat it shallowly (just describe the scraping tools and techniques we have experience with) or somewhat more deeply (e.g. how to go from raw scraping to a clean, shareable, open dataset). I think it would make a nice addition to the Dataset chapter.
The Turing Way only mentions scraping in passing (here).
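To illustrate the kind of content such a section could cover, here is a minimal sketch of the scrape-then-clean step using only the Python standard library. The HTML structure, class names, and field names are hypothetical; a real scraper would typically fetch pages with a library like requests and parse them with BeautifulSoup, but the shape of the pipeline (parse markup, strip whitespace, collect tidy records ready for export) is the same.

```python
# Hypothetical example: turn scraped HTML into clean, shareable records.
# Standard library only; markup and field names are made up for illustration.
from html.parser import HTMLParser
import json

RAW_HTML = """
<div class="article"><h2> First headline </h2><span class="date">2024-06-20</span></div>
<div class="article"><h2>Second headline</h2><span class="date">2024-06-21</span></div>
"""

class ArticleParser(HTMLParser):
    """Collects (headline, date) pairs from the hypothetical markup above."""
    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # which field the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2":
            self._field = "headline"
        elif tag == "span" and attrs.get("class") == "date":
            self._field = "date"

    def handle_data(self, data):
        text = data.strip()  # cleaning step: drop stray whitespace
        if not text or self._field is None:
            return
        if self._field == "headline":
            self.records.append({"headline": text})
        elif self._field == "date" and self.records:
            self.records[-1]["date"] = text
        self._field = None

parser = ArticleParser()
parser.feed(RAW_HTML)
# The cleaned records are plain dicts, ready to dump as JSON or CSV
# alongside a data dictionary and licence for open sharing.
print(json.dumps(parser.records, indent=2))
```

The deeper treatment proposed above would then continue from records like these to documentation, licensing, and deposit in a repository.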