anidata / palantiri

Web crawler to collect data on ht
MIT License

Web scraper redesign #4

Closed bmenn closed 7 years ago

bmenn commented 7 years ago

To support adding more sites to our data collection, the scraper should be redesigned to handle generic web pages with minimal processing. Any processing of a page should happen post hoc and be handled in the anidata/ht-etl repo. We need data from other sites because BackPage's adult services section is now down, and BackPage does not maintain a long history.
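A rough sketch of what "minimal processing" could look like, for discussion only: the scraper stores each page body unparsed alongside a little metadata, and all extraction is left to a later ETL step. The function names (`make_record`, `save_record`, `fetch`) and the record fields are hypothetical and are not taken from palantiri, ht-etl, or rasp.

```python
import hashlib
import json
import time
from pathlib import Path
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a page and return its raw bytes; no parsing happens here."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def make_record(url, body, fetched_at=None):
    """Wrap an unparsed page body in a small metadata envelope (hypothetical layout)."""
    return {
        "url": url,
        "fetched_at": time.time() if fetched_at is None else fetched_at,
        "sha256": hashlib.sha256(body).hexdigest(),
        # Undecodable bytes are replaced rather than raising, so any page can be stored.
        "body": body.decode("utf-8", errors="replace"),
    }


def save_record(record, out_dir):
    """Write one record as JSON, named by content hash so identical pages overwrite each other."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / (record["sha256"] + ".json")
    path.write_text(json.dumps(record), encoding="utf-8")
    return path
```

Under these assumptions, something like `save_record(make_record(url, fetch(url)), "raw_pages")` would be the whole per-page job for the scraper, and site-specific parsing would read the saved JSON files later.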

Tasks:

bmenn commented 7 years ago

Addressed by the creation of anidata/rasp.