CivicActions / edscrapers

US Department of Education Data Scraping Kit; see https://us-ed-scraping.ckan.io/dataset
GNU Affero General Public License v3.0

REMS scraper #195

Closed nightsh closed 4 years ago

nightsh commented 4 years ago

Needs:

nightsh commented 4 years ago

Example of a URL pattern to avoid (it yields duplicate pages, and the middle segment is irrelevant to the content):

https://rems.ed.gov/(X(1)S(k5blkmitj1yr3waw5as5nlj4))/Resources_Hazards_Threats_Natural_Hazards.aspx

Here the /(X(1)S(k5blkmitj1yr3waw5as5nlj4)) part needs to be dropped or ignored. The best place to do this is probably in the crawler, in the allowed_regex for the LinkExtractor.
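A minimal sketch of one way to drop that segment, assuming it always has the shape of one or more `LETTER(...)` groups wrapped in parentheses, as in the example URL above. The names `SESSION_SEGMENT` and `strip_session_segment` are hypothetical and not part of the existing scraper code:

```python
import re

# Assumed shape of the segment to drop: "/(" followed by one or more
# "LETTER(value)" groups, then ")". Example: "/(X(1)S(k5blkmitj1yr3waw5as5nlj4))".
SESSION_SEGMENT = re.compile(r"/\((?:[A-Z]\([^()]*\))+\)")


def strip_session_segment(url: str) -> str:
    """Return the URL with any session-like path segment removed."""
    return SESSION_SEGMENT.sub("", url)


if __name__ == "__main__":
    raw = (
        "https://rems.ed.gov/(X(1)S(k5blkmitj1yr3waw5as5nlj4))"
        "/Resources_Hazards_Threats_Natural_Hazards.aspx"
    )
    print(strip_session_segment(raw))
```

If the crawler uses Scrapy's `LinkExtractor`, a function like this could probably be passed as its `process_value` argument so that links are normalized before they are deduplicated. Whether that fits better than adjusting the allowed regex depends on how the crawler is wired up.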

higorspinto commented 4 years ago

Ready for review.