DataONEorg / mnlite

Lightweight read-only DataONE member node in Python Flask
Apache License 2.0

Allow url pattern matching to be set from `settings.json` #50

Closed iannesbitt closed 11 months ago

iannesbitt commented 11 months ago

NSIDC has a lot of extraneous pages that don't need to be crawled, while their datasets all have the phrase "/versions/" in the URL. We should be able to set `"url_match": "/versions/"` in `settings.json` to tell the spider which pages are acceptable to crawl. Users should also be able to set this as a list of patterns.
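
A minimal sketch of how this could behave, not the actual mnlite implementation. It assumes `url_match` is read from the node's `settings.json`, may be either a single string or a list of strings, and is applied as a plain substring test. The helper names `load_url_patterns` and `url_allowed` and the example URLs are hypothetical.

```python
import json


def load_url_patterns(settings_text):
    """Return the "url_match" setting as a list of strings.

    Accepts a single string or a list; a missing key yields an empty list.
    (Hypothetical helper; the key name comes from this issue.)
    """
    settings = json.loads(settings_text)
    value = settings.get("url_match")
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def url_allowed(url, patterns):
    """Return True if the URL contains any pattern as a substring.

    An empty pattern list means no filtering, so every URL is allowed.
    """
    if not patterns:
        return True
    return any(pattern in url for pattern in patterns)


# Example settings fragment using the value proposed in this issue.
patterns = load_url_patterns('{"url_match": "/versions/"}')

# Placeholder URLs for illustration only.
candidates = [
    "https://example.org/data/abc/versions/1",
    "https://example.org/about",
]
to_crawl = [u for u in candidates if url_allowed(u, patterns)]
```

With the list form, `{"url_match": ["/versions/", "/data/"]}` would accept a URL that contains either string. Substring matching is the simplest reading of the request; regular expressions would be an alternative if more precise patterns are needed.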

iannesbitt commented 11 months ago

This has been tested on the NSIDC corpus and is working.