overview

Attempts to create a sitemap for www.epa.gov

demo

The official web URL is www3.epa.gov, but it is not useful for crawling.

The files listed below have been added to a copy of this repo:

https://github.com/c4software/python-sitemap.git

README_ORIGINAL.md renamed from README.md

config.json

config.py

crawler.py

main.py

www.epa.gov.xml (first 6 lines are for sitemap xml schema)

www.epa.gov.csv (first line is header row)

cleaned_nofiles.txt (for astrid by request)

1) clone this repo

git clone http://github.com/openciti/epa

2) enter repo directory

cd epa

3) run main.py specifying a website and output file of your choice. example:

python3 main.py --domain https://www.epa.gov --output www.epa.gov.xml

3a) TODO pass the file names above to the scripts instead of hard coding them. The other scripts currently assume you ran exactly the command shown in step 3)

4) the above output has been saved here for your convenience:

4a) optional (for neo4j guys). convert to csv:

python3 tocsv.py

5) the above output has been saved here for your convenience:

6) optional. strip out links to files

./nofiles.sh

7) the above output has been saved here for your convenience:
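As a rough illustration of the TODO in step 3a, the sketch below shows one way the conversion in step 4a and the filtering in step 6 could take their file names as arguments instead of hard coding them. It is not the code in tocsv.py or nofiles.sh. The flag names (--input, --output, --no-files), the "url" header, and the list of file extensions are all assumptions made for this example, and it assumes the crawler output follows the usual sitemap layout with one `<loc>` element per URL.

```python
import argparse
import csv
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical list of extensions treated as "files" rather than pages.
FILE_EXTENSIONS = (".pdf", ".zip", ".doc", ".docx", ".xls", ".xlsx", ".jpg", ".png")


def read_urls(xml_path):
    """Return the text of every <loc> element in a sitemap XML file."""
    tree = ET.parse(xml_path)
    urls = []
    for element in tree.iter():
        # Drop any "{namespace}" prefix so the match works with or without one.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


def is_file_link(url):
    """True when the URL path ends in one of FILE_EXTENSIONS (case-insensitive)."""
    return urlparse(url).path.lower().endswith(FILE_EXTENSIONS)


def write_csv(urls, csv_path):
    """Write one URL per row under a single header row (header name is assumed)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url"])
        writer.writerows([url] for url in urls)


def main(argv=None):
    """Parse the (hypothetical) flags, convert the sitemap, return the row count."""
    parser = argparse.ArgumentParser(description="Convert a sitemap XML file to CSV.")
    parser.add_argument("--input", default="www.epa.gov.xml")
    parser.add_argument("--output", default="www.epa.gov.csv")
    parser.add_argument("--no-files", action="store_true", help="drop links to files")
    args = parser.parse_args(argv)

    urls = read_urls(args.input)
    if args.no_files:
        urls = [url for url in urls if not is_file_link(url)]
    write_csv(urls, args.output)
    return len(urls)
```

Calling `main(["--input", "www.epa.gov.xml", "--output", "www.epa.gov.csv", "--no-files"])` would then reproduce steps 4a and 6 in one pass, with the defaults matching the file names used in step 3.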