Attempts to create a sitemap for www.epa.gov
link to tree view (slow loading)
[screenshot](https://github.com/openciti/epa/raw/master/tree.png)
the official web url is www3.epa.gov, but that host is not useful for crawling
files have been added to this repo:
https://github.com/c4software/python-sitemap.git
README_ORIGINAL.md renamed from README.md
config.json
config.py
crawler.py
main.py
www.epa.gov.xml (first 6 lines are for sitemap xml schema)
www.epa.gov.csv (first line is header row)
cleaned_nofiles.txt (for astrid by request)
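
For anyone who wants to work with the xml and csv files directly, here is a minimal sketch of reading a sitemap and writing a one-column csv. It is not the code in tocsv.py. It assumes the xml uses the standard sitemaps.org namespace (one `<url><loc>` entry per page), and the `url` column name, the function names, and the sample page are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the standard sitemaps.org namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def urls_to_csv(urls, out):
    """Write URLs to a file-like object, one per row, after a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["url"])  # hypothetical column name
    for url in urls:
        writer.writerow([url])


if __name__ == "__main__":
    # Tiny made-up sample; the second page is a placeholder, not a real crawl result.
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://www.epa.gov/</loc></url>"
        "<url><loc>https://www.epa.gov/example-page</loc></url>"
        "</urlset>"
    )
    buffer = io.StringIO()
    urls_to_csv(sitemap_urls(sample), buffer)
    print(buffer.getvalue(), end="")
```

To run it on a real file, read the file's text and pass it to `sitemap_urls`, then open an output file with `newline=""` and pass it to `urls_to_csv`.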
1) clone this repo
git clone http://github.com/openciti/epa
2) enter repo directory
cd epa
3) run main.py specifying a website and output file of your choice. example:
python3 main.py --domain https://www.epa.gov --output www.epa.gov.xml
3a) TODO: pass the above file values to the scripts instead of hard-coding them. The other scripts assume you ran exactly the command in step 3)
4) the above output has been saved here for your convenience:
4a) optional (for neo4j guys). convert to csv:
python3 tocsv.py
5) the above output has been saved here for your convenience:
6) optional. strip out links to files
./nofiles.sh
7) the above output has been saved here for your convenience:
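
If you would rather do step 6 in Python, here is a rough sketch of one way to strip links to files. It is not what nofiles.sh does. It assumes a "link to a file" is any url whose path ends in a known file extension, and the extension list and function names are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical extension list; adjust to whatever counts as a "file" for you.
FILE_EXTENSIONS = {
    ".pdf", ".zip", ".csv", ".xls", ".xlsx",
    ".doc", ".docx", ".jpg", ".png",
}


def is_file_link(url):
    """True if the URL path (ignoring query and fragment) ends in a file extension."""
    path = urlsplit(url).path
    _, ext = posixpath.splitext(path)
    return ext.lower() in FILE_EXTENSIONS


def strip_file_links(urls):
    """Keep only the URLs that do not look like links to files."""
    return [url for url in urls if not is_file_link(url)]
```

Only the path is checked, so a page url with a file name in its query string (for example `?file=a.pdf`) is kept.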