edgi-govdata-archiving / guides

Technical guides for how to preserve and hold data
https://edgi-govdata-archiving.github.io/guides/

Update guide template #17

Open dcwalk opened 7 years ago

dcwalk commented 7 years ago

After working through the process with a guide (#16), it looks like the template needs to be updated. This is what was used in that PR, and these fields should be in every guide's YAML front matter:

```
---
title: "Understanding the Internet Archive Web Crawler"
permalink: guide/internet-archive-crawler/
excerpt: "In this document we explain what Heritrix can do, why it needs our help, and also how to identify documents and datasets that Heritrix can’t reach."
author: Matt Price
date: 2016-12-20
modifiedDate: 2017-02-14
layout: single
---

The Internet Archive's [End of Term nomination tool](http://digital2.library.unt.edu/nomination/eth2016/) asks the IA web crawler, [Heritrix](http://crawler.archive.org/), to start crawling from URLs and go 2-3 clicks down in a website’s link hierarchy.

This guide explains what Heritrix can do, why it needs our help, and how to identify documents and datasets that Heritrix **can’t** reach.

<!-- This will add a Table of Contents -->
{% include toc %}

## How a Web Crawler Works
```

dcwalk commented 7 years ago

`modifiedDate` is optional and should only be added when the guide has been updated.
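
If it helps when updating the template, here is a minimal sketch of a check that a guide's front matter carries the fields listed above, treating `modifiedDate` as optional. The function names (`front_matter_keys`, `check_front_matter`) and the split into required and optional fields are assumptions made for illustration, not part of the repo. It also reads only top-level `key: value` lines instead of parsing full YAML, so a real check would probably want a proper YAML parser.

```python
# Fields taken from the front matter shown in this issue.
# Treating everything except modifiedDate as required is an assumption.
REQUIRED_FIELDS = ("title", "permalink", "excerpt", "author", "date", "layout")
OPTIONAL_FIELDS = ("modifiedDate",)


def front_matter_keys(text):
    """Return the top-level keys found between the first two '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no front matter block at the top of the file
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys  # reached the closing delimiter
        # Skip blanks, comments, and indented (nested) lines.
        if not line or line[0].isspace() or line.startswith("#"):
            continue
        if ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return []  # closing '---' never found


def check_front_matter(text):
    """Return (missing required fields, unrecognised fields)."""
    keys = front_matter_keys(text)
    missing = [f for f in REQUIRED_FIELDS if f not in keys]
    unknown = [k for k in keys if k not in REQUIRED_FIELDS + OPTIONAL_FIELDS]
    return missing, unknown


example = """---
title: "Understanding the Internet Archive Web Crawler"
permalink: guide/internet-archive-crawler/
author: Matt Price
date: 2016-12-20
layout: single
---
Body text.
"""

missing, unknown = check_front_matter(example)
print(missing)  # ['excerpt']
print(unknown)  # []
```

A guide that omits `modifiedDate` passes this check, which matches the note above that the field is only added after an update.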