datatogether / sentry

Parallelized web crawler written in Golang
GNU Affero General Public License v3.0

Heritrix parity #11

Open b5 opened 7 years ago

b5 commented 7 years ago

We should run Heritrix against a set of URLs, save the output, and write a set of tests that check sentry's output against the Heritrix output as a form of ground truth.

b5 commented 7 years ago

As a note, this should only apply to files deemed relevant. I don't think we're concerned about things like directory structure or configuration files.
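
A rough sketch of what such a parity check could look like, assuming both crawlers' output has already been reduced to a plain list of captured URLs. The function name `diffURLs` and the `relevant` predicate are hypothetical, not part of sentry or Heritrix; the predicate stands in for whatever rule ends up deciding which files count.

```go
package main

import "sort"

// diffURLs compares a ground-truth URL list (e.g. extracted from a Heritrix
// crawl) with the list produced by the crawler under test.
// Hypothetical helper for illustration; not part of sentry's codebase.
//
// relevant decides which URLs take part in the comparison, so that entries
// we don't care about are ignored on both sides.
//
// It returns the URLs present only in truth (missing) and the URLs present
// only in got (extra), each sorted for stable test output.
func diffURLs(truth, got []string, relevant func(string) bool) (missing, extra []string) {
	// Collect the relevant ground-truth URLs.
	want := make(map[string]bool)
	for _, u := range truth {
		if relevant(u) {
			want[u] = true
		}
	}

	// Walk the crawler's output, skipping irrelevant and duplicate entries.
	seen := make(map[string]bool)
	for _, u := range got {
		if !relevant(u) || seen[u] {
			continue
		}
		seen[u] = true
		if !want[u] {
			extra = append(extra, u)
		}
	}

	// Anything in the ground truth that was never seen is missing.
	for u := range want {
		if !seen[u] {
			missing = append(missing, u)
		}
	}

	sort.Strings(missing)
	sort.Strings(extra)
	return missing, extra
}
```

A test would then fail if either returned slice is non-empty, printing the differing URLs. Comparing response bodies (for example by hash) could be layered on top in the same way, but that depends on how each crawler stores its captures.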