VertNet / gulo

Shredding Darwin Core Archives with ferocity, strength, and Cascalog.

Fixed IPT resource page parsing to get record count #104

Closed: robinkraft closed this 11 years ago

robinkraft commented 11 years ago

Addresses #103

eightysteele commented 11 years ago

Nice! #shipit

On Sep 10, 2013 6:27 PM, "Robin Kraft" notifications@github.com wrote:

Addresses #103 https://github.com/VertNet/gulo/issues/103

You can merge this Pull Request by running

git pull https://github.com/VertNet/gulo feature/fix-103

Or view, comment on, or merge it at:

https://github.com/VertNet/gulo/pull/104

Commit Summary

  • Fixed IPT resource page parsing to get record count

File Changes

  • M src/clj/gulo/harvest.clj (4) https://github.com/VertNet/gulo/pull/104/files#diff-0
  • M test/gulo/harvest_test.clj (4) https://github.com/VertNet/gulo/pull/104/files#diff-1


robinkraft commented 11 years ago

d95e7e1 includes a parser update to support multiple versions of IPT. It assumes that the record count can be extracted from a string that looks something like "download (17 MB) 505,538 records": once any extra whitespace is removed, you just need to grab the number before "records". This has been tested successfully against all resources currently in the resource_staging CartoDB table.

Well, these two haven't been tested because the server seems to be down:

http://specify.thomasmore.edu:8080/ipt/resource.do?r=cmc_herpetology_vouchers
http://specify.thomasmore.edu:8080/ipt/resource.do?r=cmc_ornithology_grc

@laurarussell do you have any idea what's going on with this server? Is the address correct?
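
For readers following along, the parsing assumption described above (collapse extra whitespace, then take the number immediately before "records") can be sketched roughly as follows. This is an illustrative Python sketch only, not the Clojure code in src/clj/gulo/harvest.clj; the function name `parse_record_count` and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical pattern: a run of digits and thousands separators
# followed by whitespace and the word "records".
_COUNT_BEFORE_RECORDS = re.compile(r"(\d[\d,]*)\s+records\b")


def parse_record_count(text):
    """Return the integer that precedes 'records', or None if there is none.

    Hypothetical helper for illustration; it only mirrors the assumption
    stated in the comment above and is not taken from the gulo source.
    """
    # Collapse any run of whitespace (newlines, tabs, repeated spaces)
    # into a single space so the pattern sees one clean line.
    normalized = " ".join(text.split())
    match = _COUNT_BEFORE_RECORDS.search(normalized)
    if match is None:
        return None
    # Drop thousands separators before converting to an integer.
    return int(match.group(1).replace(",", ""))
```

With the example string from the comment, `parse_record_count("download (17 MB) 505,538 records")` would yield the integer 505538; the "17" in the file size is skipped because it is not followed by "records". Returning `None` when nothing matches is a design choice of this sketch, so a caller can tell "no count found" apart from a count of zero.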

laurarussell commented 11 years ago

Yes, I mentioned in another thread the other day that CMC was down. I've notified Herm Mays. I'll let you know when he responds.