VertNet / gulo

Shredding Darwin Core Archives with ferocity, strength, and Cascalog.

Update DwCA reader #122

Closed tucotuco closed 9 years ago

tucotuco commented 10 years ago

In gulo we use the dwca-reader-clj wrapper (https://github.com/VertNet/dwca-reader-clj/blob/develop/src/clj/dwca/core.clj) around the GBIF Darwin Core Archive reader.

The goal is to get gulo using the latest DwC-A code base and calling the correct openArchive method from that code (see https://github.com/VertNet/gulo/issues/116). Specifically, we need to:

1) Do a quick code walkthrough of core.clj.
2) Understand how the Java is invoked.
3) Update gulo to use the reader from GBIF (https://github.com/gbif/dwca-reader/).
4) Make sure the method used to read the archive is the one that takes a temp directory to work in, not the one that takes the archive as its single argument.
5) Ensure that harvest generates Simple Darwin Core plus the info from the CartoDB resource table.
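To make item 4 concrete, here is a minimal sketch of the general idea of unpacking an archive into a working directory chosen by the caller. It uses only the JDK and does not call the GBIF reader or dwca-reader-clj; the class name `DwcaUnpackSketch` and the method `unpack` are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: not part of gulo, dwca-reader-clj, or the GBIF reader.
final class DwcaUnpackSketch {

    // Unzips archiveFile into workDir (created if missing) and returns the
    // absolute, normalized working directory that now holds the contents.
    static Path unpack(Path archiveFile, Path workDir) throws IOException {
        Files.createDirectories(workDir);
        Path root = workDir.toAbsolutePath().normalize();
        try (InputStream in = Files.newInputStream(archiveFile);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Refuse entries that would land outside the working directory.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes working directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        return root;
    }
}
```

A caller could pass a directory from `Files.createTempDirectory(...)` as `workDir`, so the unpacked files never land next to the original archive and can be cleaned up independently.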

robinkraft commented 10 years ago

For the 4th item in my previous comment, look at my comment on #116. I think the archive reader problem is actually hitting a case that's not supported by the GBIF DWCA reader.

tucotuco commented 10 years ago

For the fifth item, GBIF has made the update.

Addressed in http://dev.gbif.org/issues/browse/POR-2395

https://github.com/gbif/dwca-reader/commit/903d10236b3b2cda46a0d2b3e994e0ce658c328e

robinkraft commented 10 years ago

Awesome! Have they released a new snapshot?

tucotuco commented 9 years ago

I believe the new snapshot is at http://repository.gbif.org/content/repositories/snapshots/org/gbif/dwca-reader/1.19-SNAPSHOT/

tucotuco commented 9 years ago

Branch develop set to use

https://clojars.org/dwca-reader-clj/versions/0.10.1-SNAPSHOT

which in turn uses

http://repository.gbif.org/content/repositories/snapshots/org/gbif/dwca-reader/1.20-SNAPSHOT/. This SNAPSHOT solves the issue of missing Dublin Core fields and introduces the Darwin Core changes as of 2014-10-30 (see http://rs.tdwg.org/dwc/terms/history/decisions/index.htm), which add the Organism terms and deprecate the Dublin Core term "rights" in favor of "license".

Branch dwc2013 created so that the old-style harvest, used in the portal indexing as of 2014-12-22, can still be run easily using

https://clojars.org/dwca-reader-clj/versions/0.8.0-SNAPSHOT

which in turn used

http://repository.gbif.org/content/repositories/snapshots/org/gbif/dwca-reader/1.9.1-SNAPSHOT/.