VertNet / gulo

Shredding Darwin Core Archives with ferocity, strength, and Cascalog.

Unable to harvest DwCA zip file with extensions #116

Closed tucotuco closed 9 years ago

tucotuco commented 10 years ago

Plenty of background coming here...

Hi guys!

John has been working very hard on our harvesting and indexing for our portal and we're very close to being able to refresh our portal with a new set of indexed data, but we've been having issues with self-hosted institutions serving DwCA files with extensions. We currently have 8 resources that we can't harvest. I provide the details below.

At first, we just had issues with two DanBIF resources. They have the IPT Windows bug where the filenames begin with a slash, so we have been ignoring those two resources when we harvest. But more recently we encountered issues with Sam Noble resources. They too have the IPT Windows bug. However, resources with a slash-prefixed occurrence.txt do harvest for us; they only fail when there is a second txt file in the archive. Looking back at the DanBIF resources, this holds true: both of those had a second txt file. But today we saw OSUM files failing. These files have a second txt file, but they do not have the IPT Windows bug (no slash in the file names).

The big puzzler, though, is that every Arctos resource (hosted on our VertNet IPT) does have a second file (image.txt) within each of the 40 archives. These all harvest successfully. Because the OSUM files do not have the IPT Windows bug, I would have expected them to harvest successfully. But since they didn't, and because files with the IPT Windows bug that contain only one txt file do harvest, it doesn't seem like the IPT Windows bug is necessarily what is causing the harvest to fail. I did see this on the Darwin Core issues list, https://code.google.com/p/darwincore/issues/detail?id=95, but again, we are able to harvest all of those Arctos resources containing image.txt.

In the case of DanBIF, the second file is \image.txt. In the case of SMONMH, their mammals file has \resourcerelationship.txt and \measurementorfact.txt, and their tissue file has \measurementorfact.txt. In the case of OSUM, the second file is occurrence_images.txt.

Each of the IPTs (DanBIF, SMONMH, OSUM and VertNet) is using IPT Version 2.0.5-r4398-security-update-1.

The error message received when these archives fail to harvest is:

"Downloading records from http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-amphibians"
"Error harvesting" "http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-amphibians"
"The archive given is a folder with more or less than 1 data files having a txt or csv suffix"
"ERROR: Resource http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-amphibians (The archive given is a folder with more or less than 1 data files having a txt or csv suffix)"

We really want to be able to harvest all of the occurrence data from these files; we're not currently using any of the extension data. I believe Kyle had indicated that GBIF had instituted a workaround to handle harvesting when archives contained the IPT Windows bug, but have you encountered this issue with extensions other than image.txt? Do you use the Darwin Core Reader for the GBIF harvester? We'd appreciate any guidance in figuring this out. I'm happy to post an issue on one of the sites, but I couldn't figure out whether it should go on IPT or Darwin Core, or whether there is something we need to modify in our own code; since we rely on the dwca-reader, though, we didn't think so.

Laura Russell

Hi Laura, John

This is wild speculation, but that error suggests misuse of the DwC-A Reader: "The archive given is a folder with more or less than 1 data files having a txt or csv suffix". There are several openArchive methods. Compare these:

Archive archive = ArchiveFactory.openArchive(FileUtils.getClasspathFile("/tmp/occurrence.csv.gz"));
Archive archive = ArchiveFactory.openArchive(FileUtils.getClasspathFile("/tmp/archiveFolder"));
Archive archive = ArchiveFactory.openArchive(FileUtils.getClasspathFile("/tmp/archive.zip"), new File(System.getProperty("java.io.tmpdir")));

The first takes a zipped single file (no extensions) or a directory (which can have extensions); the third takes an archive file (which can have extensions) and a temporary directory into which it can be extracted. If you pass a zip file with extensions to the first two examples, I believe it will give the kind of errors you list below. We only use the ArchiveReader for reading archives.
The code you are running is: https://code.google.com/p/darwincore/source/browse/trunk/dwca-reader/src/main/java/org/gbif/dwc/text/ArchiveFactory.java#262
It sounds like you are calling the single-argument method with an archive, but you should be passing a temporary directory it can work in, like so:

Archive archive = ArchiveFactory.openArchive(FileUtils.getClasspathFile("/tmp/archive.zip"), new File(System.getProperty("java.io.tmpdir")));
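From the Clojure side, that two-argument call would look roughly like this (a sketch only, with placeholder paths, assuming the dwca-reader described above is on the classpath):

(import '(org.gbif.dwc.text ArchiveFactory)
        '(java.io File))

;; Open a zipped archive (extensions allowed) by also giving the reader a
;; scratch directory it can extract into (the two-argument openArchive).
(let [zipped  (File. "/tmp/archive.zip")                      ; placeholder path
      workdir (File. (System/getProperty "java.io.tmpdir"))]  ; extraction target
  (ArchiveFactory/openArchive zipped workdir))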

Please let us know if this fixes things or not. Have a nice weekend, Tim

Looking at your Clojure wrapper you have:

download: https://github.com/VertNet/dwca-reader-clj/blob/develop/src/clj/dwca/core.clj#L62
unzip: https://github.com/VertNet/dwca-reader-clj/blob/develop/src/clj/dwca/core.clj#L70
get-records: https://github.com/VertNet/dwca-reader-clj/blob/develop/src/clj/dwca/core.clj#L76

Could it be that you are failing to call unzip and then calling get-records on the zipped file rather than the unzipped folder? That would do it…
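For what it is worth, the intended flow would be something like the following (a hypothetical sketch; download, unzip, and get-records are the dwca-reader-clj functions linked above, but their exact argument lists here are assumptions):

(require '[dwca.core :as dwca])  ; dwca-reader-clj namespace; arglists below are assumed

;; Fetch the archive, unzip it, and only then hand the *unzipped folder*
;; to get-records. Passing the .zip itself to get-records would reproduce
;; the "more or less than 1 data files" error.
(let [zip-path    (dwca/download "http://example.org/ipt/archive.do?r=some-resource" ; placeholder URL
                                 "/tmp/some-resource.zip")
      archive-dir (dwca/unzip zip-path "/tmp/some-resource")]
  (dwca/get-records archive-dir))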

Hi John,

Can you zip up your file system and send it to me after this bit is run?

https://github.com/VertNet/dwca-reader-clj/blob/develop/src/clj/dwca/core.clj#L89

The error suggests it is not unzipping correctly, or that (get-records archive-path) is getting the wrong archive-path somehow.

The DwC-A reader can open those archives, as I just checked with the validator on tools.gbif.org.

Cheers, Tim

tucotuco commented 10 years ago

OSUM Birds (http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-birds) is the only one currently manifesting this problem.

robinkraft commented 10 years ago

Rather than try to debug Aaron's DWCA reader, I've gone back to the Java code, invoking it directly to open the problem DWCAs. It seems the problem actually lies with the GBIF DWCA reader.

This is my test case:

http://hymfiles.biosci.ohio-state.edu:8080/ipt/resource.do?r=osum-amphibians

I get this error with the standard method to open the zip file:

> (ArchiveFactory/openArchive (File. "/tmp/dwca-osum-amphibians.zip") (File. "/tmp/dwca"))
UnsupportedArchiveException The archive given is a folder with more or less than 1 data files having a txt or csv suffix  org.gbif.dwc.text.ArchiveFactory.openArchive (ArchiveFactory.java:317)

If I try unzipping the archive first (same thing Aaron’s DWCA reader does), I get the same error.

So it turns out that if you use that openArchive method on a zipped DWCA, it unzips the file, then calls this method for handling unzipped DWCA directories. Let’s step through that method:

Line 274 looks for a meta.xml file in the uncompressed directory. In the problem DWCAs that file is called \meta.xml, so I think it doesn't find it and therefore skips the if statement on line 277. Instead, it falls to the else statement on line 287, which has a comment that says "currently support a single data file or a folder which contains a single data file".
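A quick way to double-check that reading is to list the unzipped folder and reproduce the two conditions the reader cares about (a diagnostic sketch of the logic described above, not the reader's actual code):

(import '(java.io File))

;; Does the unzipped folder contain a file literally named "meta.xml",
;; and how many .txt/.csv data files does it hold? A folder with
;; "\meta.xml" plus two data files fails both of the checks described above.
(defn inspect-dwca-dir [dir-path]
  (let [names      (map #(.getName %) (.listFiles (File. dir-path)))
        data-files (filter #(re-find #"\.(txt|csv)$" %) names)]
    {:file-names      names
     :meta-xml-found? (boolean (some #{"meta.xml"} names))
     :data-file-count (count data-files)}))

;; e.g. (inspect-dwca-dir "/tmp/dwca-osum-amphibians")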

Sounds like our problem… If I remove the multimedia.txt file from that folder, it opens just fine. Here's what happens without it. I'm grabbing the last record and extracting all the fields, mixing the GBIF Java methods with Aaron's field-vals function:

> (-> (File. "/tmp/dwca-osum-amphibians")
       ArchiveFactory/openArchive 
       .iteratorDwc
       iterator-seq
       vec
       last
       field-vals)
["urn:lsid:biosci.ohio-state.edu:osuc_occurrences:OSUM__Amphibians__997" nil nil nil nil nil nil nil "OSUM Amphibians 997" nil nil nil "United States" nil "Marion" "2012" nil "40.2503" "-83.0002" nil nil nil nil nil nil nil nil nil nil nil nil nil "#101, 11 Jul 1942, Species on tag: Ambystoma texanum; Shelf Unit: 1; Tray Number: 19" nil nil nil nil nil nil nil "Locality: Tetrapods" "GeoNames (Auto-Correction)" nil nil nil nil nil nil nil nil nil nil nil nil nil "12" nil nil nil nil nil nil nil nil "adult" nil "1/2 mi S of Marion, Marion Co., Ohio" nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil "Rausch, R. (Robert)" nil "none specified" "undetermined" nil "Ohio" nil nil nil nil nil nil "11 July 1942" nil nil nil nil nil nil nil nil nil nil nil nil nil "PreservedSpecimen" nil nil nil nil nil nil nil nil nil nil nil nil nil nil "Urodela" nil nil nil nil nil nil "Ambystoma texanum" "urn:lsid:biosci.ohio-state.edu:osuc_names:316947" nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil "Amphibians" nil nil nil "Ohio State University - Amphibian Division, Columbus, OH (OSUM)" nil nil]
tucotuco commented 10 years ago

Reported to GBIF at http://dev.gbif.org/issues/browse/POR-2396.

robinkraft commented 10 years ago

Nice! Thanks @tucotuco.

tucotuco commented 10 years ago

@robinkraft Which version of the dwca-reader were you testing against for the ArchiveFactory/openArchive call? GBIF thinks it may already have been fixed.

tucotuco commented 10 years ago

[8/21/14 1:27:42 PM] Tim: can you please test that on 1.19-SNAPSHOT?
[8/21/14 1:27:50 PM] Tim: I think we fixed this a long time ago

Ok. Please also note its own dependencies:

http://repository.gbif.org/content/repositories/snapshots/org/gbif/dwca-reader/1.19-SNAPSHOT/dwca-reader-1.19-20140710.190941-1.pom

and in particular:

the dependency versions 1.8, 0.16, and 1.0.2.2. I suspect this might actually have been fixed in gbif-common....

robinkraft commented 10 years ago

Latest version I thought, whatever that is. And I was pointing at code in the master branch. Looks like Tim committed a fix this morning.

tucotuco commented 9 years ago

Tim committed hack 4c27248c125b38c1d4af0137693c65963ae4bc56 for the old IPT bug affecting archives with additional files. Looks like the snapshot is here:

http://repository.gbif.org/content/repositories/snapshots/org/gbif/dwca-reader/1.19-SNAPSHOT/
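For reference, pulling that snapshot into the Clojure build would look roughly like this in project.clj (a sketch only, assuming a Leiningen project; the project name and Clojure version are placeholders, and the repository is the one linked above):

;; project.clj sketch: depend on the 1.19-SNAPSHOT dwca-reader from the
;; GBIF snapshot repository.
(defproject dwca-reader-clj "0.1.0-SNAPSHOT"  ; hypothetical coordinates
  :repositories [["gbif-snapshots"
                  {:url "http://repository.gbif.org/content/repositories/snapshots/"
                   :snapshots true}]]
  :dependencies [[org.clojure/clojure "1.5.1"]          ; placeholder version
                 [org.gbif/dwca-reader "1.19-SNAPSHOT"]])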

tucotuco commented 9 years ago

From @robinkraft: We're just about set to polish this thing off, but I just ran into a separate issue with Ohio State's resources: they name their directories incorrectly, or at least inconsistently.

Check out a standard archive here:

http://ipt.vertnet.org:8080/ipt/resource.do?r=ccber_mammals

If you unzip it, you get a folder called "dwca-ccber_mammals". Easy! The dwca-reader-clj expects that predictable, one-to-one match between the resource name and the directory structure within the zip file.

Ohio State's resource, or at least the one I looked at, unzips to a folder called "dwca_419", which breaks dwca-reader-clj even when I modify it to use the ArchiveFactory properly to deal with the extra media files. I could inspect the zip file to ensure we use the correct directory path, but I haven't done that before, so it won't be a five-minute thing.
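If it comes to that, the inspection itself is fairly small; something along these lines would recover whatever top-level folder the zip actually uses (a sketch, not what dwca-reader-clj currently does):

(import '(java.util.zip ZipFile))

;; Peek inside the archive and return the distinct top-level names it
;; contains, so the caller can build the real path ("dwca_419", say)
;; instead of assuming it matches the resource name.
(defn top-level-names [zip-path]
  (with-open [zf (ZipFile. zip-path)]
    (->> (enumeration-seq (.entries zf))
         (map #(.getName %))
         (map #(re-find #"^[^/]+" %))
         (into #{}))))

;; e.g. (top-level-names "/tmp/dwca-osum-birds.zip")  ; hypothetical path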

tucotuco commented 9 years ago

I believe that non-standard directory must be a legacy from the earlier installation. We can ask them to redo that one resource and in the meantime try to publish the archive to our IPTstrays for harvesting. We'll call this one closed from the gulo perspective.