VertNet / gulo

Shredding Darwin Core Archives with ferocity, strength, and Cascalog.

investigate new null pointer exception - deleting files after harvest #88

Closed robinkraft closed 11 years ago

robinkraft commented 11 years ago

The error below happens only for some resources, right at the end of harvesting the affected resource, probably when we call cio/delete-file-recursively. I'll get the full logs once all harvesting finishes.

It looks like it could just be a race condition: a file is deleted before the list of files to delete is updated, so Hadoop tries to delete it a second time and fails because the file is already gone.
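
To illustrate the suspected failure mode, here is a minimal sketch of a recursive delete that tolerates entries that have already disappeared. It is not the gulo or Hadoop code: it works on the local filesystem rather than S3, and the function name `delete_recursively` is made up for this example.

```python
import os


def delete_recursively(path):
    """Delete path and everything under it, ignoring entries that are already gone.

    Hypothetical helper for illustration only; it uses the local filesystem,
    not Hadoop's S3 filesystem.
    """
    try:
        entries = os.listdir(path)
    except FileNotFoundError:
        return  # path vanished before we could list it
    except NotADirectoryError:
        entries = None  # path is a plain file, not a directory

    if entries is None:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # someone else deleted the file first
        return

    for name in entries:
        # The listing may be stale by now; each child handles its own absence.
        delete_recursively(os.path.join(path, name))

    try:
        os.rmdir(path)
    except FileNotFoundError:
        pass  # directory was removed concurrently
```

The point of the sketch is that a stale file listing becomes harmless when every delete treats "already gone" as success, so calling the function twice on the same path is safe.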

One course of action would be to run a test harvest of each individual resource that hit this error in the current run. If the error is repeatable for specific resources, we'll want to investigate further. Otherwise, we can chalk it up to the quirks of working with distributed systems.
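
A rough sketch of that repeatability check, assuming a hypothetical `harvest(resource)` callable that raises when a harvest fails. The function and the labels are invented for this example, and an exception thrown on a background thread, as in the trace below, would need extra plumbing to reach the caller.

```python
def classify_failures(harvest, resources, attempts=3):
    """Run harvest(resource) several times and label each resource by how often it failed.

    `harvest` is a hypothetical callable that raises an exception on failure.
    """
    report = {}
    for resource in resources:
        failures = 0
        for _ in range(attempts):
            try:
                harvest(resource)
            except Exception:
                failures += 1
        if failures == 0:
            report[resource] = "ok"
        elif failures == attempts:
            report[resource] = "repeatable"  # fails every time: investigate further
        else:
            report[resource] = "intermittent"  # fails only sometimes: likely a race
    return report
```

Resources labeled "repeatable" would be the ones worth a closer look; "intermittent" ones fit the race-condition explanation.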

3/07/12 04:31:16 INFO mapred.TaskRunner: Task 'attempt_local_0013_m_000004_0' done.
Exception in thread "flow" java.lang.NullPointerException
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:304)
        at cascading.tap.hadoop.util.Hadoop18TapUtil.cleanTempPath(Hadoop18TapUtil.java:222)
        at cascading.tap.hadoop.util.Hadoop18TapUtil.cleanupTapMetaData(Hadoop18TapUtil.java:185)
        at cascading.flow.hadoop.HadoopFlowStep.cleanTapMetaData(HadoopFlowStep.java:272)
        at cascading.flow.hadoop.HadoopFlowStep.clean(HadoopFlowStep.java:257)
        at cascading.flow.hadoop.HadoopFlowStep.clean(HadoopFlowStep.java:69)
        at cascading.flow.planner.BaseFlowStep.clean(BaseFlowStep.java:641)
        at cascading.flow.hadoop.HadoopFlow.cleanTemporaryFiles(HadoopFlow.java:204)
        at cascading.flow.hadoop.HadoopFlow.internalClean(HadoopFlow.java:251)
        at cascading.flow.BaseFlow.run(BaseFlow.java:1092)
        at cascading.flow.BaseFlow.access$100(BaseFlow.java:77)
        at cascading.flow.BaseFlow$1.run(BaseFlow.java:749)
        at java.lang.Thread.run(Thread.java:722)
"Done harvesting" "mammalogyspecimens"

robinkraft commented 11 years ago

No longer relevant - we're removing Cascalog entirely.