The error below only happens for some resources, right at the end of harvesting the resource, probably when we call cio/delete-file-recursively. I'll post the full logs once all harvesting finishes.

It looks like it could be a race: a file is deleted before the list of files to delete is updated, so Hadoop tries to delete it again and errors because the file is already gone.

One course of action would be to do a test harvest of just the resources that hit this error this time around. If the error is reproducible for specific resources, we'll want to investigate further; otherwise, we can chalk it up to the quirks of working with distributed systems.
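If the race theory holds, one defensive option is to make the recursive delete idempotent: treat an already-missing file as success rather than an error. Below is a minimal sketch of that idea using plain java.nio rather than the Hadoop FileSystem API the stack trace actually goes through; the class and method names are hypothetical, not part of our codebase.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch: delete a directory tree while tolerating files
// that another process has already removed (the suspected race above).
public class IdempotentDelete {

    public static void deleteRecursively(Path root) throws IOException {
        if (Files.notExists(root)) {
            return; // already gone - treat as success, not an error
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // reverseOrder sorts deeper paths first, so children are
            // deleted before their parent directories
            paths.sorted(Comparator.reverseOrder())
                 .forEach(IdempotentDelete::deleteQuietly);
        }
    }

    private static void deleteQuietly(Path p) {
        try {
            // deleteIfExists returns false instead of throwing when the
            // file vanished between listing and deletion
            Files.deleteIfExists(p);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("harvest-test");
        Files.createFile(dir.resolve("part-00000"));
        deleteRecursively(dir);
        // a second call is a no-op rather than an error
        deleteRecursively(dir);
        System.out.println(Files.notExists(dir));
    }
}
```

The NPE itself is thrown inside NativeS3FileSystem.delete, so the real fix may need to live in (or work around) that layer; this sketch only illustrates the "tolerate missing files" pattern.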
3/07/12 04:31:16 INFO mapred.TaskRunner: Task 'attempt_local_0013_m_000004_0' done.
Exception in thread "flow" java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:304)
    at cascading.tap.hadoop.util.Hadoop18TapUtil.cleanTempPath(Hadoop18TapUtil.java:222)
    at cascading.tap.hadoop.util.Hadoop18TapUtil.cleanupTapMetaData(Hadoop18TapUtil.java:185)
    at cascading.flow.hadoop.HadoopFlowStep.cleanTapMetaData(HadoopFlowStep.java:272)
    at cascading.flow.hadoop.HadoopFlowStep.clean(HadoopFlowStep.java:257)
    at cascading.flow.hadoop.HadoopFlowStep.clean(HadoopFlowStep.java:69)
    at cascading.flow.planner.BaseFlowStep.clean(BaseFlowStep.java:641)
    at cascading.flow.hadoop.HadoopFlow.cleanTemporaryFiles(HadoopFlow.java:204)
    at cascading.flow.hadoop.HadoopFlow.internalClean(HadoopFlow.java:251)
    at cascading.flow.BaseFlow.run(BaseFlow.java:1092)
    at cascading.flow.BaseFlow.access$100(BaseFlow.java:77)
    at cascading.flow.BaseFlow$1.run(BaseFlow.java:749)
    at java.lang.Thread.run(Thread.java:722)
"Done harvesting" "mammalogyspecimens"