damballa / parkour

Hadoop MapReduce in idiomatic Clojure.
Apache License 2.0

Need mapper to access local file system for input #19

Closed: stanfea closed this issue 9 years ago

stanfea commented 9 years ago

Hi,

In relation to this issue: http://stackoverflow.com/questions/10107665/run-a-local-file-system-directory-as-input-of-a-mapper-in-cluster

I'm trying to run my jobs on Hadoop. I've built an uberjar and am launching it with hadoop jar project.jar. I have a custom input format that builds a dseq of file maps of HDF files to process, e.g. {"20101201" ["/a.hdf" "/b.hdf"]}.

These files are on the local file system. I don't want to pre-load them onto HDFS, because the whole point of this upload job is to copy the files each mapper needs onto HDFS (roughly the copy step sketched below).
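For context, the per-file operation the upload job performs is essentially Hadoop's copy-from-local; this is only an illustrative sketch with made-up paths, not my actual job code:

```clojure
(import '(org.apache.hadoop.conf Configuration)
        '(org.apache.hadoop.fs FileSystem Path))

;; Illustrative only: copy one NAS-local HDF file into HDFS.
;; Both the source and destination paths are made-up examples.
(let [conf (Configuration.)
      dst  (Path. "hdfs:///ingest/20101201/a.hdf")
      fs   (.getFileSystem dst conf)]
  (.copyFromLocalFile fs (Path. "file:/mnt/nas/a.hdf") dst))
```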

I'm using hdf-java to open the HDF files. This library uses a native method call to open the HDF file, and it seems to be looking on the HDFS file system, because the call fails with a file-not-found error. I've printed the fpath it uses and confirmed the file exists and is readable. Everything works when I run the job with lein run.

In the mapper, just before reading the HDF file with hdf-java, I tried (->> "/" clojure.java.io/file file-seq (take 10)) and confirmed that it lists files on my local file system.

So I'm guessing the difference is that it's a native method call, though I'm confused why the file-seq sees the local file system while the native open doesn't. Is there any way to get this to work?

Thanks,

Stefan

llasram commented 9 years ago

I'm not entirely clear on the scenario. How (if at all) are you ensuring that your job tasks run on the nodes which have the intended inputs on their local filesystems?

stanfea commented 9 years ago

Oops, sorry, I forgot to mention: the input files are on a NAS mounted on all nodes.

llasram commented 9 years ago

In that case, I don't think this is a Parkour or Hadoop issue. If the files really are on a locally-visible filesystem, then access via normal local-filesystem APIs should see them; there isn't any magic blocking local access or substituting HDFS. The only way I can think of to unintentionally look on HDFS would be to represent the file paths as Hadoop Path objects and then open them via e.g. clojure.java.io/input-stream, which Parkour does extend to resolve through the default configured Hadoop filesystem (see the sketch below). Perhaps verify that your HDF-access functions work as expected in isolation and in a local-mode unit-test job?
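To make the distinction concrete, here is a rough sketch of the three access patterns in plain Clojure/Hadoop interop (not Parkour-specific code; the path is made up):

```clojure
(require '[clojure.java.io :as io])
(import '(org.apache.hadoop.conf Configuration)
        '(org.apache.hadoop.fs FileSystem Path))

(def sample-path "/mnt/nas/a.hdf") ; made-up NAS-mounted path

;; 1. Plain java.io access always reads the node-local filesystem,
;;    regardless of any Hadoop configuration.
(defn read-first-byte-local [p]
  (with-open [in (io/input-stream (io/file p))]
    (.read in)))

;; 2. A scheme-less Hadoop Path resolves against fs.defaultFS, which on
;;    a cluster is normally HDFS -- the same string then names a
;;    different (and here probably nonexistent) file.
(defn read-first-byte-default-fs [p]
  (let [path (Path. p)
        fs   (.getFileSystem path (Configuration.))]
    (with-open [in (.open fs path)]
      (.read in))))

;; 3. Forcing local access through the Hadoop API, via the local
;;    FileSystem and an explicit file: URI.
(defn read-first-byte-forced-local [p]
  (let [fs (FileSystem/getLocal (Configuration.))]
    (with-open [in (.open fs (Path. (str "file:" p)))]
      (.read in))))
```

If your HDF files are opened with plain java.io or native calls on ordinary path strings (pattern 1), Hadoop and Parkour shouldn't be involved at all.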

Feel free to re-open this ticket if I've missed something, but I'm going to close it for now.

stanfea commented 9 years ago

Just for future reference, or in case someone finds this on Google:

I solved the problem by using GDAL to read the HDF files. There seems to be an issue with hdf-java's native HOpen method when run on Hadoop...

Update: it was an issue with the native libs not being found, but there was no indicative error from hdf-java. GDAL was more explicit about it, so I was able to fix it there. I haven't tried hdf-java again, but I'm 99% sure that was the problem.
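For anyone hitting the same thing: the likely fix is making the native libraries visible to the task JVMs, along these lines (the property name is the standard Hadoop 2.x one, but the directory is made up and I haven't verified this exact recipe):

```clojure
(import '(org.apache.hadoop.conf Configuration))

;; Example only: point map-task JVMs at a directory containing the
;; hdf-java native libraries so System/loadLibrary can find them.
(doto (Configuration.)
  (.set "mapreduce.map.java.opts"
        "-Djava.library.path=/opt/hdf-java/lib"))
```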