NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/

Default input file scheme is no longer HDFS but local with 0.7.0-RC3 #274

Closed: ebastien closed this issue 11 years ago

ebastien commented 11 years ago

After upgrading to 0.7.0-RC3-cdh3-SNAPSHOT, all my relative file paths are failing. I have to specify complete URLs such as hdfs://namenode/myfiles. I am not sure this is the expected behavior.
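
To illustrate the change, a minimal sketch of the two call styles ("myfiles" and "namenode" are placeholders):

    // Relative path: resolved against the default HDFS filesystem in
    // 0.7.0-RC2, but against the client's local filesystem in 0.7.0-RC3.
    val relative = fromTextFile("myfiles")

    // Fully qualified URL: behaves the same in both versions because the
    // scheme is explicit.
    val qualified = fromTextFile("hdfs://namenode/myfiles")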

etorreborre commented 11 years ago

This reminds me of something we ran into before. Can you please paste the error message you get?

ebastien commented 11 years ago

Sorry, I don't have the Hadoop cluster available right now to reproduce it. In the meantime, what I can say is that the error message states that the input file does not exist, and that is all. If I create a file on the local filesystem with the same relative path, the code loads it instead of looking for it on HDFS. In 0.7.0-RC2, whenever I used a file path without an explicit scheme, it was looked up on the default HDFS filesystem. In 0.7.0-RC3, it is looked up on the local filesystem of the client running the jar.
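
The underlying mechanism, as a hedged sketch using the plain Hadoop API (hostname and port are placeholders): an unqualified path has no scheme in its URI, so FileSystem.get falls back to whatever the default filesystem is configured to be.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()
    // With no cluster settings, the default filesystem is file:///, so an
    // unqualified path resolves to the local filesystem.
    println(FileSystem.get(new Path("myfiles").toUri, conf).getUri)

    // With the default filesystem pointing at HDFS (CDH3-era key), the
    // same unqualified path resolves against the cluster instead.
    conf.set("fs.default.name", "hdfs://namenode:8020")
    println(FileSystem.get(new Path("myfiles").toUri, conf).getUri)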

etorreborre commented 11 years ago

We are about to release 0.7.0 even if this fix is not in, but here is a workaround in the meantime:

val list = fromTextFile("path", check = Source.noInputCheck)

At least you should be able to run your code with that.
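
In context, the workaround would look something like this (a hypothetical minimal job; CopyJob, "myfiles" and "out" are placeholders, and the output side assumes the usual toTextFile/persist combination):

    import com.nicta.scoobi.Scoobi._

    object CopyJob extends ScoobiApp {
      def run() {
        // Skip the input existence check so the path is handed to Hadoop
        // as-is instead of being verified against the wrong filesystem.
        val lines = fromTextFile("myfiles", check = Source.noInputCheck)
        persist(toTextFile(lines, "out"))
      }
    }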

ebastien commented 11 years ago

Thanks, I'll try that. BTW, I've noticed this commit: 79fe1ddf0cee33e8864f75ecb7895eed8113c839, which seems to change the default filesystem used with the ClusterConfiguration. Do you think it might explain the behavior I see?

etorreborre commented 11 years ago

This commit just replaces some constants with their values. Actually, you could help me debug this issue by taking apart the default checking code (pathExists) and finding exactly which condition fails, by progressively replacing check = Source.noInputCheck with each of the conditions in pathExists:

  /** Determine whether a path exists or not. */
  def pathExists(p: Path, pathFilter: PathFilter = hiddenFilePathFilter)(implicit conf: Configuration): Boolean = tryOrElse {
    val fs = FileSystem.get(p.toUri, conf)
    (fs.isFile(p) && fs.exists(p)) || getFileStatus(p, pathFilter).nonEmpty
  }(false)

  /** Get a Set of FileStatus objects for a given Path. */
  def getFileStatus(path: Path, pathFilter: PathFilter = hiddenFilePathFilter)(implicit conf: Configuration): Seq[FileStatus] =
    tryOrElse {
      Option(FileSystem.get(path.toUri, conf).globStatus(new Path(path, "*"), pathFilter)).map(_.toSeq).getOrElse(Seq())
    }(Seq())

  private val hiddenFilePathFilter = new PathFilter {
    def accept(p: Path): Boolean = !p.getName.startsWith("_") && !p.getName.startsWith(".")
  }

Maybe FileSystem.get(path.toUri, conf).globStatus(new Path(path, "*"), pathFilter) doesn't return anything?

I'm sorry to dump the debugging on you, but we don't have a CDH3 cluster that I could use for this.
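
One way to split the check, as a hedged sketch mirroring the snippet above (the helper name is made up, and the filter is inlined rather than reusing the private hiddenFilePathFilter):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}

    // Hypothetical helper: evaluate each condition of pathExists
    // separately so the failing one shows up in the client logs.
    def debugPathExists(p: Path)(implicit conf: Configuration) {
      val notHidden = new PathFilter {
        def accept(q: Path) = !q.getName.startsWith("_") && !q.getName.startsWith(".")
      }
      val fs = FileSystem.get(p.toUri, conf)
      println("resolved filesystem: " + fs.getUri)  // file:/// or hdfs://...?
      println("fs.isFile(p):        " + fs.isFile(p))
      println("fs.exists(p):        " + fs.exists(p))
      val glob = Option(fs.globStatus(new Path(p, "*"), notHidden))
      println("globStatus(p/*):     " + glob.map(_.length))
    }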

xelax commented 11 years ago

I see the error on my HDP 1.2 cluster as well with 0.7.0 final. Here is the stack trace:

Exception in thread "main" java.lang.IllegalArgumentException: Can't instantiate public org.apache.hadoop.io.SequenceFile$Reader(org.apache.hadoop.fs.FileSystem,org.apache.hadoop.fs.Path,org.apache.hadoop.conf.Configuration) throws java.io.IOException : null
        at com.nicta.scoobi.impl.util.Compatibility$.newInstance(Compatibility.scala:116)
        at com.nicta.scoobi.impl.util.Compatibility$.newSequenceFileReader(Compatibility.scala:84)
        at com.nicta.scoobi.io.sequence.CheckedSeqSource$$anonfun$checkInputPathType$1.apply(SequenceInput.scala:171)
        at com.nicta.scoobi.io.sequence.CheckedSeqSource$$anonfun$checkInputPathType$1.apply(SequenceInput.scala:170)
        at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
        at com.nicta.scoobi.io.sequence.CheckedSeqSource.checkInputPathType(SequenceInput.scala:170)
        at com.nicta.scoobi.io.sequence.SeqSource$$anonfun$inputCheck$1.apply(SequenceInput.scala:149)
        at com.nicta.scoobi.io.sequence.SeqSource$$anonfun$inputCheck$1.apply(SequenceInput.scala:149)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at com.nicta.scoobi.io.sequence.SeqSource.inputCheck(SequenceInput.scala:149)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1.apply(ExecutionMode.scala:52)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1.apply(ExecutionMode.scala:49)
        at org.kiama.attribution.AttributionCore$CachedParamAttribute$$anon$1.apply(AttributionCore.scala:111)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1$$anonfun$apply$3.apply(ExecutionMode.scala:55)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1$$anonfun$apply$3.apply(ExecutionMode.scala:55)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1.apply(ExecutionMode.scala:55)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1.apply(ExecutionMode.scala:49)
        at org.kiama.attribution.AttributionCore$CachedParamAttribute$$anon$1.apply(AttributionCore.scala:111)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1$$anonfun$apply$3.apply(ExecutionMode.scala:55)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1$$anonfun$apply$3.apply(ExecutionMode.scala:55)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1.apply(ExecutionMode.scala:55)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1.apply(ExecutionMode.scala:49)
        at org.kiama.attribution.AttributionCore$CachedParamAttribute$$anon$1.apply(AttributionCore.scala:111)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1$$anonfun$apply$3.apply(ExecutionMode.scala:55)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1$$anonfun$apply$3.apply(ExecutionMode.scala:55)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1.apply(ExecutionMode.scala:55)
        at com.nicta.scoobi.impl.exec.ExecutionMode$$anonfun$checkSourceAndSinks$1$$anonfun$apply$1.apply(ExecutionMode.scala:49)
        at org.kiama.attribution.AttributionCore$CachedParamAttribute$$anon$1.apply(AttributionCore.scala:111)
        at com.nicta.scoobi.impl.exec.ExecutionMode$class.prepare(ExecutionMode.scala:41)
        at com.nicta.scoobi.impl.exec.HadoopMode.com$nicta$scoobi$impl$exec$HadoopMode$$super$prepare(HadoopMode.scala:57)
        at com.nicta.scoobi.impl.exec.HadoopMode$$anonfun$prepare$1.apply(HadoopMode.scala:57)
        at com.nicta.scoobi.impl.exec.HadoopMode$$anonfun$prepare$1.apply(HadoopMode.scala:57)
        at com.nicta.scoobi.impl.monitor.Loggable$LoggableObject.evaluated$lzycompute(Loggable.scala:38)
        at com.nicta.scoobi.impl.monitor.Loggable$LoggableObject.evaluated(Loggable.scala:38)
        at com.nicta.scoobi.impl.monitor.Loggable$LoggableObject.debug(Loggable.scala:49)
        at com.nicta.scoobi.impl.monitor.Loggable$LoggableObject.debug(Loggable.scala:48)
        at com.nicta.scoobi.impl.exec.HadoopMode.prepare(HadoopMode.scala:57)
        at com.nicta.scoobi.impl.exec.HadoopMode.execute(HadoopMode.scala:51)
        at com.nicta.scoobi.impl.exec.HadoopMode.execute(HadoopMode.scala:47)
        at com.nicta.scoobi.impl.Persister.persist(Persister.scala:44)
        at com.nicta.scoobi.impl.ScoobiConfigurationImpl.persist(ScoobiConfigurationImpl.scala:320)
        at com.nicta.scoobi.application.Persist$class.persist(Persist.scala:33)
        at com.ebay.scoobi.examples.Sojourner$.persist(Sojourner.scala:15)
        at com.ebay.scoobi.examples.Sojourner$.run(Sojourner.scala:36)
        at com.nicta.scoobi.application.ScoobiApp$$anonfun$main$1.apply$mcV$sp(ScoobiApp.scala:80)
        at com.nicta.scoobi.application.ScoobiApp$$anonfun$main$1.apply(ScoobiApp.scala:75)
        at com.nicta.scoobi.application.ScoobiApp$$anonfun$main$1.apply(ScoobiApp.scala:75)
        at com.nicta.scoobi.application.Hadoop$class.runOnCluster(Hadoop.scala:108)
        at com.ebay.scoobi.examples.Sojourner$.runOnCluster(Sojourner.scala:15)
        at com.nicta.scoobi.application.Hadoop$class.executeOnCluster(Hadoop.scala:65)
        at com.ebay.scoobi.examples.Sojourner$.executeOnCluster(Sojourner.scala:15)
        at com.nicta.scoobi.application.Hadoop$$anonfun$onCluster$1.apply(Hadoop.scala:51)
        at com.nicta.scoobi.application.InMemoryHadoop$class.withTimer(InMemory.scala:72)
        at com.ebay.scoobi.examples.Sojourner$.withTimer(Sojourner.scala:15)
        at com.nicta.scoobi.application.InMemoryHadoop$class.showTime(InMemory.scala:80)
        at com.ebay.scoobi.examples.Sojourner$.showTime(Sojourner.scala:15)
        at com.nicta.scoobi.application.Hadoop$class.onCluster(Hadoop.scala:51)
        at com.ebay.scoobi.examples.Sojourner$.onCluster(Sojourner.scala:15)
        at com.nicta.scoobi.application.Hadoop$class.onHadoop(Hadoop.scala:57)
        at com.ebay.scoobi.examples.Sojourner$.onHadoop(Sojourner.scala:15)
        at com.nicta.scoobi.application.ScoobiApp$class.main(ScoobiApp.scala:75)
        at com.ebay.scoobi.examples.Sojourner$.main(Sojourner.scala:15)
        at com.ebay.scoobi.examples.Sojourner.main(Sojourner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at com.nicta.scoobi.impl.util.Compatibility$.newInstance(Compatibility.scala:115)
        ... 80 more
Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs://xxxx:8020/sys/xx/2013/06/20/00/zzz/part-00000, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:796)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)

ebastien commented 11 years ago

It might be related, but in my case I only got an error message saying that the input file does not exist. No stack trace on my side...

etorreborre commented 11 years ago

This new issue is related to the Compatibility class I recently introduced to simplify the build w.r.t. CDH3/CDH4. My local tests seem to be working, but obviously I missed something. I'll fix that on Monday morning and will publish a 0.7.1.

etorreborre commented 11 years ago

I take this back. This is indeed the same problem under a different manifestation. Somehow FileSystem.getLength(path) fails because the FileSystem is a local one.
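
For reference, the usual way to avoid this class of failure is to resolve the filesystem from the path itself rather than from the default configuration; a minimal sketch against the Hadoop 1 API (the path is a placeholder):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.SequenceFile

    val conf = new Configuration()
    val path = new Path("hdfs://namenode:8020/data/part-00000")
    // path.getFileSystem honours the scheme of the path, so a qualified
    // hdfs:// path yields HDFS even when the default filesystem is file:///.
    val fs = path.getFileSystem(conf)
    val reader = new SequenceFile.Reader(fs, path, conf)
    reader.close()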

xelax commented 11 years ago

No worries, I reverted back to RC2 at work, so enjoy the weekend. Alex

etorreborre commented 11 years ago

I think I found the problem, and my apologies to Emmanuel: you were right to point at commit 79fe1dd. I messed up the constant name change between CDH3 and CDH4. Can you please test 0.8.0-cdh3-SNAPSHOT when you have some time? If that works OK I'll publish a 0.7.1.
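
For anyone following along, these are the two default-filesystem keys that are easy to mix up (standard Hadoop configuration, not Scoobi-specific; the URL is a placeholder):

    import org.apache.hadoop.conf.Configuration

    val conf = new Configuration()
    // Hadoop 1.x / CDH3 name of the default filesystem key:
    conf.set("fs.default.name", "hdfs://namenode:8020")
    // Hadoop 2.x / CDH4 name (the old key survives as a deprecated alias):
    conf.set("fs.defaultFS", "hdfs://namenode:8020")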

xelax commented 11 years ago

I ran 0.8.0-cdh3-SNAPSHOT and can confirm it fixed my problem on our cluster. Thanks!

etorreborre commented 11 years ago

Thanks Alex for testing this. I deployed a 0.7.1-cdh4/cdh3 version with the fix.