Closed: rbraley closed this issue 10 years ago
Hi,
The error stems from the lack of a Hadoop property that we use to identify the running task. I'm not sure whether Spark sets it; it looks like Spark sets it internally when writing to Hadoop, but it's unclear whether it is set when reading from Hadoop.
The current master tries a different property and additionally logs a long error message listing all the properties. Could you post/upload that properties dump somewhere so I can take a closer look?
Thanks.
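To illustrate the kind of fallback lookup described above, here is a minimal sketch, not the connector's actual code: probe a few well-known Hadoop property names for a task id and take the first one that is set. The candidate key names and the `TaskIdProbe` helper are assumptions for illustration only.

```scala
// Illustrative sketch only -- not the connector's actual implementation.
// Probe candidate Hadoop property names for a task id, returning the
// first one that is present in the supplied properties.
object TaskIdProbe {
  // candidate keys (assumed; old- and new-style Hadoop naming)
  val candidates = Seq("mapred.task.id", "mapreduce.task.attempt.id")

  def findTaskId(props: java.util.Properties): Option[String] =
    candidates.flatMap(k => Option(props.getProperty(k))).headOption
}
```

With this shape, a missing task id surfaces as `None` rather than an exception, which makes the "cannot determine task id" case explicit and easy to log.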
Here is a properties dump from my environment as described in #144
14/02/24 16:55:58 ERROR EsInputFormat: Cannot determine task id - current properties are {mapred.task.cache.levels=2, ha.failover-controller.cli-check.rpc-timeout.ms=20000, mapred.job.restart.recover=true, ipc.client.connect.max.retries.on.timeouts=45, map.sort.class=org.apache.hadoop.util.QuickSort, hadoop.tmp.dir=/tmp/hadoop-${user.name}, es.internal.mr.target.resource=logstash-batanga-radio-2014.02.18/logs, ha.health-monitor.check-interval.ms=1000, ipc.client.idlethreshold=4000, mapred.system.dir=${hadoop.tmp.dir}/mapred/system, kfs.blocksize=67108864, fs.trash.checkpoint.interval=0, mapred.job.tracker.persist.jobstatus.hours=0, io.skip.checksum.errors=false, mapred.cluster.reduce.memory.mb=-1, mapred.child.tmp=./tmp, es.internal.es.version=0.90.5, mapred.skip.reduce.max.skip.groups=0, mapred.heartbeats.in.second=100, mapred.jobtracker.instrumentation=org.apache.hadoop.mapred.JobTrackerMetricsInst, mapred.tasktracker.dns.nameserver=default, fs.defaultFS=file:///, io.sort.factor=10, mapred.task.timeout=600000, mapred.max.tracker.failures=4, hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.StandardSocketFactory, mapred.job.tracker.jobhistory.lru.cache.size=5, kfs.replication=3, mapred.skip.map.auto.incr.proc.count=true, mapreduce.job.complete.cancel.delegation.tokens=true, io.mapfile.bloom.size=1048576, hadoop.rpc.protection=authentication, es.query=
{
"query": {
"match_all": {}
},
"fields": [
"btng.listenerdjid"
]
}
, mapreduce.reduce.shuffle.connect.timeout=180000, hadoop.ssl.require.client.cert=false, hadoop.skip.worker.version.check=false, tasktracker.http.threads=40, mapred.job.shuffle.merge.percent=0.66, io.bytes.per.checksum=512, mapred.output.compress=false, mapred.healthChecker.script.timeout=600000, file.stream-buffer-size=4096, ha.failover-controller.new-active.rpc-timeout.ms=60000, mapred.reduce.slowstart.completed.maps=0.05, mapred.reduce.max.attempts=4, es.ser.reader.value.class=org.elasticsearch.hadoop.mr.WritableValueReader, ha.zookeeper.acl=world:anyone:rwcda, mapreduce.ifile.readahead.bytes=4194304, fs.ftp.host.port=21, mapred.skip.map.max.skip.records=0, kfs.client-write-packet-size=65536, kfs.bytes-per-checksum=512, mapred.cluster.map.memory.mb=-1, hadoop.security.group.mapping=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback, hadoop.ssl.keystores.factory.class=org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory, s3.replication=3, net.topology.node.switch.mapping.impl=org.apache.hadoop.net.ScriptBasedMapping, mapred.job.tracker.persist.jobstatus.dir=/jobtracker/jobsInfo, fs.s3.buffer.dir=${hadoop.tmp.dir}/s3, job.end.retry.attempts=0, s3native.bytes-per-checksum=512, mapred.local.dir.minspacestart=0, mapred.output.compression.type=RECORD, s3.client-write-packet-size=65536, io.mapfile.bloom.error.rate=0.005, ftp.bytes-per-checksum=512, mapred.cluster.max.reduce.memory.mb=-1, mapred.max.tracker.blacklists=4, mapred.task.profile.maps=0-2, hadoop.security.group.mapping.ldap.search.attr.group.name=cn, mapred.userlog.retain.hours=24, ha.health-monitor.rpc-timeout.ms=45000, mapred.job.tracker.persist.jobstatus.active=false, hadoop.security.authorization=false, local.cache.size=10737418240, s3.bytes-per-checksum=512, mapreduce.shuffle.ssl.enabled=${hadoop.ssl.enabled}, mapred.min.split.size=0, mapred.map.tasks=2, mapred.child.java.opts=-Xmx200m, mapred.map.child.log.level=INFO, mapred.job.queue.name=default, 
mapred.job.tracker.retiredjobs.cache.size=1000, ipc.server.listen.queue.size=128, mapred.inmem.merge.threshold=1000, job.end.retry.interval=30000, mapred.skip.attempts.to.start.skipping=2, s3native.blocksize=67108864, mapred.reduce.tasks=1, mapred.merge.recordsBeforeProgress=10000, mapred.userlog.limit.kb=0, file.replication=1, mapred.job.reduce.memory.mb=-1, ftp.client-write-packet-size=65536, hadoop.work.around.non.threadsafe.getpwuid=false, mapred.job.shuffle.input.buffer.percent=0.70, io.sort.spill.percent=0.80, mapreduce.shuffle.ssl.port=50443, hadoop.http.staticuser.user=dr.who, mapred.map.tasks.speculative.execution=true, hadoop.http.authentication.type=simple, hadoop.util.hash.type=murmur, hadoop.security.instrumentation.requires.admin=false, mapred.map.max.attempts=4, mapreduce.job.acl-view-job= , mapreduce.ifile.readahead=true, io.map.index.interval=128, mapred.job.tracker.handler.count=10, mapreduce.reduce.shuffle.read.timeout=180000, mapred.tasktracker.expiry.interval=600000, hadoop.ssl.client.conf=ssl-client.xml, mapred.reduce.child.log.level=INFO, mapred.jobtracker.maxtasks.per.job=-1, mapred.jobtracker.job.history.block.size=3145728, keep.failed.task.files=false, hadoop.kerberos.kinit.command=kinit, ipc.client.tcpnodelay=false, mapred.task.profile.reduces=0-2, fs.AbstractFileSystem.hdfs.impl=org.apache.hadoop.fs.Hdfs, mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec, io.map.index.skip=0, hadoop.http.authentication.token.validity=36000, ipc.server.tcpnodelay=false, hadoop.jetty.logs.serve.aliases=true, ftp.replication=3, ha.failover-controller.graceful-fence.connection.retries=1, jobclient.progress.monitor.poll.interval=1000, ha.health-monitor.sleep-after-disconnect.ms=1000, es.resource=logstash-batanga-radio-2014.02.18/logs, mapred.job.map.memory.mb=-1, file.client-write-packet-size=65536, mapred.reduce.tasks.speculative.execution=true, fs.AbstractFileSystem.viewfs.impl=org.apache.hadoop.fs.viewfs.ViewFs, 
hadoop.security.group.mapping.ldap.search.filter.group=(objectClass=group), mapreduce.tasktracker.outofband.heartbeat=false, mapreduce.reduce.input.limit=-1, fs.s3n.block.size=67108864, net.topology.script.number.args=100, dfs.ha.fencing.ssh.connect-timeout=30000, hadoop.security.authentication=simple, tfile.fs.output.buffer.size=262144, mapred.job.reuse.jvm.num.tasks=1, mapred.jobtracker.completeuserjobs.maximum=100, hadoop.security.groups.cache.secs=300, ha.failover-controller.graceful-fence.rpc-timeout.ms=5000, fs.AbstractFileSystem.file.impl=org.apache.hadoop.fs.local.LocalFs, mapred.task.tracker.task-controller=org.apache.hadoop.mapred.DefaultTaskController, ha.health-monitor.connect-retry-interval.ms=1000, kfs.stream-buffer-size=4096, fs.s3.maxRetries=4, mapred.cluster.max.map.memory.mb=-1, file.blocksize=67108864, mapreduce.reduce.shuffle.maxfetchfailures=10, fs.ftp.host=0.0.0.0, file.bytes-per-checksum=512, ha.zookeeper.parent-znode=/hadoop-ha, mapreduce.job.acl-modify-job= , mapred.local.dir=${hadoop.tmp.dir}/mapred/local, fs.s3.sleepTimeSeconds=10, fs.trash.interval=0, mapred.submit.replication=10, hadoop.relaxed.worker.version.check=true, mapred.map.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec, mapred.tasktracker.dns.interface=default, ftp.stream-buffer-size=4096, mapred.job.tracker=local, hadoop.http.authentication.signature.secret.file=${user.home}/hadoop-http-auth-signature-secret, io.seqfile.sorter.recordlimit=1000000, s3.blocksize=67108864, mapreduce.tasktracker.cache.local.numberdirectories=10000, mapred.jobtracker.taskScheduler=org.apache.hadoop.mapred.JobQueueTaskScheduler, mapred.line.input.format.linespermap=1, fs.permissions.umask-mode=022, mapred.tasktracker.instrumentation=org.apache.hadoop.mapred.TaskTrackerMetricsInst, hadoop.ssl.server.conf=ssl-server.xml, mapreduce.jobtracker.split.metainfo.maxsize=10000000, jobclient.completion.poll.interval=5000, mapred.local.dir.minspacekill=0, s3native.stream-buffer-size=4096, 
io.sort.record.percent=0.05, hadoop.http.authentication.kerberos.principal=HTTP/_HOST@LOCALHOST, mapred.temp.dir=${hadoop.tmp.dir}/mapred/temp, mapred.tasktracker.reduce.tasks.maximum=2, mapred.tasktracker.tasks.sleeptime-before-sigkill=5000, mapred.job.reduce.input.buffer.percent=0.0, mapred.tasktracker.indexcache.mb=10, es.internal.hosts=, hadoop.security.group.mapping.ldap.search.filter.user=(&(objectClass=user)(sAMAccountName={0})), fs.automatic.close=true, mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.MapTask$MapOutputBuffer, mapred.skip.reduce.auto.incr.proc.count=true, s3.stream-buffer-size=4096, ha.zookeeper.session-timeout.ms=5000, io.seqfile.compress.blocksize=1000000, hadoop.http.filter.initializers=org.apache.hadoop.http.lib.StaticUserWebFilter, fs.s3.block.size=67108864, mapred.tasktracker.taskmemorymanager.monitoring-interval=5000, hadoop.http.authentication.simple.anonymous.allowed=true, mapred.acls.enabled=false, mapred.queue.default.state=RUNNING, mapreduce.jobtracker.staging.root.dir=${hadoop.tmp.dir}/mapred/staging, ftp.blocksize=67108864, mapreduce.shuffle.ssl.address=0.0.0.0, mapred.queue.names=default, mapred.task.tracker.http.address=0.0.0.0:50060, mapred.disk.healthChecker.interval=60000, mapred.reduce.parallel.copies=5, io.seqfile.lazydecompress=true, hadoop.common.configuration.version=0.23.0, hadoop.ssl.enabled=false, hadoop.security.group.mapping.ldap.search.attr.member=member, io.sort.mb=100, ipc.client.connection.maxidletime=10000, mapred.task.tracker.report.address=127.0.0.1:0, mapred.compress.map.output=false, hadoop.security.uid.cache.secs=14400, mapred.healthChecker.interval=60000, ipc.client.kill.max=10, ipc.client.connect.max.retries=10, io.seqfile.local.dir=${hadoop.tmp.dir}/io/local, mapred.user.jobconf.limit=5242880, mapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapred.ReduceTask$ReduceCopier, io.native.lib.available=true, mapred.job.tracker.http.address=0.0.0.0:50030, 
io.file.buffer.size=4096, mapred.jobtracker.restart.recover=false, io.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization, tfile.fs.input.buffer.size=262144, mapred.task.profile=false, hadoop.security.group.mapping.ldap.ssl=false, jobclient.output.filter=FAILED, fs.df.interval=60000, s3native.client-write-packet-size=65536, hadoop.http.authentication.kerberos.keytab=${user.home}/hadoop.keytab, s3native.replication=3, mapred.tasktracker.map.tasks.maximum=2, tfile.io.chunk.size=1048576, hadoop.ssl.hostname.verifier=DEFAULT}
Thanks. I've pushed a fix to master which ignores the task id in the heartbeat (since it's optional). However, this still looks like a bug to me, since several Hadoop properties that should be present are missing. I'll probably raise this on the Spark mailing list at some point (if you guys do that in the meantime, even better).
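The workaround described, treating the task id as optional for heartbeat purposes, could be sketched like this. The `HeartbeatId` object and the placeholder format are hypothetical; this is not the actual commit.

```scala
// Illustrative sketch of the workaround: when no task id property is
// set, fall back to a generated placeholder instead of failing, since
// the id is only needed for heartbeat/reporting and is effectively
// optional.
object HeartbeatId {
  def resolve(props: java.util.Properties): String =
    Option(props.getProperty("mapred.task.id"))
      .getOrElse("task_" + java.util.UUID.randomUUID().toString)
}
```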
Either way, please try the latest master and let me know how it goes. Thanks!
@fedesilva @rbraley guys, if you are available in 2h or so maybe we can connect on IRC to speed things up? Just in case there are still bugs, we can chat directly and I'll do my best to fix the issues as they appear.
Let me know if that works for you guys.
The last master fixes the NPE. I have successfully iterated over an RDD.
I am now getting the following, which does not seem to abort the tasks.
14/02/24 18:47:56 WARN HadoopRDD: Exception in RecordReader.close()
java.lang.NullPointerException
at org.elasticsearch.hadoop.mr.ReportingUtils.report(ReportingUtils.java:38)
at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.close(EsInputFormat.java:274)
at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:174)
at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:57)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
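The trace shows the NPE originating in the reporting call made from close(). A minimal sketch of the defensive pattern that avoids this failure mode, using hypothetical names rather than the connector's actual classes: model the reporter as optional and only report when one is present.

```scala
// Illustrative sketch only. A record reader that reports a counter on
// close(); modelling the reporter as an Option means close() cannot
// NPE when no reporter was ever wired in (the failure mode above).
trait ProgressReporter {
  def incrCounter(name: String, delta: Long): Unit
}

class ShardReaderSketch(reporter: Option[ProgressReporter]) {
  private var docsRead = 0L

  def read(n: Long): Unit = docsRead += n

  def close(): Unit =
    // report only if a reporter is present; unconditionally calling
    // incrCounter on a null reference would throw NullPointerException
    reporter.foreach(_.incrCounter("docs.read", docsRead))
}
```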
Regarding IRC: OK, I'm online in the elasticsearch IRC channel right now as fedesilva and will be available until 23:00 GMT. I am at GMT-3 (Montevideo).
Can you please try the latest master? It should address the NPE (since the reporting facility appears to be disabled).
Costin
Cool - see you in a bit (my nick is typically costin). :)
Costin
Ok, see you soon.
As this is fixed, I'm closing the issue. Guys, please keep filing issues if you encounter any problems, but do mention them under the new issue #151.
Hi guys, #152 is also related to this. I will hang out in IRC as rbraley for a while to try to resolve issues together.
Not sure if I am missing something, but I thought this error might be upstream. Here is how to reproduce it.
build.sbt:
SimpleApp.scala:
Here's the stack trace: