apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Job failed with state FAILED due to: NA #6997

Closed marklit closed 4 years ago

marklit commented 5 years ago

I'm running a single machine with 24 GB of RAM, 1 TB of disk, Ubuntu 16.04.2 LTS, Hadoop 2.8.3 with HDFS set up, ZooKeeper 3.8.4, Druid 0.13.0, MySQL, and MySQL Connector 5.1.28. I followed https://tech.marksblogg.com/hadoop-3-single-node-install-guide.html for my Hadoop installation steps, with the exception of using Hadoop 2.8.3 instead of 3.0.3.

Hadoop's name node runs on port 9000.

$ sudo vi /opt/hadoop/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000/</value>
    </property>
    <property>
        <name>fs.default.FS</name>
        <value>hdfs://localhost:9000/</value>
    </property>
</configuration>
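As an aside, `fs.defaultFS` (no dot before `FS`) is the canonical key in Hadoop 2, with `fs.default.name` as its deprecated alias, so the `fs.default.FS` entry above is presumably a typo. A minimal Python sketch for dumping the properties out of a `*-site.xml` file (shown here parsing an inline string with the corrected key, rather than the file on disk):

```python
import xml.etree.ElementTree as ET

# Inline stand-in for /opt/hadoop/etc/hadoop/core-site.xml, using the
# canonical fs.defaultFS key rather than the fs.default.FS typo.
CORE_SITE = """<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000/</value>
    </property>
</configuration>"""

def site_properties(xml_text):
    """Parse Hadoop *-site.xml content into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

print(site_properties(CORE_SITE)["fs.defaultFS"])  # hdfs://localhost:9000/
```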

These are my config file changes from the stock config files distributed with Druid:

/opt/druid/conf/druid/middleManager/runtime.properties

druid.indexer.runner.javaOpts=-server -Xmx4g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

/opt/druid/conf/druid/_common/common.runtime.properties

druid.extensions.loadList=["mysql-metadata-storage", "druid-hdfs-storage", "druid-histogram", "druid-datasketches", "druid-lookups-cached-global"]
druid.startup.logging.logProperties=true
druid.zk.service.host=127.0.0.1
druid.zk.paths.base=/druid
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://localhost:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=druid
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=/druid/indexing-logs
druid.selectors.indexing.serviceName=druid/overlord
druid.selectors.coordinator.serviceName=druid/coordinator
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.JvmMonitor"]
druid.emitter=logging
druid.emitter.logging.logLevel=info
druid.indexing.doubleStorage=double

/opt/druid/conf/druid/broker/jvm.config

-XX:MaxDirectMemorySize=8192m

/opt/druid/conf/druid/historical/jvm.config

-XX:MaxDirectMemorySize=8192m

I've set up segment storage on HDFS.

$ hdfs dfs -mkdir -p /druid/segments

I've copied over the config files from Hadoop to Druid:

$ sudo cp /opt/hadoop/etc/hadoop/{core,hdfs,yarn}-site.xml \
              /opt/druid/conf/druid/_common/

I've used the following to launch each of the nodes:

$ sudo /opt/druid/bin/middleManager.sh start
$ sudo /opt/druid/bin/coordinator.sh   start
$ sudo /opt/druid/bin/broker.sh        start
$ sudo /opt/druid/bin/overlord.sh      start
$ sudo /opt/druid/bin/historical.sh    start
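After launching, each service can be pinged on its `/status` endpoint to confirm it came up. A quick sketch, assuming the stock Druid 0.13 default ports (adjust if your `runtime.properties` override `druid.port`):

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

# Assumed default ports for a stock single-machine Druid 0.13 layout.
SERVICES = {
    "coordinator": 8081,
    "broker": 8082,
    "historical": 8083,
    "overlord": 8090,
    "middleManager": 8091,
}

def status_url(service, port, host="localhost"):
    """Build the /status health-check URL for one Druid service."""
    return "http://%s:%d/status" % (host, port)

def check_all(services=SERVICES):
    """Return {service: version-or-'DOWN'} for each configured service."""
    results = {}
    for name, port in services.items():
        try:
            with urlopen(status_url(name, port), timeout=5) as resp:
                results[name] = json.load(resp).get("version", "up")
        except URLError:
            results[name] = "DOWN"
    return results

if __name__ == "__main__":
    for name, state in check_all().items():
        print("%-14s %s" % (name, state))
```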

This is the indexing spec file I've created:

$ vi index.json
{
    "spec": {
        "dataSchema": {
            "dataSource": "sample",
            "granularitySpec": {
                "intervals": [
                    "2010-09-12/2018-09-13"
                ],
                "queryGranularity": "none",
                "segmentGranularity": "day",
                "type": "uniform"
            },
            "metricsSpec": [
                {
                    "name": "count",
                    "type": "count"
                }
            ],
            "parser": {
                "parseSpec": {
                    "columns": [
                        "timestamp",
                        "page",
                        "language",
                        "user",
                        "unpatrolled",
                        "newPage",
                        "robot",
                        "anonymous",
                        "namespace",
                        "continent",
                        "country",
                        "region",
                        "city",
                        "added",
                        "deleted",
                        "delta"
                    ],
                    "dimensionsSpec": {
                        "dimensions": [
                            "page",
                            "language",
                            "user",
                            "unpatrolled",
                            "newPage",
                            "robot",
                            "anonymous",
                            "namespace",
                            "continent",
                            "country",
                            "region",
                            "city"
                        ]
                    },
                    "format": "csv",
                    "timestampSpec": {
                        "column": "timestamp"
                    }
                },
                "type": "hadoopyString"
            }
        },
        "ioConfig": {
            "inputSpec": {
                "paths": "hdfs://localhost:9000/sample.csv",
                "type": "static"
            },
            "type": "hadoop"
        },
        "tuningConfig": {
            "jobProperties": {},
            "partitionsSpec": {
                "targetPartitionSize": 5000000,
                "type": "hashed"
            },
            "type": "hadoop"
        }
    },
    "type": "index_hadoop"
}
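Before submitting, it can be worth sanity-checking that the spec parses and carries the fields the Hadoop indexer needs, since a structurally-off spec tends to fail much later with an unhelpful message. A minimal sketch (the required-field list is my own shortlist, not an exhaustive schema):

```python
def check_spec(spec):
    """Lightweight sanity checks for an index_hadoop task spec."""
    problems = []
    if spec.get("type") != "index_hadoop":
        problems.append("type should be index_hadoop")
    inner = spec.get("spec", {})
    for key in ("dataSchema", "ioConfig", "tuningConfig"):
        if key not in inner:
            problems.append("missing spec.%s" % key)
    ts = (inner.get("dataSchema", {})
               .get("parser", {})
               .get("parseSpec", {})
               .get("timestampSpec", {}))
    if "column" not in ts:
        problems.append("missing timestampSpec.column")
    return problems

# Trimmed-down version of the spec above:
spec = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "parser": {"parseSpec": {"timestampSpec": {"column": "timestamp"}}}
        },
        "ioConfig": {},
        "tuningConfig": {},
    },
}
print(check_spec(spec))  # []
```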

This is the sample CSV file I created and uploaded to HDFS:

$ vi sample.csv

These are the contents of that file:

2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true","false","false","article","North America","United States","Bay Area","San Francisco",57,200,-143
2013-08-31T03:32:45Z,"Striker Eureka","en","speed","false","true","true","false","wikipedia","Australia","Australia","Cantebury","Syndey",459,129,330
2013-08-31T07:11:21Z,"Cherno Alpha","ru","masterYi","false","true","true","false","article","Asia","Russia","Oblast","Moscow",123,12,111
2013-08-31T11:58:39Z,"Crimson Typhoon","zh","triplets","true","false","true","false","wikipedia","Asia","China","Shanxi","Taiyuan",905,5,900
2013-08-31T12:41:27Z,"Coyote Tango","ja","cancer","true","false","true","false","wikipedia","Asia","Japan","Kanto","Tokyo",1,10,-9

This is the command I used to upload the CSV to HDFS:

$ hdfs dfs -copyFromLocal sample.csv /

I'm able to submit the job and see it running for some time in the Web UI / console.

$ wget -qO- \
     --post-file=index.json \
     --header=Content-Type:application/json \
     http://localhost:8090/druid/indexer/v1/task
{"task":"index_hadoop_sample_2019-01-30T08:42:33.753Z"}
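The same submission can be done from Python; the overlord replies with a small JSON document holding the task id, which is worth keeping so you can later poll `/druid/indexer/v1/task/<id>/status`. A sketch (the HTTP call assumes an overlord on localhost:8090, as above):

```python
import json
from urllib.request import Request, urlopen

OVERLORD = "http://localhost:8090"

def parse_task_id(response_body):
    """Extract the task id from the overlord's submission response."""
    return json.loads(response_body)["task"]

def submit_task(spec_path, overlord=OVERLORD):
    """POST an indexing spec file to the overlord; return the task id."""
    with open(spec_path, "rb") as f:
        req = Request(
            overlord + "/druid/indexer/v1/task",
            data=f.read(),
            headers={"Content-Type": "application/json"},
        )
    with urlopen(req) as resp:
        return parse_task_id(resp.read())

# The response body from the log above parses to the task id:
print(parse_task_id('{"task":"index_hadoop_sample_2019-01-30T08:42:33.753Z"}'))
```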

After some time the job fails with the following:

INFO [pool-34-thread-1] org.apache.hadoop.mapred.LocalJobRunner - Finishing task: attempt_local1651148817_0001_r_002922_0
INFO [Thread-142] org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete.
WARN [Thread-142] org.apache.hadoop.mapred.LocalJobRunner - job_local1651148817_0001
java.lang.Exception: java.io.IOException: The stream is closed
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[hadoop-mapreduce-client-common-2.8.3.jar:?]
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:556) [hadoop-mapreduce-client-common-2.8.3.jar:?]
Caused by: java.io.IOException: The stream is closed
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:118) ~[hadoop-common-2.8.3.jar:?]
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_201]
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_201]
    at java.io.DataOutputStream.flush(DataOutputStream.java:123) ~[?:1.8.0_201]
    at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[?:1.8.0_201]
    at org.apache.hadoop.hdfs.DataStreamer.closeStream(DataStreamer.java:987) ~[hadoop-hdfs-client-2.8.3.jar:?]
    at org.apache.hadoop.hdfs.DataStreamer.closeInternal(DataStreamer.java:839) ~[hadoop-hdfs-client-2.8.3.jar:?]
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:834) ~[hadoop-hdfs-client-2.8.3.jar:?]
    Suppressed: java.io.IOException: The stream is closed
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:118) ~[hadoop-common-2.8.3.jar:?]
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_201]
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_201]
        at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[?:1.8.0_201]
        at java.io.FilterOutputStream.close(FilterOutputStream.java:159) ~[?:1.8.0_201]
        at org.apache.hadoop.hdfs.DataStreamer.closeStream(DataStreamer.java:987) ~[hadoop-hdfs-client-2.8.3.jar:?]
        at org.apache.hadoop.hdfs.DataStreamer.closeInternal(DataStreamer.java:839) ~[hadoop-hdfs-client-2.8.3.jar:?]
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:834) ~[hadoop-hdfs-client-2.8.3.jar:?]
INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Job job_local1651148817_0001 failed with state FAILED due to: NA
INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Counters: 36
    File System Counters
        FILE: Number of bytes read=317393106818
        FILE: Number of bytes written=384764953180
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=380050873716
        HDFS: Number of bytes written=67748171945
        HDFS: Number of read operations=28543346
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=17085517
    Map-Reduce Framework
        Map input records=5
        Map output records=2923
        Map output bytes=55552
        Map output materialized bytes=78936
        Input split bytes=287
        Combine input records=0
        Combine output records=0
        Reduce input groups=2609
        Reduce shuffle bytes=78936
        Reduce input records=2609
        Reduce output records=0
        Spilled Records=2923
        Shuffled Maps =2923
        Failed Shuffles=0
        Merged Map outputs=2923
        GC time elapsed (ms)=9353
        Total committed heap usage (bytes)=5543438581760
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    org.apache.druid.indexer.HadoopDruidIndexerConfig$IndexJobCounters
        ROWS_PROCESSED_COUNTER=5
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=274762
ERROR [task-runner-0-priority-0] org.apache.druid.indexer.DetermineHashedPartitionsJob - Job failed: job_local1651148817_0001
INFO [task-runner-0-priority-0] org.apache.druid.indexer.JobHelper - Deleting path[var/druid/hadoop-tmp/sample/2019-02-03T112810.174Z_351fb928b8c4411d993fd996afc59467]
INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - Unregistering chat handler[index_hadoop_sample_2019-02-03T11:28:10.225Z]
INFO [task-runner-0-priority-0] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_sample_2019-02-03T11:28:10.225Z] status changed to [FAILED].
INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_hadoop_sample_2019-02-03T11:28:10.225Z",
  "status" : "FAILED",
  "duration" : 337797,
  "errorMsg" : "{}"
}

This is what top reports after the job has been running for a minute:

top - 03:36:51 up  9:59,  3 users,  load average: 4.78, 3.37, 1.35
Tasks: 209 total,   1 running, 208 sleeping,   0 stopped,   0 zombie
%Cpu0  : 49.0 us,  6.1 sy,  0.0 ni, 40.8 id,  4.1 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 47.2 us,  9.4 sy,  0.0 ni, 39.6 id,  1.9 wa,  0.0 hi,  1.9 si,  0.0 st
%Cpu2  :  6.8 us, 18.2 sy,  0.0 ni, 72.7 id,  2.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  6.1 us, 24.5 sy,  0.0 ni, 59.2 id,  4.1 wa,  0.0 hi,  6.1 si,  0.0 st
KiB Mem : 24672624 total,   889856 free, 19689536 used,  4093232 buff/cache
KiB Swap: 25163772 total, 25163772 free,        0 used.  4554632 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 69027 root      20   0 32.830g 5.788g  25616 S   0.0 24.6   2:26.66 java -server -Xms24g -Xmx24g -XX:MaxDirectMemorySize=8192m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=var/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -cp conf/druid/_common:conf/+
 69059 root      20   0 16.205g 5.570g  25436 S   0.0 23.7   1:39.58 java -server -Xms8g -Xmx8g -XX:MaxDirectMemorySize=8192m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=var/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -cp conf/druid/_common:conf/dr+
 20063 root      20   0 9873.9m 3.731g  28316 S 158.8 15.9   2:18.07 java -cp conf/druid/_common:conf/druid/middleManager:lib/compress-lzf-1.0.4.jar:lib/netty-codec-http-4.1.29.Final.jar:lib/commons-lang-2.6.jar:lib/json-smart-2.3.jar:lib/joni-2.1.11.jar:lib/commons-collections4-4.1.jar:lib+
 69017 root      20   0 6879752 989896  25676 S   0.0  4.0   2:05.48 java -server -Xms3g -Xmx3g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=var/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dderby.stream.error.file=var/druid/derby.log -cp conf/druid+
 69044 root      20   0 6866948 969064  25408 S   0.0  3.9   1:55.59 java -server -Xms3g -Xmx3g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=var/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -cp conf/druid/_common:conf/druid/overlord:lib/*: org.apache+
 10894 root      20   0 2756832 489916  23496 S  11.8  2.0   2:07.88 /usr/lib/jvm/java-8-oracle/bin/java -Dproc_namenode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop -Dhadoop.id.str=root -Dhadoop.root.+
 11044 root      20   0 2796276 411704  23452 S   7.8  1.7   1:46.49 /usr/lib/jvm/java-8-oracle/bin/java -Dproc_datanode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop -Dhadoop.id.str=root -Dhadoop.root.+
 11235 root      20   0 2718580 384356  23728 S   0.0  1.6   0:46.18 /usr/lib/jvm/java-8-oracle/bin/java -Dproc_secondarynamenode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop -Dhadoop.id.str=root -Dhad+
 69112 root      20   0 3654236 234740  25816 S   5.9  1.0   1:39.83 java -server -Xms64m -Xmx64m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=var/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -cp conf/druid/_common:conf/druid/middleManager:lib/*: org+
 10491 zookeep+  20   0 9043560 187464  16892 S   0.0  0.8   0:34.56 /usr/bin/java -cp /etc/zookeeper/conf:/usr/share/java/jline.jar:/usr/share/java/log4j-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/xmlParserAPIs.jar:/usr/share/java/netty.jar:/usr/share/java/slf4j-api.jar:/usr/sh+
  7176 mysql     20   0 1830532 187300  17880 S   0.0  0.8   0:23.27 /usr/sbin/mysqld

I've also noticed that the following returns a 405 Method Not Allowed:

$ sudo /opt/druid/bin/dsql
\d

Any ideas what might be wrong?

gianm commented 5 years ago

Hey @marklit, a couple of comments:

  1. For the dsql issue, you probably have SQL disabled; try setting druid.sql.enable = true in your common runtime properties.
  2. Your Hadoop task log is showing org.apache.hadoop.mapred.LocalJobRunner -- it is not using YARN, it's running in-process. The in-process runner is pretty inefficient and can be really slow. You probably need mapred-site.xml copied over too. It has a mapreduce.framework.name parameter that controls whether the local or yarn runner gets used.
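To check point 2: when no `mapred-site.xml` is on the classpath, Hadoop falls back to `mapreduce.framework.name=local`, which is exactly the `LocalJobRunner` showing up in the log. A quick sketch for inspecting the copied file (the path is assumed from the setup above):

```python
import os
import xml.etree.ElementTree as ET

def framework_name(mapred_site_path):
    """Return mapreduce.framework.name from a mapred-site.xml, or the
    Hadoop default 'local' if the file or the property is missing."""
    if not os.path.exists(mapred_site_path):
        return "local"
    root = ET.parse(mapred_site_path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name") == "mapreduce.framework.name":
            return prop.findtext("value")
    return "local"

# 'local' here means the in-process LocalJobRunner will be used;
# 'yarn' means jobs go to the YARN ResourceManager.
print(framework_name("/opt/druid/conf/druid/_common/mapred-site.xml"))
```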
stale[bot] commented 4 years ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.