apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Exception thrown in DatasourceInputFormat when trying to find location of splits #3306

Closed. itaiy closed this issue 5 years ago.

itaiy commented 8 years ago

We have Druid 0.9.1.1 installed on EC2, and we're ingesting data in batch via Hadoop on EMR. Our files are in Parquet format and located on S3. We're also using thetaSketch aggregation. We have druid-s3-extensions, druid-datasketches, druid-avro-extensions, and druid-parquet-extensions enabled.

The indexing task finished successfully, but we see the following exception in its log:

2016-08-01T08:00:00,930 ERROR [task-runner-0-priority-0] io.druid.indexer.hadoop.DatasourceInputFormat - Exception thrown finding location of splits
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2016-07-20T00:00:00.000Z_2016-07-21T00:00:00.000Z
    at org.apache.hadoop.fs.Path.initialize(Path.java:206) ~[hadoop-common-2.3.0.jar:?]
    at org.apache.hadoop.fs.Path.<init>(Path.java:172) ~[hadoop-common-2.3.0.jar:?]
    at org.apache.hadoop.fs.Path.<init>(Path.java:94) ~[hadoop-common-2.3.0.jar:?]
    at org.apache.hadoop.fs.Globber.glob(Globber.java:201) ~[hadoop-common-2.3.0.jar:?]
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1643) ~[hadoop-common-2.3.0.jar:?]
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:222) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
    at io.druid.indexer.hadoop.DatasourceInputFormat.getFrequentLocations(DatasourceInputFormat.java:188) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexer.hadoop.DatasourceInputFormat.toDataSourceSplit(DatasourceInputFormat.java:171) [druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexer.hadoop.DatasourceInputFormat.getSplits(DatasourceInputFormat.java:122) [druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115) [hadoop-mapreduce-client-core-2.3.0.jar:?]
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:493) [hadoop-mapreduce-client-core-2.3.0.jar:?]
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:510) [hadoop-mapreduce-client-core-2.3.0.jar:?]
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394) [hadoop-mapreduce-client-core-2.3.0.jar:?]
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285) [hadoop-mapreduce-client-core-2.3.0.jar:?]
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282) [hadoop-mapreduce-client-core-2.3.0.jar:?]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.7.0_101]
    at javax.security.auth.Subject.doAs(Subject.java:415) [?:1.7.0_101]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) [hadoop-common-2.3.0.jar:?]
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282) [hadoop-mapreduce-client-core-2.3.0.jar:?]
    at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:199) [druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexer.JobHelper.runJobs(JobHelper.java:323) [druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) [druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.7.0_101]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[?:1.7.0_101]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.7.0_101]
    at java.lang.reflect.Method.invoke(Method.java:606) ~[?:1.7.0_101]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:208) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_101]
    at java.lang.Thread.run(Thread.java:745) [?:1.7.0_101]
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2016-07-20T00:00:00.000Z_2016-07-21T00:00:00.000Z
    at java.net.URI.checkPath(URI.java:1804) ~[?:1.7.0_101]
    at java.net.URI.<init>(URI.java:752) ~[?:1.7.0_101]
    at org.apache.hadoop.fs.Path.initialize(Path.java:203) ~[hadoop-common-2.3.0.jar:?]
    ... 35 more

The task request is as follows:

{  
   "type":"index_hadoop",
   "spec":{  
      "ioConfig":{  
         "type":"hadoop",
         "inputSpec":{  
            "type" : "multi",
            "children": [
                {
                    "type" : "dataSource",
                    "ingestionSpec" : {
                      "dataSource": "target_test7",
                      "intervals": ["2016-07-02/2016-08-02"]
                    }
                },
                {
                    "type":"static",
                    "inputFormat":"io.druid.data.input.parquet.DruidParquetInputFormat",
                    "paths":"s3n://my-bucket/my-key/2016-07-20-10-49-09"
                }
            ]
         }
      },
      "dataSchema":{  
         "dataSource":"target_test7",
         "parser":{  
            "type":"parquet",
            "parseSpec":{  
               "format":"timeAndDims",
               "timestampSpec":{  
                  "column":"timestamp",
                  "format":"auto"
               },
               "dimensionsSpec":{  
                  "dimensions":[  
                     "segment"
                  ]
               }
            }
         },
         "metricsSpec":[  
            {
                "type" : "thetaSketch",
                "name" : "user_id_sketch",
                "fieldName" : "uid",
                "size" : 65536
            }
         ],
         "granularitySpec":{  
            "type":"uniform",
            "segmentGranularity":"day",
            "queryGranularity":"NONE",
            "intervals":[  
               "2016-07-02/2016-08-02"
            ]
         }
      },
      "tuningConfig":{  
         "type":"hadoop",
         "partitionsSpec":{  
            "type": "hashed",
            "targetPartitionSize" : -1,
            "numShards" : 1
         },
         "jobProperties":{  
            "fs.s3a.connection.maximum": "50",
            "fs.s3.awsAccessKeyId" : "XXXXXXX",
            "fs.s3.awsSecretAccessKey" : "YYYYYYYY",
            "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
            "fs.s3n.awsAccessKeyId" : "XXXXXXX",
            "fs.s3n.awsSecretAccessKey" : "YYYYYYYY",
            "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
            "mapred.input.fileinputformat.input.dir.recursive": "true"
         },
         "leaveIntermediate":true
      }
   }
}
iNikem commented 8 years ago

Similar issue here. Any suggestions?

Saurabh111191 commented 7 years ago

The same issue is occurring for me.

itaiy commented 7 years ago

Update: we have the same issue with CSV input as well, but there it doesn't cause the indexing to fail. As for Parquet input, it's possible that some of the fixes in the next version (0.9.2) will help overcome the Parquet indexing issues (e.g. #3179).

navis commented 7 years ago

@iNikem @Saurabh111191 @itaiy Could you check whether it still throws the exception with #3544 applied? I'm not using S3 and cannot test that.

itaiy commented 7 years ago

@iNikem @Saurabh111191 - FYI: we've managed to index Parquet files with Druid 0.9.2-RC1 (we still get the "Exception thrown finding location of splits" error, but indexing works).

@navis - Thanks for #3544! As for re-checking it: I believe it was not backported to the version we're using (0.9.2-RC1), right?

navis commented 7 years ago

@itaiy If it's confirmed to be fixed, we can include this in RC2 :)

sidnakoppa commented 7 years ago

Hi, is the issue fixed? I am facing the same problem while delta ingesting on Druid 0.9.1.1. When I went through the logs I found that the error is thrown while reading the segments, even though the path value in the retrieved segment JSON is absolute. My inputSpec is:

"inputSpec" : {
   "type" : "dataSource",
   "ingestionSpec" : {
      "dataSource": "count",
      "intervals": ["2015-09-16/2015-09-17"]
   }
}

I tried both local and HDFS, with both CSV and JSON; same issue.

erikdubbelboer commented 7 years ago

The issue is the ':' character in your segment path (in 2016-07-20T00:00:00.000Z_2016-07-21T00:00:00.000Z). I'm not sure what you used to generate the segments; normally Druid doesn't use ':' for segments stored in HDFS.
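
For reference, the failure can be reproduced outside Druid. The sketch below is not Druid or Hadoop source; it only mimics how Hadoop's Path(String) constructor appears to handle a directory name like 2016-07-20T00:00:00.000Z_2016-07-21T00:00:00.000Z: with no '/' before the first ':', the text before the colon is taken as a URI scheme, and java.net.URI then rejects the remainder because an absolute URI may not carry a relative path. The class name ColonPathRepro is made up for the illustration.

import java.net.URI;
import java.net.URISyntaxException;

// Minimal sketch (not Druid/Hadoop code): why a segment directory named after
// an interval triggers the "Relative path in absolute URI" error in the trace above.
public class ColonPathRepro {
    public static void main(String[] args) {
        String segmentDir = "2016-07-20T00:00:00.000Z_2016-07-21T00:00:00.000Z";

        // With hadoop-common on the classpath, new org.apache.hadoop.fs.Path(segmentDir)
        // should fail the same way, wrapping the URISyntaxException in an
        // IllegalArgumentException (left commented out to keep this sketch dependency-free):
        // new org.apache.hadoop.fs.Path(segmentDir);

        // Mimic the scheme parsing: no '/' occurs before the first ':', so the
        // text before the colon is treated as a URI scheme.
        int colon = segmentDir.indexOf(':');
        String scheme = segmentDir.substring(0, colon);   // "2016-07-20T00"
        String rest = segmentDir.substring(colon + 1);    // "00:00.000Z_2016-07-21T00:00:00.000Z"

        try {
            // A URI with a scheme must have an absolute path (leading '/'),
            // so this constructor rejects the relative remainder.
            new URI(scheme, null, rest, null, null);
        } catch (URISyntaxException e) {
            // Prints: Relative path in absolute URI: 2016-07-20T00:00:00.000Z_2016-07-21T00:00:00.000Z
            System.out.println(e.getMessage());
        }
    }
}

This is consistent with the comment above: HDFS segment paths normally avoid ':', while the S3 segment paths here still contain the raw interval, which is exactly the string shown in the exception message.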

stale[bot] commented 5 years ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.