divolte / divolte-collector

Divolte Collector
https://divolte.io/
Apache License 2.0

Divolte not writing to HDFS #271

Open tatianafrank opened 5 years ago

tatianafrank commented 5 years ago

My Kafka sink is working but my HDFS sink is not. I'm using HDFS 2.0, so that might be why? I've got Divolte running in a Docker container and a Hadoop cluster running in the same docker-compose network, which I got from https://github.com/big-data-europe/docker-hadoop

Here are the relevant parts of my divolte-collector.conf (some parts stripped for brevity):

hdfs {
      enabled = true
      enabled = ${?DIVOLTE_HDFS_ENABLED}
      threads = 2
      buffer_size = 1048576

      client {
        fs.DEFAULT_FS = "hdfs://localhost:9870"
      }
    }

mappings {
    my_mapping = {
      schema_file = "/opt/divolte/divolte-collector/conf/DivolteRecord.avsc"
      mapping_script_file = "/opt/divolte/divolte-collector/conf/mapping.groovy"
      sources = [browser]
      sinks = [divolte_kafka_sink, divolte_hdfs_sink]
    }
  }

  sinks {
    divolte_hdfs_sink = {
      type = hdfs
      file_strategy {
        sync_file_after_records = 1000
        sync_file_after_records = ${?DIVOLTE_HDFS_SINK_SYNC_NR_OF_RECORDS}
        sync_file_after_duration = 30 minutes
        sync_file_after_duration = ${?DIVOLTE_HDFS_SINK_SYNC_DURATION}
        working_dir = /tmp/working
        working_dir = ${?DIVOLTE_HDFS_SINK_WORKING_DIR}
        publish_dir = /tmp/processed
        publish_dir = ${?DIVOLTE_HDFS_SINK_PUBLISH_DIR}
      }
    }
  }

For fs.DEFAULT_FS, I've tried hdfs://localhost:9870 and hdfs://namenode:9870 (namenode is the name of the HDFS namenode container running in the same Docker network).
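For reference, a hedged sketch of what the client block could look like instead. Two assumptions worth checking here: Hadoop's property is spelled fs.defaultFS (keys are case-sensitive, so fs.DEFAULT_FS would not be recognized by Hadoop's Configuration), and 9870 is typically the namenode's web UI port, not the filesystem RPC port. The host name and port below are illustrative; the authoritative values are in the cluster's core-site.xml.

```hocon
hdfs {
  client {
    // Hadoop's canonical property name is fs.defaultFS (case-sensitive).
    // The port must be the namenode RPC port from core-site.xml (commonly
    // 8020 or 9000), not the 9870 HTTP UI port.
    fs.defaultFS = "hdfs://namenode:9000"
  }
}
```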

friso commented 5 years ago

Can you be a bit more specific about not working? Do you see any errors?

tatianafrank commented 5 years ago

Here is the error:

    [main] WARN  [NativeCodeLoader]: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    divolte | Exception in thread "main" 2019-07-29 15:20:24.908Z [main] ERROR [HdfsFileManager]: Could not initialize HDFS filesystem or failed to check for existence of publish and / or working directories..
    divolte | org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "hdfs"
    divolte |   at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3332)

tatianafrank commented 5 years ago

Then I added fs.file.impl = "org.apache.hadoop.fs.LocalFileSystem" and fs.hdfs.impl = "org.apache.hadoop.hdfs.DistributedFileSystem" to my hdfs configuration in Divolte, and now I'm getting a different error:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(Ljava/lang/String;)Ljava/net/InetSocketAddress;
    divolte |   at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:99)
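Written out as a config fragment, the workaround described in this comment would look like the sketch below (the class names come from the comment itself; placing them in Divolte's hdfs.client block is an assumption based on where the other Hadoop properties live in this thread's config):

```hocon
hdfs {
  client {
    // Explicitly pinning the FileSystem implementation classes bypasses the
    // service-loader scheme lookup that produced the earlier
    // "No FileSystem for scheme: hdfs" error.
    fs.file.impl = "org.apache.hadoop.fs.LocalFileSystem"
    fs.hdfs.impl = "org.apache.hadoop.hdfs.DistributedFileSystem"
  }
}
```

Note that this only changes how the FileSystem classes are located; the NoSuchMethodError that follows suggests mismatched Hadoop artifact versions on the classpath, which a config change alone cannot fix.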

tatianafrank commented 5 years ago

According to this thread (https://stackoverflow.com/questions/45460909/accessing-hdfs-in-java-throws-error), there is an issue with dependency versions in Divolte, but I'm not sure how to change that in Divolte...

JulienSerouart commented 5 years ago

This pull request may help: https://github.com/divolte/divolte-collector/pull/244