divolte / divolte-collector

Divolte Collector
https://divolte.io/
Apache License 2.0

No documentation on S3 sink setup #272

Open · tatianafrank opened this issue 5 years ago

tatianafrank commented 5 years ago

The documentation says you can use S3 as a file sink but gives zero details on how to do so. There is one line linking somewhere else but the link is broken. These are the docs: http://divolte-releases.s3-website-eu-west-1.amazonaws.com/divolte-collector/0.9.0/userdoc/html/configuration.html and this is the broken link: https://wiki.apache.org/hadoop/AmazonS3

friso commented 5 years ago

Divolte doesn't treat S3 any differently from HDFS. This means you can use the HDFS client's built-in support to access S3 buckets of a particular layout.

Divolte currently ships with Hadoop 3.2.0, so the relevant updated link on AWS integration (including using S3 filesystems) is here: https://hadoop.apache.org/docs/r2.9.2/hadoop-aws/tools/hadoop-aws/index.html

Note that there are now three different S3 client implementations in Hadoop, which all use different layouts on S3. If your aim is to use Divolte just for collection and then work with the Avro files on S3 using tools other than Hadoop, s3n or s3a is probably what you want. s3n has been available for a while, whereas s3a is still under development but aims to be a drop-in replacement for s3n down the line. s3a is mostly aimed at use cases at substantial scale, involving large files that can become a performance problem for s3n.
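To make that concrete, here is a rough sketch of how an s3a-backed sink could look in divolte-collector.conf. It is assembled from the configuration keys that come up later in this thread; the bucket name and credentials are placeholders, and the exact sink/file_strategy layout should be checked against the 0.9 configuration docs linked above.

```
// Sketch only: HDFS sink writing to S3 via the s3a:// scheme.
divolte {
  global {
    hdfs {
      enabled = true
      client {
        // Hadoop client properties; the fs.s3a.* keys are standard hadoop-aws settings.
        fs.defaultFS = "s3a://avro-bucket"
        fs.s3a.access.key = "ACCESS_KEY"
        fs.s3a.secret.key = "SECRET_KEY"
      }
    }
  }
  sinks {
    hdfs {
      type = hdfs
      file_strategy {
        // Both directories live in the bucket, not on the local filesystem.
        working_dir = "s3a://avro-bucket/tmp/s3working"
        publish_dir = "s3a://avro-bucket/tmp/s3publish"
      }
    }
  }
}
```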

tatianafrank commented 5 years ago

OK, I'm using s3a with the following config:

```
client {
  fs.DEFAULT_FS = "https://s3.us.cloud-object-storage.appdomain.cloud"
  fs.defaultFS = "https://s3.us.cloud-object-storage.appdomain.cloud"
  fs.s3a.bucket.BUCKET_NAME.access.key = ""
  fs.s3a.bucket.BUCKET_NAME.secret.key = ""
  fs.s3a.bucket.BUCKET_NAME.endpoint = "https://s3.us.cloud-object-storage.appdomain.cloud"
}
```

But I'm getting the following error even though I do have a /tmp/working directory: "Path for in-flight AVRO records is not a directory: /tmp/working". So I'm guessing it's not properly connecting to S3, since the directory DOES exist. Is something wrong with my config? My S3 provider is not AWS but another cloud provider, so the URL structure is a little different. Am I supposed to set fs.defaultFS to the S3 URL? Where do I set the bucket?

tatianafrank commented 5 years ago

I changed my settings to the below and tried s3a, s3n, and s3, all with the same error: "org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a"" (or "s3n" or "s3").

```
client {
  fs.DEFAULT_FS = "s3a://BUCKET-NAME"
  fs.defaultFS = "s3a://BUCKET-NAME"
  fs.s3a.access.key = ""
  fs.s3a.secret.key = ""
  fs.s3a.endpoint = "https://s3.us.cloud-object-storage.appdomain.cloud"
}
```

krisgeus commented 5 years ago

The libraries might not be shipped with Divolte, and you need some additional settings:

```
fs.s3a.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
```

You also need the S3 client libraries on the classpath; depending on your version of Hadoop, something like:

- http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
- http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
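For reference, a shell sketch of pulling those jars into the collector's lib directory. The /opt/divolte/divolte-collector/lib path comes from the Docker image mentioned below, repo1.maven.org is the current home of Maven Central, and the versions shown are just the ones linked above; pick versions that match the Hadoop build Divolte ships with.

```sh
# Sketch: drop the Hadoop S3 client jars next to Divolte's own libraries.
cd /opt/divolte/divolte-collector/lib
curl -fLO https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
curl -fLO https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
```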

krisgeus commented 5 years ago

I did a quick check with the docker divolte image. This is what was needed:

Libraries downloaded and put into /opt/divolte/divolte-collector/lib:

- http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
- http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.1/hadoop-aws-3.1.1.jar

Config:

```
client {
  fs.DEFAULT_FS = "s3a://avro-bucket"
  fs.defaultFS = "s3a://avro-bucket"
  fs.s3a.access.key = foo
  fs.s3a.secret.key = bar
  fs.s3a.endpoint = "s3-server:4563"
  fs.s3a.path.style.access = true
  fs.s3a.connection.ssl.enabled = false
  fs.s3a.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
  fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
}
```

Enable HDFS through env vars in docker-compose:

```
DIVOLTE_HDFS_ENABLED: "true"
DIVOLTE_HDFS_SINK_WORKING_DIR: "s3a://avro-bucket/tmp/s3working"
DIVOLTE_HDFS_SINK_PUBLISH_DIR: "s3a://avro-bucket/tmp/s3publish"
```

s3-server is a localstack docker container which mimics s3.
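For anyone wiring this up from scratch, here is a docker-compose sketch of the setup described above. Image names, ports, paths and the bucket name are assumptions based on this thread, not an official recipe.

```yaml
# Sketch only: a Localstack "s3-server" next to Divolte, with the HDFS sink
# pointed at an s3a:// bucket. Adjust images, ports and paths to your setup.
version: "3"
services:
  s3-server:
    image: localstack/localstack
    environment:
      SERVICES: s3
    ports:
      # Publish the S3 endpoint so it is also reachable from the host; make
      # sure the container-side port matches what your Localstack version
      # actually listens on, and that fs.s3a.endpoint points at it.
      - "4563:4563"
  divolte:
    image: divolte/divolte   # assumed image name; use the Divolte image you run
    depends_on:
      - s3-server
    environment:
      DIVOLTE_HDFS_ENABLED: "true"
      DIVOLTE_HDFS_SINK_WORKING_DIR: "s3a://avro-bucket/tmp/s3working"
      DIVOLTE_HDFS_SINK_PUBLISH_DIR: "s3a://avro-bucket/tmp/s3publish"
    volumes:
      # The client { ... } block above still has to live in divolte-collector.conf;
      # the conf path inside the image may differ.
      - ./divolte-collector.conf:/opt/divolte/divolte-collector/conf/divolte-collector.conf
    ports:
      - "8290:8290"
```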

krisgeus commented 5 years ago

Oh, and make sure the bucket is available and the tmp/s3working and tmp/s3publish keys are present. (A directory-exists check is done, so adding a file to the bucket with the correct key prefix fools the HDFS client.)
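For example, with the AWS CLI (the endpoint and bucket name are the ones from the Localstack setup above; drop --endpoint-url when talking to real S3):

```sh
# Create the bucket, then put empty placeholder objects under the working and
# publish prefixes so the directory-exists check passes.
aws --endpoint-url http://localhost:4563 s3 mb s3://avro-bucket
aws --endpoint-url http://localhost:4563 s3api put-object \
    --bucket avro-bucket --key tmp/s3working/
aws --endpoint-url http://localhost:4563 s3api put-object \
    --bucket avro-bucket --key tmp/s3publish/
```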

tatianafrank commented 5 years ago

Thanks for looking into this @krisgeus. I'm just a little confused about something: I'm trying to use S3 instead of HDFS, so why do I need HDFS to be running for this to work?

tatianafrank commented 5 years ago

I did everything you listed above and it's not working. I got an error about a missing hadoop.tmp.dir var, so I added that, and now there's no error, but there are no files being added to S3 either. Since there's no error I'm not sure what the issue is.

krisgeus commented 5 years ago

Sorry for the late response (Holiday season). Without an error I cannot help you out either. With the steps provided above I managed to create a working example based on the divolte docker image.

rakzcs commented 2 years ago

> I did a quick check with the docker divolte image. This is what was needed:
>
> Libraries downloaded and put into /opt/divolte/divolte-collector/lib:
>
> - http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
> - http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.1/hadoop-aws-3.1.1.jar
>
> Config:
>
> ```
> client {
>   fs.DEFAULT_FS = "s3a://avro-bucket"
>   fs.defaultFS = "s3a://avro-bucket"
>   fs.s3a.access.key = foo
>   fs.s3a.secret.key = bar
>   fs.s3a.endpoint = "s3-server:4563"
>   fs.s3a.path.style.access = true
>   fs.s3a.connection.ssl.enabled = false
>   fs.s3a.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
>   fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
> }
> ```
>
> Enable HDFS through env vars in docker-compose:
>
> ```
> DIVOLTE_HDFS_ENABLED: "true"
> DIVOLTE_HDFS_SINK_WORKING_DIR: "s3a://avro-bucket/tmp/s3working"
> DIVOLTE_HDFS_SINK_PUBLISH_DIR: "s3a://avro-bucket/tmp/s3publish"
> ```
>
> s3-server is a localstack docker container which mimics s3.

Been trying this, but I keep getting an "Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found" error. Do I need to install the complete Hadoop application as well, or am I missing something else?

Edit: it seems the libraries are very particular about the versions you use. Solution: https://hadoop.apache.org/docs/r3.3.1/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html
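In other words (a hedged sketch, not from the thread): the hadoop-aws jar has to match the Hadoop version already bundled with Divolte, and the aws-java-sdk-bundle has to match the version that hadoop-aws was built against, so it helps to check what is on the classpath before downloading anything:

```sh
# Check which Hadoop version Divolte ships with (lib path taken from the Docker
# image used earlier in this thread; adjust for your install).
ls /opt/divolte/divolte-collector/lib/ | grep hadoop-common
# e.g. hadoop-common-3.2.0.jar -> use hadoop-aws-3.2.0, plus the
# aws-java-sdk-bundle version listed as its dependency on Maven Central.
```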