apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Historicals have issues connecting to S3 #6500

Closed Stephan3555 closed 5 years ago

Stephan3555 commented 5 years ago

Hi,

For a few days now, our Historical nodes on AWS have been having intermittent problems connecting to S3. The data is still written to S3, but from time to time the following error occurs (I replaced sensitive information with <>):

Error message:

```
Failed on try 1, retrying in 739ms. org.jets3t.service.ServiceException: Request Error: <BUCKETNAME>.s3.eu-central-1.amazonaws.com: Name or service not known
	at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:625) ~[jets3t-0.9.4.jar:0.9.4]
	at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:279) ~[jets3t-0.9.4.jar:0.9.4]
	at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1052) ~[jets3t-0.9.4.jar:0.9.4]
	at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2264) ~[jets3t-0.9.4.jar:0.9.4]
	at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2193) ~[jets3t-0.9.4.jar:0.9.4]
	at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120) ~[jets3t-0.9.4.jar:0.9.4]
	at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575) ~[jets3t-0.9.4.jar:0.9.4]
	at io.druid.storage.s3.S3Utils.isObjectInBucket(S3Utils.java:96) ~[?:?]
	at io.druid.storage.s3.S3DataSegmentPuller$4.call(S3DataSegmentPuller.java:318) ~[?:?]
	at io.druid.storage.s3.S3DataSegmentPuller$4.call(S3DataSegmentPuller.java:314) ~[?:?]
	at io.druid.java.util.common.RetryUtils.retry(RetryUtils.java:63) [java-util-0.12.3.jar:0.12.3]
	at io.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81) [java-util-0.12.3.jar:0.12.3]
	at io.druid.storage.s3.S3Utils.retryS3Operation(S3Utils.java:89) [druid-s3-extensions-0.12.3.jar:0.12.3]
	at io.druid.storage.s3.S3DataSegmentPuller.isObjectInBucket(S3DataSegmentPuller.java:312) [druid-s3-extensions-0.12.3.jar:0.12.3]
	at io.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:176) [druid-s3-extensions-0.12.3.jar:0.12.3]
	at io.druid.storage.s3.S3LoadSpec.loadSegment(S3LoadSpec.java:60) [druid-s3-extensions-0.12.3.jar:0.12.3]
	at io.druid.segment.loading.SegmentLoaderLocalCacheManager.loadInLocation(SegmentLoaderLocalCacheManager.java:205) [druid-server-0.12.3.jar:0.12.3]
	at io.druid.segment.loading.SegmentLoaderLocalCacheManager.loadInLocationWithStartMarker(SegmentLoaderLocalCacheManager.java:193) [druid-server-0.12.3.jar:0.12.3]
	at io.druid.segment.loading.SegmentLoaderLocalCacheManager.loadSegmentWithRetry(SegmentLoaderLocalCacheManager.java:151) [druid-server-0.12.3.jar:0.12.3]
	at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegmentFiles(SegmentLoaderLocalCacheManager.java:133) [druid-server-0.12.3.jar:0.12.3]
	at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:108) [druid-server-0.12.3.jar:0.12.3]
	at io.druid.server.SegmentManager.getAdapter(SegmentManager.java:196) [druid-server-0.12.3.jar:0.12.3]
	at io.druid.server.SegmentManager.loadSegment(SegmentManager.java:157) [druid-server-0.12.3.jar:0.12.3]
	at io.druid.server.coordination.SegmentLoadDropHandler.loadSegment(SegmentLoadDropHandler.java:261) [druid-server-0.12.3.jar:0.12.3]
	at io.druid.server.coordination.SegmentLoadDropHandler.addSegment(SegmentLoadDropHandler.java:307) [druid-server-0.12.3.jar:0.12.3]
	at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:47) [druid-server-0.12.3.jar:0.12.3]
	at io.druid.server.coordination.ZkCoordinator$1.childEvent(ZkCoordinator.java:118) [druid-server-0.12.3.jar:0.12.3]
	at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:520) [curator-recipes-4.0.0.jar:4.0.0]
	at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:514) [curator-recipes-4.0.0.jar:4.0.0]
	at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-4.0.0.jar:4.0.0]
	at org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:296) [curator-client-4.0.0.jar:?]
	at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-4.0.0.jar:4.0.0]
	at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:512) [curator-recipes-4.0.0.jar:4.0.0]
	at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-4.0.0.jar:4.0.0]
	at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:771) [curator-recipes-4.0.0.jar:4.0.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.net.UnknownHostException: <BUCKETNAME>.s3.eu-central-1.amazonaws.com: Name or service not known
	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.8.0_181]
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928) ~[?:1.8.0_181]
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323) ~[?:1.8.0_181]
	at java.net.InetAddress.getAllByName0(InetAddress.java:1276) ~[?:1.8.0_181]
	at java.net.InetAddress.getAllByName(InetAddress.java:1192) ~[?:1.8.0_181]
	at java.net.InetAddress.getAllByName(InetAddress.java:1126) ~[?:1.8.0_181]
	at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45) ~[httpclient-4.5.1.jar:4.5.1]
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.resolveHostname(DefaultClientConnectionOperator.java:259) ~[httpclient-4.5.1.jar:4.5.1]
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:159) ~[httpclient-4.5.1.jar:4.5.1]
	at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144) ~[httpclient-4.5.1.jar:4.5.1]
	at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:131) ~[httpclient-4.5.1.jar:4.5.1]
	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611) ~[httpclient-4.5.1.jar:4.5.1]
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446) ~[httpclient-4.5.1.jar:4.5.1]
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882) ~[httpclient-4.5.1.jar:4.5.1]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[httpclient-4.5.1.jar:4.5.1]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[httpclient-4.5.1.jar:4.5.1]
	at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:328) ~[jets3t-0.9.4.jar:0.9.4]
	... 41 more
```
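The "Failed on try 1, retrying in 739ms" line comes from Druid's `RetryUtils`, which wraps S3 operations in randomized exponential backoff, so a transient DNS or network hiccup only logs and retries instead of failing the segment load outright. A minimal sketch of that pattern (the method names and the 500ms base delay below are illustrative, not Druid's actual API):

```java
import java.util.Random;
import java.util.concurrent.Callable;

public class RetrySketch {
    private static final Random RANDOM = new Random();

    // Randomized exponential backoff: roughly base * 2^(attempt-1), with ~10% jitter.
    static long nextRetrySleepMillis(int attempt, long baseSleepMillis) {
        double fuzz = 1.0 + RANDOM.nextGaussian() / 10.0;
        long sleep = (long) (baseSleepMillis * fuzz * Math.pow(2, attempt - 1));
        return Math.max(0, sleep);
    }

    // Retry a task up to maxTries times, sleeping with backoff between attempts.
    static <T> T retry(Callable<T> task, int maxTries) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (attempt >= maxTries) {
                    throw e;  // give up after the last try
                }
                long sleep = nextRetrySleepMillis(attempt, 500);
                System.out.printf("Failed on try %d, retrying in %dms.%n", attempt, sleep);
                Thread.sleep(sleep);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate an S3 call that fails twice with a transient error, then succeeds.
        int[] calls = {0};
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("transient S3 error");
            }
            return "segment downloaded";
        }, 10);
        System.out.println(result);
    }
}
```

This is why the cluster kept working overall: each failed lookup was eventually retried successfully, at the cost of flooding the logs.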

We are currently using Druid 0.12.3 with the following configuration:

common.runtime.properties:

```properties
# Extensions
druid.extensions.loadList=["druid-kafka-indexing-service", "druid-histogram", "druid-datasketches", "druid-lookups-cached-global", "postgresql-metadata-storage", "druid-s3-extensions", "druid-avro-extensions", "graphite-emitter"]

# Zookeeper
druid.zk.service.host=<ZK_HOST>
druid.zk.paths.base=/druid

# Metadata storage
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://<POSTGRES_URL>:5432/druid
druid.metadata.storage.connector.user=<USER>
druid.metadata.storage.connector.password=<PASSWORD>

# Deep storage
druid.storage.type=s3
druid.storage.bucket=<BUCKET>
druid.storage.baseKey=segments
druid.s3.accessKey=<ACCESS_KEY>
druid.s3.secretKey=<SECRET_KEY>

# Logging
druid.startup.logging.logProperties=true
druid.indexer.logs.type=noop

# Service discovery
druid.selectors.indexing.serviceName=druid/overlord
druid.selectors.coordinator.serviceName=druid/coordinator

# Monitoring
druid.monitoring.monitors=["io.druid.java.util.metrics.JvmMonitor"]
druid.emitter=graphite
druid.emitter.logging.logLevel=info
druid.emitter.graphite.hostname=<GRAPHITE_HOST>
druid.emitter.graphite.port=9109
druid.emitter.graphite.eventConverter={"type":"all", "namespacePrefix": "druid"}
druid.emitter.graphite.protocol=plaintext

# Caching
druid.cache.type=caffeine
druid.cache.sizeInBytes=1073741824

# Storage type of double columns
druid.indexing.doubleStorage=double

# Misc
druid.javascript.enabled=true
druid.sql.enable=true

# Maximum amount of heap space to use for the string dictionary during merging (broker, historical, middlemanager)
druid.query.groupBy.maxMergingDictionarySize=250000000
druid.query.groupBy.maxOnDiskStorage=4294967296
```

jets3t.properties:

```properties
s3service.s3-endpoint=s3.eu-central-1.amazonaws.com
storage-service.request-signature-version=AWS4-HMAC-SHA256
```
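For anyone hitting similar symptoms under heavy concurrent segment loading: the same jets3t.properties file also controls the HTTP connection pool. If I understand the jets3t configuration correctly, `httpclient.max-connections` sets the pool size (the library default is 20); the value below is only an illustrative example, not a recommendation:

```properties
# Enlarge the jets3t HTTP connection pool (library default is 20).
# 100 is an example value; tune it to your segment-load concurrency.
httpclient.max-connections=100
```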

Does anybody else experience this behavior? I'll gladly provide more information/configuration to help solve this issue.

Thanks, Stephan

patelh commented 5 years ago

Sounds like an AWS error, not really a druid one, no?

patelh commented 5 years ago

Does it retry downloading the file?

Stephan3555 commented 5 years ago

Hi patelh, thanks for your quick response. It could be an AWS error; that's exactly why I posted here, in case others are experiencing it as well. Regarding your second question: after a while the download/upload to S3 works again, so S3 seems to be only temporarily unreachable.

We are not sure what the reason behind this is. It could also be a network problem with our Rancher installation, or the jets3t library may no longer be fully compatible with AWS S3 (jets3t 0.9.4 is from August 2015 and has never been updated since).

But it's strange that it only started a few days ago. Overall the Druid cluster is working, but the Historical logs get flooded with these error messages.

Stephan3555 commented 5 years ago

We found the cause of our problem: the number of segments in our cluster. We had around 21k segments distributed across three Historicals. After some debugging we found out that the problem was the connection pool size:

[screenshot of the connection pool debugging omitted]

We merged the incredibly small segments into larger ones, and the error has now disappeared.
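For reference, one way to merge many tiny segments in this Druid version is a compaction task submitted to the overlord. This is only a sketch: the datasource placeholder and the interval are examples, and the exact spec fields should be checked against the documentation for your Druid release:

```json
{
  "type": "compact",
  "dataSource": "<DATASOURCE>",
  "interval": "2018-01-01/2018-11-01"
}
```

The task JSON is POSTed to the overlord's task endpoint (`/druid/indexer/v1/task`); compacting an interval rewrites its many small segments as fewer, larger ones, which in turn reduces the number of S3 fetches the Historicals must make when loading.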