GoogleCloudPlatform / pubsub

This repository contains open-source projects managed by the owners of Google Cloud Pub/Sub.
Apache License 2.0

java.lang.OutOfMemoryError: unable to create new native thread #256

Closed mehdihasan closed 3 years ago

mehdihasan commented 3 years ago

Hi,

We have a Kafka installation that sources data from and sinks data to GCP Pub/Sub. We were using an older version of the CPS connector and recently decided to move to the latest one.

So we downloaded your latest alpha release of the CPS connector, i.e. v0.5-alpha.

We are facing an issue of increased thread consumption by the connector. Within 12 hours, it consumes 1K threads, whereas the old version never consumed more than 30-60 threads over its lifetime.

Finally, after consuming all the available threads, it stops functioning, along with the other Kafka Connect components. Once it has consumed all the available threads, we start getting the following error:

2021-01-19 09:19:48,669 ERROR WorkerSourceTask{id=CPSSourceConnectorV0-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask) [task-thread-CPSSourceConnectorV0-0]
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
    at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603)
    at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334)
    at java.util.concurrent.ScheduledThreadPoolExecutor.scheduleAtFixedRate(ScheduledThreadPoolExecutor.java:573)
    at com.google.api.gax.rpc.Watchdog.start(Watchdog.java:93)
    at com.google.api.gax.rpc.Watchdog.create(Watchdog.java:81)
    at com.google.api.gax.rpc.InstantiatingWatchdogProvider.getWatchdog(InstantiatingWatchdogProvider.java:113)
    at com.google.api.gax.rpc.ClientContext.create(ClientContext.java:188)
    at com.google.cloud.pubsub.v1.stub.GrpcSubscriberStub.create(GrpcSubscriberStub.java:272)
    at com.google.pubsub.kafka.source.CloudPubSubGRPCSubscriber.makeSubscriber(CloudPubSubGRPCSubscriber.java:78)
    at com.google.pubsub.kafka.source.CloudPubSubGRPCSubscriber.pull(CloudPubSubGRPCSubscriber.java:54)
    at com.google.pubsub.kafka.source.CloudPubSubRoundRobinSubscriber.pull(CloudPubSubRoundRobinSubscriber.java:46)
    at com.google.pubsub.kafka.source.CloudPubSubSourceTask.poll(CloudPubSubSourceTask.java:155)
    at org.apache.kafka.connect.runtime.WorkerSourceTask.poll(WorkerSourceTask.java:265)
    at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:232)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2021-01-19 09:19:48,669 ERROR WorkerSourceTask{id=CPSSourceConnectorV0-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask) [task-thread-CPSSourceConnectorV0-0]
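
A note on reading the trace above: the OutOfMemoryError is raised while CloudPubSubGRPCSubscriber.pull calls makeSubscriber, which creates a new GrpcSubscriberStub, and that in turn starts a Watchdog on a ScheduledThreadPoolExecutor. Whether the previous stub and its executor are ever shut down is not visible from the trace. The sketch below is not the connector's code. It uses only java.util.concurrent, with hypothetical names (ExecutorLeakSketch, newWatchdogLikeExecutor), to show how creating a scheduled executor per call without shutting it down can accumulate one thread per call, while reusing a single executor does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names come from the connector.
class ExecutorLeakSketch {
    // Counts every thread the sketch's executors ask the factory to create.
    static final AtomicInteger threadsCreated = new AtomicInteger();

    // Stand-in for "create a client that schedules a periodic watchdog task".
    static ScheduledExecutorService newWatchdogLikeExecutor() {
        ThreadFactory factory = runnable -> {
            threadsCreated.incrementAndGet();
            Thread thread = new Thread(runnable, "watchdog-sketch");
            thread.setDaemon(true); // never keeps the JVM alive
            return thread;
        };
        ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor(factory);
        // Scheduling a periodic task starts the executor's worker thread.
        executor.scheduleAtFixedRate(() -> { }, 10, 10, TimeUnit.SECONDS);
        return executor;
    }

    // Pattern A: a new executor on every pull, none shut down while pulling.
    static int threadsCreatedByPerPullExecutors(int pulls) {
        int before = threadsCreated.get();
        List<ScheduledExecutorService> neverClosed = new ArrayList<>();
        for (int i = 0; i < pulls; i++) {
            neverClosed.add(newWatchdogLikeExecutor());
        }
        int created = threadsCreated.get() - before;
        // Clean up so the sketch itself does not leak threads.
        neverClosed.forEach(ScheduledExecutorService::shutdownNow);
        return created;
    }

    // Pattern B: one executor created up front and reused by every pull.
    static int threadsCreatedBySharedExecutor(int pulls) {
        int before = threadsCreated.get();
        ScheduledExecutorService shared = newWatchdogLikeExecutor();
        for (int i = 0; i < pulls; i++) {
            shared.submit(() -> { }); // reuses the existing worker thread
        }
        int created = threadsCreated.get() - before;
        shared.shutdownNow();
        return created;
    }

    public static void main(String[] args) {
        System.out.println("per-pull executors: " + threadsCreatedByPerPullExecutors(5));
        System.out.println("shared executor:    " + threadsCreatedBySharedExecutor(5));
    }
}
```

With five pulls, the per-pull variant creates five threads and the shared variant creates one. In a long-running task that polls continuously, the first pattern would grow without bound unless each executor is shut down.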

eazhilan-nagarajan commented 3 years ago

Hello team,

Can someone help with an update on this issue? Has anyone else faced similar issues running the connector in Kubernetes?

kamalaboulhosn commented 3 years ago

@mehdihasan Do you know which version you were using before?

mehdihasan commented 3 years ago

@kamalaboulhosn thanks for your reply. Previously, we cloned the code from the master branch sometime in early 2020.

mehdihasan commented 3 years ago

Hello @kamalaboulhosn / Guys,

  1. Were you able to reproduce the issue?
  2. Is there any plan for a fix?

mehdihasan commented 3 years ago

Hi Again,

Which version of the connector would you suggest using in a production environment?

jinoobaek-qz commented 3 years ago

We are running into a similar issue: we see failures (due to large payloads, which isn't this connector's fault), so we have automated restart cron jobs for the connector tasks. Eventually, though, it runs out of memory. This might not be exactly the same issue, but we see messages such as:

Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d [209800.712s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d [2021-03-22 04:08:14,525] INFO [Producer clientId=connector-producer-pubsubSource-db-quizletWeb-qTermSave-6] Closing the Kafka producer with timeoutMillis = 0 ms. (org.apache.kafka.clients.producer.KafkaProducer:1182)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d [2021-03-22 04:08:14,525] ERROR Failed to start task pubsubSource-<redacted>-6 (org.apache.kafka.connect.runtime.Worker:472)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d org.apache.kafka.common.KafkaException: Failed to construct kafka producer
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:434)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:270)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.connect.runtime.Worker.buildWorkerTask(Worker.java:523)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:467)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:1186)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1600(DistributedHerder.java:127)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.connect.runtime.distributed.DistributedHerder$11.call(DistributedHerder.java:950)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.connect.runtime.distributed.DistributedHerder$11.call(DistributedHerder.java:931)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:353)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:293)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at java.base/java.lang.Thread.run(Thread.java:834)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at java.base/java.lang.Thread.start0(Native Method)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at java.base/java.lang.Thread.start(Thread.java:803)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:426)
Mar 22 04:08:14 92deaaaa-6188-45c9-b4cc-5dc55b8e707d    ... 14 more

We are using v0.5-alpha connector cluster 5.0.0.

mehdihasan commented 3 years ago

Thanks @jinoobaek-qz. We were seeing much the same thing and had captured it in a Prometheus/Grafana dashboard.

[Image: live_thread]

You can see that the live thread count goes beyond 4000. After it crossed that point, we started getting "Caused by: java.lang.OutOfMemoryError: unable to create native thread". In the normal scenario (in fact, with the previous version of the connector), the live thread count never goes beyond 70-80.
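
For anyone who wants to check the live thread count without a dashboard, here is a minimal sketch using the JDK's standard ThreadMXBean. The class name LiveThreadProbe and the threshold parameter are made up for illustration, and the probe only sees threads of the JVM it runs in, so it would have to run inside the Connect worker (or be read remotely over JMX) to be useful.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; not part of the connector or of Kafka Connect.
class LiveThreadProbe {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    // Current number of live threads (daemon and non-daemon) in this JVM.
    static int liveThreads() {
        return THREADS.getThreadCount();
    }

    // Highest live thread count seen since JVM start or the last peak reset.
    static int peakThreads() {
        return THREADS.getPeakThreadCount();
    }

    // True when the live count is above a caller-chosen threshold.
    static boolean exceeds(int threshold) {
        return liveThreads() > threshold;
    }

    public static void main(String[] args) {
        System.out.println("live=" + liveThreads() + " peak=" + peakThreads());
    }
}
```

Given the numbers in this thread (70-80 threads in normal operation, trouble beyond 4000), a threshold of a few hundred passed to exceeds would probably flag the growth long before the worker fails, but the right value depends on the deployment.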

mehdihasan commented 3 years ago

I am not facing this issue with version v0.8 Alpha.