confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io

Error deleting obsolete state directory #3327

Open afilimonov opened 5 years ago

afilimonov commented 5 years ago

We periodically see errors when an obsolete state directory is deleted:

[2019-09-11 18:21:35,328] INFO stream-thread [_confluent-ksql-devops_ksql_prod_query_CTAS_DEVOPS_CLICKSTREAM_SESSIONS_3-ed9abb78-08b9-4563-8bf6-93d0b56efac1-CleanupThread] Deleting obsolete state directory 0_0 for task 0_0 as 600328ms has elapsed (cleanup delay is 600000ms). (org.apache.kafka.streams.processor.internals.StateDirectory:287)
[2019-09-11 18:21:35,342] ERROR stream-thread [_confluent-ksql-devops_ksql_prod_query_CTAS_DEVOPS_CLICKSTREAM_SESSIONS_3-ed9abb78-08b9-4563-8bf6-93d0b56efac1-CleanupThread] Failed to delete the state directory. (org.apache.kafka.streams.processor.internals.StateDirectory:311)
java.nio.file.DirectoryNotEmptyException: /var/lib/ksql/_confluent-ksql-devops_ksql_prod_query_CTAS_DEVOPS_CLICKSTREAM_SESSIONS_3/0_0
    at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242)
    at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
    at java.nio.file.Files.delete(Files.java:1126)
    at org.apache.kafka.common.utils.Utils$2.postVisitDirectory(Utils.java:769)
    at org.apache.kafka.common.utils.Utils$2.postVisitDirectory(Utils.java:752)
    at java.nio.file.Files.walkFileTree(Files.java:2688)
    at java.nio.file.Files.walkFileTree(Files.java:2742)
    at org.apache.kafka.common.utils.Utils.delete(Utils.java:752)
    at org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:301)
    at org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:264)
    at org.apache.kafka.streams.KafkaStreams.lambda$start$1(KafkaStreams.java:802)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

A similar issue occurs when terminating a running query: it fails with java.nio.file.DirectoryNotEmptyException.

KSQL version: 5.3.0
Environment: Kubernetes/Docker using the Confluent-provided image
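
For readers following the stack trace above: the exception is thrown from `Files.delete` inside a `postVisitDirectory` callback, that is, when a directory is removed after its contents have been visited. The sketch below is not the Kafka implementation. It is a minimal illustration of that post-order delete pattern using only standard `java.nio.file` calls, and it also prints whatever is still in the directory when the delete fails, which may help identify the leftover entries. The class and method names (`StateDirCleanupSketch`, `deleteRecursively`, `listEntries`) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Kafka Streams or KSQL.
public class StateDirCleanupSketch {

    // Deletes root recursively: files first, then each directory after its children.
    public static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                try {
                    Files.delete(dir);
                } catch (DirectoryNotEmptyException e) {
                    // Something is still present in dir even though every visited file
                    // was deleted; report the remaining entries before rethrowing.
                    System.err.println("Entries left in " + dir + ": " + listEntries(dir));
                    throw e;
                }
                return FileVisitResult.CONTINUE;
            }
        });
    }

    // Returns the names of the direct children of dir.
    public static List<String> listEntries(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                names.add(entry.getFileName().toString());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory tree shaped loosely like a task directory.
        Path root = Files.createTempDirectory("state-dir-sketch");
        Path store = Files.createDirectories(root.resolve("0_0").resolve("store"));
        Files.write(store.resolve("data.bin"), new byte[] {1, 2, 3});

        deleteRecursively(root);
        System.out.println("exists after delete: " + Files.exists(root));
    }
}
```

On a local filesystem the throwaway tree in `main` should be removed cleanly. Pointing `deleteRecursively` at a copy of the affected directory on the problematic volume would show which entries remain when the delete fails.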

big-andy-coates commented 4 years ago

Thanks for reporting. We're looking into it.

afilimonov commented 4 years ago

Update: the issue seems to be related to dangling file locks on a NAS-based PVC. After switching to iSCSI-based storage, the issue goes away.

apurvam commented 4 years ago

Thanks for the update, @afilimonov. What's the NAS protocol?

afilimonov commented 4 years ago

NFS. We use Trident ONTAP-NAS k8s driver.

echang0929 commented 4 years ago

I have the same issue, and my state dir storage is based on IBM COS.

[2020-07-27 02:47:28,557] INFO stream-thread [_confluent-ksql-dswes-ccpqaquery_CTAS_TRACKING_BATCH_PIPELINE_COLL_0-91f314ff-d130-4edf-86cd-5e96c29a988a-CleanupThread] Deleting obsolete state directory 0_9 for task 0_9 as 2389556ms has elapsed (cleanup delay is 600000ms). (org.apache.kafka.streams.processor.internals.StateDirectory:287)
[2020-07-27 02:47:28,698] ERROR stream-thread [_confluent-ksql-dswes-ccpqaquery_CTAS_TRACKING_BATCH_PIPELINE_COLL_0-91f314ff-d130-4edf-86cd-5e96c29a988a-CleanupThread] Failed to delete the state directory. (org.apache.kafka.streams.processor.internals.StateDirectory:311)
java.nio.file.DirectoryNotEmptyException: /tmp/kafka-streams/_confluent-ksql-dswes-ccpqaquery_CTAS_TRACKING_BATCH_PIPELINE_COLL_0/0_9
    at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242)
    at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
    at java.nio.file.Files.delete(Files.java:1126)
    at org.apache.kafka.common.utils.Utils$2.postVisitDirectory(Utils.java:769)
    at org.apache.kafka.common.utils.Utils$2.postVisitDirectory(Utils.java:752)
    at java.nio.file.Files.walkFileTree(Files.java:2688)
    at java.nio.file.Files.walkFileTree(Files.java:2742)
    at org.apache.kafka.common.utils.Utils.delete(Utils.java:752)
    at org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:301)
    at org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:264)
    at org.apache.kafka.streams.KafkaStreams.lambda$start$1(KafkaStreams.java:802)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)