apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0
13.53k stars 3.71k forks source link

Realtime cleanup may not work if there are exceptions during cleanup #1082

Closed fjy closed 5 years ago

fjy commented 9 years ago

T0aD in IRC has reported a strange issue with segments not being handed off. I believe the following may have occurred:

1) A segment was successfully handed off but something went wrong during the cleanup process (we don't have the logs, but it appears the directory was not cleaned up) 2) The realtime node was restarted. The node found the directory of the segment exists, and announced it was serving that segment. Subsequently, the segment was not removed.

T0aD commented 9 years ago

Here are the logs of one of the segment that was handed off but not deleting on disk:

2015-01-29 14:41:40,334 [myid:] - INFO [supercell-2015-01-29T13:00:00.000Z-persist-n-merge:Logger@70] - Uploading [/var/druid-data/supercell/2015-01- 29T13:00:00.000Z_2015-01-29T14:00:00.000Z/merged] to S3 2015-01-29 14:41:40,334 [myid:] - INFO [supercell-2015-01-29T13:00:00.000Z-persist-n-merge:Logger@70] - Adding file[/var/druid-data/supercell/2015-01 -29T13:00:00.000Z_2015-01-29T14:00:00.000Z/merged/version.bin] with size[4]. Total size so far[0] 2015-01-29 14:41:40,335 [myid:] - INFO [supercell-2015-01-29T13:00:00.000Z-persist-n-merge:Logger@70] - Adding file[/var/druid-data/supercell/2015-01 -29T13:00:00.000Z_2015-01-29T14:00:00.000Z/merged/00000.smoosh] with size[242,823,209]. Total size so far[4] 2015-01-29 14:41:42,950 [myid:] - INFO [HttpPostEmitter-1-0:ChannelResourceFactory@60] - Generating: http://logstash.vigiglo.be:8093 2015-01-29 14:41:47,963 [myid:] - INFO [HttpPostEmitter-1-0:ChannelResourceFactory@60] - Generating: http://logstash.vigiglo.be:8093 2015-01-29 14:41:52,976 [myid:] - INFO [HttpPostEmitter-1-0:ChannelResourceFactory@60] - Generating: http://logstash.vigiglo.be:8093 2015-01-29 14:41:54,860 [myid:] - INFO [supercell-2015-01-29T13:00:00.000Z-persist-n-merge:Logger@70] - Adding file[/var/druid-data/supercell/2015-01 -29T13:00:00.000Z_2015-01-29T14:00:00.000Z/merged/meta.smoosh] with size[13,362]. Total size so far[242,823,213] 2015-01-29 14:41:55,140 [myid:] - INFO [supercell-2015-01-29T13:00:00.000Z-persist-n-merge:Logger@70] - Pushing S3Object [key=supercell/2015-01-29T13 :00:00.000Z_2015-01-29T14:00:00.000Z/2015-01-29T13:00:00.000Z/4/index.zip, bucket=XXXXX, lastModified=null, dataInputStream=null, Metadata={x-amz -acl=bucket-owner-full-control, Content-Length=89411661, Content-MD5=Scs48IZ5ZkhE2khVFaBLFA==, md5-hash=49cb38f08679664844da485515a04b14, Content-Type =application/zip}]. 2015-01-29 14:41:57,989 [myid:] - INFO [HttpPostEmitter-1-0:ChannelResourceFactory@60] - Generating: http://logstash.vigiglo.be:8093 2015-01-29 14:41:58,605 [myid:] - INFO [supercell-2015-01-29T13:00:00.000Z-persist-n-merge:Logger@70] - Pushing S3Object [key=supercell/2015-01-29T13 :00:00.000Z_2015-01-29T14:00:00.000Z/2015-01-29T13:00:00.000Z/4/descriptor.json, bucket=prod.druid, lastModified=null, dataInputStream=null, Metadata= {x-amz-acl=bucket-owner-full-control, Content-Length=7472, Content-MD5=XXXX, md5-hash=XXXX, Content-Ty pe=application/octet-stream}] 2015-01-29 14:41:58,665 [myid:] - INFO [supercell-2015-01-29T13:00:00.000Z-persist-n-merge:Logger@70] - Deleting zipped index File[/tmp/druid62342780 90180139006index.zip] 2015-01-29 14:41:58,684 [myid:] - INFO [supercell-2015-01-29T13:00:00.000Z-persist-n-merge:Logger@70] - Deleting descriptor file[/tmp/druid1834983339 966338370descriptor.json]

du -sh /var/druid-data/supercell/*

0 /var/druid-data/supercell/2015-01-28T19:00:00.000Z_2015-01-28T20:00:00.000Z 0 /var/druid-data/supercell/2015-01-29T08:00:00.000Z_2015-01-29T09:00:00.000Z 483M /var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z 497M /var/druid-data/supercell/2015-01-29T15:00:00.000Z_2015-01-29T16:00:00.000Z 2,3G /var/druid-data/supercell/2015-01-29T17:00:00.000Z_2015-01-29T18:00:00.000Z 876M /var/druid-data/supercell/2015-01-29T18:00:00.000Z_2015-01-29T19:00:00.000Z [...]

T0aD commented 9 years ago

I removed the empty / corrupted segment directories and relaunched the realtime node:

2015-02-05 12:22:53,056 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/0] 2015-02-05 12:22:53,285 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/1] 2015-02-05 12:22:53,507 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/2] 2015-02-05 12:22:53,746 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/3] 2015-02-05 12:22:53,786 [myid:] - INFO [ServerInventoryView-0:Logger@70] - CHILD_ADDED[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_4] with version[0] 2015-02-05 12:22:54,006 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/4] 2015-02-05 12:22:54,059 [myid:] - INFO [ServerInventoryView-0:Logger@70] - CHILD_ADDED[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_3] with version[0] 2015-02-05 12:22:54,136 [myid:] - INFO [ServerInventoryView-0:Logger@70] - CHILD_ADDED[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_3] with version[0] 2015-02-05 12:22:54,137 [myid:] - INFO [ServerInventoryView-0:Logger@70] - CHILD_ADDED[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_4] with version[0] 2015-02-05 12:22:54,190 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/5] 2015-02-05 12:22:54,366 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/6] 2015-02-05 12:22:54,483 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/7] 2015-02-05 12:22:54,643 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/8] 2015-02-05 12:22:54,828 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/9] 2015-02-05 12:22:55,009 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/10] 2015-02-05 12:22:55,193 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/11] 2015-02-05 12:22:55,402 [myid:] - INFO [chief-supercell:Logger@70] - Loading previously persisted segment at [/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/12] 2015-02-05 12:22:55,660 [myid:] - INFO [chief-supercell:Logger@70] - Announcing segment[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_4] to path[/druid/servedSegments/druid4.vigi:8083/supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_4] 2015-02-05 12:22:55,674 [myid:] - INFO [chief-supercell:Logger@70] - Announcing segment[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_4] at path[/druid/segments/druid4.vigi:8083/2015-02-05T12:22:55.673Z0] 2015-02-05 12:22:55,678 [myid:] - INFO [ServerInventoryView-0:Logger@70] - CHILD_ADDED[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_4] with version[0]

No cleanup of directory...

S3 state:

$ ./s3cmd ls s3://prod.druid/supercell/ | grep 2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z DIR s3://prod.druid/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/ $ ./s3cmd du s3://prod.druid/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/ 197519582 s3://prod.druid/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/

On the historicals:

ansible -i ./prod.hosts -m shell historical -a "find /var/druid-historical/ -name '2015-01-29T13*'" druid5 | success | rc=0 >> /var/druid-historical/indexCache/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z /var/druid-historical/indexCache/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/2015-01-29T13:00:00.000Z

druid6 | success | rc=0 >>

druid8 | success | rc=0 >> /var/druid-historical/indexCache/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z /var/druid-historical/indexCache/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/2015-01-29T13:00:00.000Z

druid7 | success | rc=0 >> /var/druid-historical/indexCache/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z /var/druid-historical/indexCache/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/2015-01-29T13:00:00.000Z

fjy commented 9 years ago

@T0aD

Can you look in the coordinator logs to see if a segment that was not removed was announced there?

I suspect it is given it shows up in our coordinator console, which leads me to believe there is something wrong with your realtime setup.

T0aD commented 9 years ago

On cortex1: (master coordinator)

root@cortex1:/proc/31578/cwd# grep 2015-01-29T13:00:00.000Z_2015-01-29T14 ./stdout | awk -F' - ' '{print $3}' | grep ^Server Server[druid8.vigi:8081] added segment[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_3] Server[druid8.vigi:8081] added segment[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_4] Server[druid4.vigi:8083] added segment[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_4] Server[druid5.vigi:8081] added segment[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00.000Z_3] Server[druid5.vigi:8081] added segment[supercell_2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z_2015-01-29T13:00:00. 000Z_4]

On druid4 (realtime node):

/var/druid-data/supercell/2015-01-29T13:00:00.000Z_2015-01-29T14:00:00.000Z/isPushedMarker

So it seems it was pushed but not cleaned up, and already stored in 2 historical nodes, I probably should just remove those segments .. ?

T0aD commented 9 years ago

Since I don't have much time to investigate on the issue, I wrote a little helper script to move already pushed segments, please tell me if it sounds ok to you, thanks.

https://gist.github.com/T0aD/9bf39f521c361321cd39

vogievetsky commented 5 years ago

Closing as the realtime node was removed