VladRodionov / bigbase

BigBase - read optimized, fully HBase-compatible, NoSQL Data Store
GNU Affero General Public License v3.0

Block cache: L3 FATAL no space left on device #50

Closed — VladRodionov closed this issue 10 years ago

VladRodionov commented 10 years ago

We need special treatment for this error:

DISK STORAGE=137686247146 ITEMS IN STORE=15840571
14/05/25 12:31:08 FATAL storage.FileExtStorage: java.io.IOException: No space left on device
java.io.IOException: No space left on device
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:51)
    at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
    at com.koda.integ.hbase.storage.FileExtStorage$FileFlusher.run(FileExtStorage.java:303)
14/05/25 12:31:08 FATAL storage.FileExtStorage: file-flusher thread died.
14/05/25 12:31:08 FATAL storage.FileExtStorage: java.io.IOException: No space left on device
java.io.IOException: No space left on device
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:51)
    at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
    at com.koda.integ.hbase.storage.FileExtStorage$FileFlusher.run(FileExtStorage.java:303)
14/05/25 12:31:08 FATAL storage.FileExtStorage: file-flusher thread died.

A couple of notes:

  1. I have no idea why this happened. The device raw size is 148G, and I allocated 140G for the cache. df -h shows 100% utilization, the direct total byte count is 138G, and du -s reports 129G. Is this a Linux quirk?
VladRodionov commented 10 years ago

There is no way to mitigate this issue. We cannot retry the file write operation because we do not know how much of the last (failed) write actually completed: it might have succeeded partially, leaving corrupted data on disk.

The recommendation: do not over-allocate the cache partition.

VladRodionov commented 10 years ago

Added a check on the cache partition: if usable space drops below (1 - storageHighWatermark) * totalPartitionSpace, we purge the oldest file from the cache.

Configuration parameters are: storage.recycler.high.watermark (default 0.98) and storage.recycler.low.watermark (default 0.97).
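A minimal sketch of how such a check could look, assuming the low watermark is the level we recover to once recycling starts; the class and method names here (StorageRecyclerSketch, purgeOldestFile) are illustrative, not the actual FileExtStorage API:

import java.io.File;

// Illustrative sketch only, not the actual FileExtStorage code.
public class StorageRecyclerSketch {

    // Corresponds to storage.recycler.high.watermark (default 0.98)
    private final double highWatermark = 0.98;
    // Corresponds to storage.recycler.low.watermark (default 0.97)
    private final double lowWatermark = 0.97;

    // Purge the oldest cache files once usable space drops below
    // (1 - highWatermark) * total, and keep purging until it is back
    // above (1 - lowWatermark) * total.
    void maybeRecycle(File cachePartition) {
        long total = cachePartition.getTotalSpace();
        long usable = cachePartition.getUsableSpace();
        if (usable < (1.0 - highWatermark) * total) {
            while (usable < (1.0 - lowWatermark) * total && purgeOldestFile()) {
                usable = cachePartition.getUsableSpace();
            }
        }
    }

    // Hypothetical helper: deletes the oldest cache file and returns
    // false when there is nothing left to delete.
    private boolean purgeOldestFile() {
        return false;
    }
}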

To increase usable partition space one can use the tune2fs command, as in the example below.
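For instance, assuming an ext2/3/4 filesystem, lowering the reserved-block percentage (blocks reserved for root by default) frees space for the cache; the device name here is just the one from the df output further down this thread:

sudo tune2fs -m 1 /dev/md0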

VladRodionov commented 10 years ago

Found the explanation for the discrepancy between df -h /cache and du -s /cache

[ec2-user@ip-10-146-230-202 test]$ df -h /cache
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        148G  142G  3.5G  98% /cache
[ec2-user@ip-10-146-230-202 test]$ du -s -h /cache
du: cannot read directory ‘/cache/lost+found’: Permission denied
97G /cache

http://stackoverflow.com/questions/4424461/du-skh-in-returns-vastly-different-size-from-df-on-centos-5-5


The most common cause of this effect is open files that have been deleted.
The kernel will only free the disk blocks of a deleted file if it is not in use at the time of its deletion. Otherwise that is deferred until the file is closed, or the system is rebooted.

Therefore, before deleting an old cache file we must guarantee that there are no open file descriptors on it: close the file first, then delete it (a sketch of this ordering follows the lsof check below). To check whether there are deleted-but-still-open files hanging around:

sudo lsof 2>/dev/null | grep deleted
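A minimal sketch of the close-then-delete ordering described above; CacheFileEvictor and evict are illustrative names, not the actual BigBase code:

import java.io.File;
import java.io.IOException;
import java.nio.channels.FileChannel;

// Illustrative sketch only: close the open channel before deleting the file,
// otherwise the kernel keeps the blocks allocated until the descriptor goes away.
final class CacheFileEvictor {

    static void evict(FileChannel channel, File cacheFile) throws IOException {
        if (channel != null && channel.isOpen()) {
            channel.close(); // release the file descriptor first
        }
        if (!cacheFile.delete()) {
            throw new IOException("Failed to delete " + cacheFile);
        }
    }
}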
VladRodionov commented 10 years ago

Finally, did it right. It took some time to figure out the reason.

VladRodionov commented 10 years ago

Final 24h drive test on AWS showed no partition space leakage. Fixed.