Thanks for reporting this. Can I first get your testing code to further investigate the issue?
Hey @naveenatceg, this post on our discussion group might be insightful: https://www.facebook.com/groups/rocksdb.dev/permalink/604723469626171/. Since deletes are not actually deletes, but rather tombstones (which get cleaned up after a compaction), your iterator might have to skip a bunch of deletes to get to the first key in the database.
Can you profile your code and see where it's spending most of the time?
Hey guys, thanks for getting back. @igorcanadi It looks very similar to the issue we are observing. Since we run within the JVM, it's difficult to get a good profiling report. I will write a test program in C and get back to you guys.
@naveenatceg any updates? :)
Hey Igor, sorry for dropping the ball on this, been busy with other stuff :(. Here's the code I used to test:
package com.linkedin.samza.test.perf;

import org.apache.samza.config.Config;
import org.apache.samza.metrics.Counter;
import org.apache.samza.metrics.Gauge;
import org.apache.samza.storage.kv.Entry;
import org.apache.samza.storage.kv.KeyValueIterator;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class TestStoreAllPerformance implements StreamTask, InitableTask, WindowableTask {
  private static final String METRICS_GROUP = TestStoreAllPerformance.class.getCanonicalName();

  private KeyValueStore<Integer, Object> store = null;
  private int messageCount = 0;
  private int deletesPerWindow = 0;
  private int messagesToPutInStore = 0;
  private Gauge<Integer> itemsInStore = null;
  private Counter timeSpentInAll = null;
  private KeyValueIterator<Integer, Object> iterator = null;

  @SuppressWarnings("unchecked")
  @Override
  public void init(Config config, TaskContext context) throws Exception {
    deletesPerWindow = config.getInt("test.deletes.per.window", 8);
    messagesToPutInStore = config.getInt("test.messages.to.put.in.store", Integer.MAX_VALUE);
    System.err.println("The maximum messages allowed is:" + messagesToPutInStore);
    store = (KeyValueStore<Integer, Object>) context.getStore("test-store-all-calls");
    itemsInStore = context.getMetricsRegistry().newGauge(METRICS_GROUP, "items-in-store", 0);
    timeSpentInAll = context.getMetricsRegistry().newCounter(METRICS_GROUP, "time-spent-in-all");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) throws Exception {
    if (messageCount < messagesToPutInStore) {
      store.put(messageCount, envelope.getMessage());
      ++messageCount;
      itemsInStore.set(itemsInStore.getValue() + 1);
    }
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
    long startTime = System.currentTimeMillis();
    iterator = store.all();
    timeSpentInAll.inc(System.currentTimeMillis() - startTime);
    for (int i = 0; i < deletesPerWindow && iterator.hasNext(); ++i) {
      Entry<Integer, Object> row = iterator.next();
      store.delete(row.getKey());
      itemsInStore.set(itemsInStore.getValue() - 1);
    }
    iterator.close();
  }
}
The all() call does something very simple to open the iterator.
def all(): KeyValueIterator[Array[Byte], Array[Byte]] = {
  metrics.alls.inc
  val iter = db.newIterator()
  iter.seekToFirst()
  new RocksDbIterator(iter)
}
Yup, this is what's probably happening. You need to iterate over a bunch of deletions on SeekToFirst(). One way to confirm this is by utilizing our perf_context functionality: https://github.com/facebook/rocksdb/blob/master/include/rocksdb/perf_context.h
From C++:
rocksdb::SetPerfLevel(rocksdb::kEnableCount);
rocksdb::perf_context.Reset();
// do your thing -- SeekToFirst()
cout << rocksdb::perf_context.ToString();
You should notice that internal_delete_skipped_count is very high.
If your updates are always increasing, you can fix this by caching the smallest live key and then, instead of doing SeekToFirst(), just doing Seek(smallest_live_key).
I did something similar in mongo-rocks: https://github.com/mongodb-partners/mongo-rocks/blob/master/src/rocks_record_store.cpp#L418 (check out _oplogNextToDelete)
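For a JVM-based consumer, a rough sketch of that workaround against the RocksJava API might look like the following. The SmallestLiveKeyScanner class, the smallestLiveKey cache, and the big-endian integer key encoding are illustrative assumptions, not part of Samza or RocksDB:

import java.nio.ByteBuffer;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

public class SmallestLiveKeyScanner {
  // Cache of the smallest key that has not been deleted yet (assumption:
  // keys are integers written in big-endian order, so byte order == key order).
  private byte[] smallestLiveKey = null;

  public RocksIterator openAtFirstLiveKey(RocksDB db) {
    RocksIterator iter = db.newIterator();
    if (smallestLiveKey == null) {
      // First scan: fall back to SeekToFirst(), which may skip many tombstones.
      iter.seekToFirst();
    } else {
      // Later scans: jump directly past the keys we already deleted.
      iter.seek(smallestLiveKey);
    }
    return iter;
  }

  public void deleteAndAdvance(RocksDB db, RocksIterator iter) throws Exception {
    byte[] key = iter.key();
    db.delete(key);
    iter.next();
    // Remember the next live key so the next scan can start there.
    smallestLiveKey = iter.isValid() ? iter.key() : nextKey(key);
  }

  // If we deleted the last key, remember "last deleted key + 1" as the lower bound.
  private static byte[] nextKey(byte[] key) {
    int k = ByteBuffer.wrap(key).getInt();
    return ByteBuffer.allocate(4).putInt(k + 1).array();
  }
}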
Thank you for reporting this issue, and we appreciate your patience. We've notified the core team for an update on this issue. We're looking for a response within the next 30 days or the issue may be closed.
I recently found another issue in the test. One of our applications got stuck with long latency in this SeekToFirst() operation. I wrote a test program that opens the db and lists all the keys, and found that the latency of listing n keys vs. the latency of listing n+1 keys sometimes differs by orders of magnitude: in my test, listing 141 keys from an open iterator takes 64ms, while listing 142 keys from an open iterator takes about 4 seconds. It seems that the problem does not only exist in SeekToFirst(): Iterator.next() may hit the same problem if there are a large number of deleted records in between two keys. The only solution to this seems to be compaction. However, we noticed that there is no log indicating that the compaction thread has been active for about 6 days. Is this a known issue? Under which conditions will the compaction thread not actually compact the SST files? What would be the suggested solution?
Thanks a lot!
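For reference, a standalone version of that listing test might look roughly like this, assuming a recent RocksJava version (the store path and key count come from the command line; the class name is illustrative):

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

public class ListFirstNKeys {
  public static void main(String[] args) throws Exception {
    RocksDB.loadLibrary();
    String path = args[0];              // path to the existing store
    int n = Integer.parseInt(args[1]);  // how many keys to list
    try (Options options = new Options();
         RocksDB db = RocksDB.openReadOnly(options, path);
         RocksIterator iter = db.newIterator()) {
      long start = System.currentTimeMillis();
      iter.seekToFirst();
      int listed = 0;
      while (iter.isValid() && listed < n) {
        iter.key();    // touch key and value so the work is not skipped
        iter.value();
        iter.next();
        listed++;
      }
      System.out.println("Listed " + listed + " keys in "
          + (System.currentTimeMillis() - start) + " ms");
    }
  }
}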
If there are no writes, the compaction will not happen. You can manually issue a compaction by calling the CompactRange() method.
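From Java, for example, a minimal sketch of a manual full-range compaction with RocksJava (the path argument and class name are illustrative):

import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class ManualCompaction {
  public static void main(String[] args) throws Exception {
    RocksDB.loadLibrary();
    try (Options options = new Options().setCreateIfMissing(false);
         RocksDB db = RocksDB.open(options, args[0])) {
      // Compact the entire key range; this cleans up tombstones that the
      // background compactions have not removed yet.
      db.compactRange();
    }
  }
}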
@igorcanadi thanks for the quick response. I have the following additional questions: 1) How do we tune the compaction threshold/trigger? I.e., is there a way to force a compaction if the size-ratio condition is met but there are only reads and no writes? The reason: if compaction does not happen in this condition, read performance will continue to be bad. 2) If we have to manually trigger compaction, is the CompactRange() method blocking/synchronous or non-blocking/asynchronous?
Lastly, a newbie question: for the following snapshot of a RocksDB instance, how do I detect the sorted runs here? Are the .sst files with smaller numbers older runs or newer runs?
{code}
[yipan@ela4-app0956 Partition 23]$ ls -lrt
total 3388844
-rw-r--r-- 1 app app          0 Nov  5 23:53 LOCK
-rw-r--r-- 1 app app         37 Nov  5 23:53 IDENTITY
-rw-r--r-- 1 app app          0 Nov  5 23:53 000003.log
-rw-r--r-- 1 app app         16 Nov  5 23:53 CURRENT
-rw-r--r-- 1 app app 3440456743 Nov  6 11:53 001591.sst
-rw-r--r-- 1 app app   10478037 Nov  6 13:03 001592.sst
-rw-r--r-- 1 app app   10672441 Nov  6 14:14 001593.sst
-rw-r--r-- 1 app app     247978 Nov  6 14:14 MANIFEST-000005
-rw-r--r-- 1 app app    8290332 Nov  6 14:14 LOG
{code}
Thanks a lot in advance!
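One way to see how files are distributed across levels, rather than guessing from the file numbers, is to query the DB properties ("rocksdb.levelstats" and "rocksdb.stats"); a rough sketch with RocksJava, where the path argument and class name are illustrative:

import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class LevelStats {
  public static void main(String[] args) throws Exception {
    RocksDB.loadLibrary();
    try (Options options = new Options();
         RocksDB db = RocksDB.openReadOnly(options, args[0])) {
      // "rocksdb.levelstats" lists the number of files and total size per level;
      // "rocksdb.stats" also includes compaction statistics.
      System.out.println(db.getProperty("rocksdb.levelstats"));
      System.out.println(db.getProperty("rocksdb.stats"));
    }
  }
}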
Closing this via automation due to lack of activity. If discussion is still needed here, please re-open or create a new/updated issue.
Hey all, I am using a RocksDB iterator to seek to the beginning and then perform some deletes (around 32 deletes/second). The behavior we are noticing is that the seek call takes about a second to return and takes up almost all of the CPU.
We observed the iostats and do not see any heavy IO being performed. We keep doing this operation repeatedly (for testing purposes); initially the seek doesn't take that long (about 100ms), but it slowly creeps up and stays at 1s for a long time. Once it hits this 1s threshold, RocksDB doesn't print any logs after that (nothing for 8 hours sometimes).
The first graph is the number of key-value pairs and the second graph is the time it takes to seek to the beginning.
Can someone please explain this behavior? Why would it take the iterator that long to seek to the beginning? Why do the logs stop appearing?
Configuration:
The last few minutes before RocksDB stops printing logs: