StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

HashScan page size ignored, returns all elements #729

Closed Salgat closed 6 years ago

Salgat commented 7 years ago

For the method

IEnumerable<HashEntry> HashScan(RedisKey key, RedisValue pattern = default(RedisValue), int pageSize = 10, long cursor = 0, int pageOffset = 0, CommandFlags flags = CommandFlags.None);

The page size is ignored. In my tests, regardless of the size, it returns all elements for the given hash key (which defeats the point of scan for me, since I can just use HashGetAllAsync(key)). This prevents me from iterating through results in batches (since I have hashes spanning hundreds of thousands of entries that won't all fit in memory).

I ran the same command in redis-cli (HSCAN with COUNT 10) and for 5000 entries in the hash I get back ~20 entries per response. Same goes for the default call without count specified. It seems almost like StackExchange.Redis' HashScan internally iterates through the entire hash before returning a result.

huanbd commented 6 years ago

Hi, does anyone have a solution? I have the same problem.

Salgat commented 6 years ago

Hey @huanbd, at least for now I'm using HashKeysAsync(...) to get all keys for a hash then iterating through each key with HashGetAsync(...). It's not optimal but at least it works for the time being until the batch get is fixed. :(

mgravell commented 6 years ago

The page size impacts what happens behind the scenes - how much data is buffered per iteration. It is not the intent of the parameter to restrict the quantity of data - you might use Take() for that.

To see the effect of this parameter would require redis-cli monitor. It is mostly an advanced feature.

Salgat commented 6 years ago

See the documentation: https://redis.io/commands/scan

The point of hashscan is to iterate the collection in manageable batches. Even if it does use batches behind the scenes, I can't fit millions of entries into my application's memory. Look at the documentation related to IScanningCursor returned by the HashScan operation in this library, which is to be used to continue the hash scan (just like in the documentation I linked). The problem is that the cursor always returns 0, defeating the entire point of using IScanningCursor.

mgravell commented 6 years ago

You should be able to query additional details from the cursor while iterating. This is primarily intended for resuming long operations from a known-good position. Since IEnumerable<T> is non-buffered, you are not required to fit everything in memory: you can just foreach over it without a backing list/array.

mgravell commented 6 years ago

Note that you will need to consume enough data to cause it to load another page before the cursor updates.
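
The behavior described in the last two comments (items are streamed, and a new page is fetched only once the previous one has been consumed) can be illustrated without a Redis server. The following is a minimal sketch, not the library's implementation: `FakeScanSource`, `ReadPage`, and `LazyScan.Scan` are hypothetical names that do not exist in StackExchange.Redis, and the offset-style cursor is a simplification (real Redis cursors are opaque values).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a server-side scan: each call returns one page
// of items plus the cursor to pass to the next call (0 means "finished").
public sealed class FakeScanSource
{
    private readonly int _total;

    // Counts how many pages have been requested so far.
    public int RoundTrips { get; private set; }

    public FakeScanSource(int total) { _total = total; }

    public (List<int> Items, long NextCursor) ReadPage(long cursor, int pageSize)
    {
        RoundTrips++;
        int start = (int)cursor;
        int end = Math.Min(start + pageSize, _total);
        var items = Enumerable.Range(start, end - start).ToList();
        return (items, end >= _total ? 0 : end);
    }
}

public static class LazyScan
{
    // Yields items one at a time; the next page is requested only after the
    // caller has consumed every item of the current page.
    public static IEnumerable<int> Scan(FakeScanSource source, int pageSize)
    {
        long cursor = 0;
        do
        {
            var (items, next) = source.ReadPage(cursor, pageSize);
            foreach (var item in items) yield return item;
            cursor = next;
        } while (cursor != 0);
    }
}

public static class Program
{
    public static void Main()
    {
        var source = new FakeScanSource(total: 1000);

        // Take(25) ends the enumeration early, so only 3 pages of 10 are read
        // and at most one page is held in memory at a time.
        var first = LazyScan.Scan(source, pageSize: 10).Take(25).ToList();

        Console.WriteLine($"{first.Count} items, {source.RoundTrips} round trips");
        // prints "25 items, 3 round trips"
    }
}
```

In this sketch the page size changes only the number of round trips, never the number of items the caller sees, which matches the description of `pageSize` given earlier in the thread.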

Salgat commented 6 years ago

When I ran it I was getting ((IScanningCursor)response).Cursor == 0 even with the default count size and 5000 entries in the hash, even though running it manually through redis-cli gave ~20 per call. This was before even touching/iterating the returned IEnumerable result. At the very least, unit tests need to be added to confirm whether it is operating correctly.

NickCraver commented 6 years ago

This is behaving correctly, but it took a bit to track down. The main thing here is that there is a threshold at which this kicks in: specifically, 512 items. Unless the hash has over 512 items (by default), you will only get 1 cursor back.

From the Redis documentation:

When iterating Sets encoded as intsets (small sets composed of just integers), or Hashes and Sorted Sets encoded as ziplists (small hashes and sets composed of small individual values), usually all the elements are returned in the first SCAN call regardless of the COUNT value.

The actual threshold can be determined by: CONFIG GET hash-max-ziplist-entries.

Below this amount, a single cursor with all items will be returned, simply because cursors aren't free and they have overhead. So for small hashes and sets, Redis doesn't bother with cursors; it just sends everything.

Here's an example repro. Change the value of count between 511 and 513 (or whatever boundaries your server is using) to see the behavior change drastically:

using System;
using StackExchange.Redis;

using (var conn = ConnectionMultiplexer.Connect("localhost"))
{
    var count = 513; // try 511 vs. 513 to cross the default threshold
    var key = "myHash";
    var db = conn.GetDatabase();
    db.KeyDelete(key);

    // Populate the hash with `count` small entries.
    var entries = new HashEntry[count];
    for (var i = 0; i < count; i++)
    {
        entries[i] = new HashEntry("Item:" + i, i);
    }
    db.HashSet(key, entries);

    // The returned sequence also implements IScanningCursor.
    var response = db.HashScan(key);
    var cursor = (IScanningCursor)response;
    foreach (var entry in response)
    {
        Console.WriteLine($"{entry}, Cursor: {cursor.Cursor}, PageSize: {cursor.PageSize}, PageOffset: {cursor.PageOffset}");
    }
}

I added some tests here in e4ef33bda04bdba33da713de7ad431ecd24fb839 and Redis has updated their docs to better explain why this happens.