Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/
Other
9.16k stars 777 forks source link

[question] chain many xxhash32() together #701

Closed atesin closed 2 years ago

atesin commented 2 years ago

hi... imagine this situation

is there any way to get the same hash, splitting input data in chunks and chaining xxh32 functions together?... some like

seed 0 --->
--------
GET /myurl HTTP/1.1
Host: my.host
(headers)
---------
--> hash1

hash1 --->
---------
<head>
  <title>my big page</title>
  ...
</head>
---------
---> hash2

hash2 --->
--------
<body>
  <h1>my big page</h1>
  ...
</body>
--------
---> hash3

and get the same result than xxhash32( (http-headers) + (html-head) + (body) ) ??? .... which could be the way to achieve this just with xxh32? (some previous xor operation or something?)

i tried this, but with no success

xxh32("arduino", 0x0) --> 0xa6bb0028 

xxh32("ardu", 0x0)  --> 0xd6ee401e
xxh32("ino", 0xd6ee401e) -->0xbc2fbd92

as you see the results have nothing to do each other... is there any way to "correct" second (splitted) method to make it match with first (whole)??

easyaspi314 commented 2 years ago

You can do that with the streaming variant.

#include <stdint.h>
#include "xxhash.h"
uint32_t hash_arduino()
{
    XXH32_state_t *s = XXH32_createState();
    XXH32_reset(s, 0x0);
    XXH32_update(s, "ardu", 4);
    XXH32_update(s, "ino", 3);
    // add any more you wish
    uint32_t hash = XXH32_digest();
    XXH32_freeState(s);
    return s;
}
atesin commented 2 years ago

yes i'm aware of that .... i don't know if this straming version is compatible with "normal" XXH32 algo, i also use xxh32 in php because i need to produce the same hash between both platforms

additionally, afaik enabling the straming version would add a lot more resource consumption (sketch size, ram usage, cpu cycles)... i don't understand too much about C/C++ so i wrote the arduino library based on suggestions about speed and efficiency (remember atmega328p has just 32kb flash, 2kb ram and 16mhz 8bit cpu), and according the build modifiers i used, i think the streaming version is disabled:

XXH_INLINE_ALL
XXH_NO_STREAM
XXH_NO_STDLIB
XXH_NO_XXH3
XXH_NO_LONG_LONG

now the questions:

thanks for your responses

Cyan4973 commented 2 years ago
  • is this streaming version compatible with normal XXH32 algo?

Yes

  • how enabling streaming version would affect the sketch size and other system resources?

There is a state needed during the duration of streaming operation. It's about ~100 bytes.

The binary size would increase too. I don't know by how much exactly, this depends on too many external factors, I would generally expect something in the range of ~1 KB. Better measure it directly by a compilation exercise on your target system.

  • how can i enable this version (keeping efficiency?)?

Do not specify XXH_NO_STREAM.

For the scenario mentioned in this issue, you would just:

atesin commented 2 years ago

thanks for all @Cyan4973 ... i was doing the compiling excercise while you were writing... these are my findings

normal sketch + normal lib:

El Sketch usa 5050 bytes (15%) del espacio de almacenamiento de programa. El máximo es 32256 bytes.
Las variables Globales usan 239 bytes (11%) de la memoria dinámica, dejando 1809 bytes para las variables locales. El máximo es 2048 bytes.

normal sketch + updated lib:

El Sketch usa 5050 bytes (15%) del espacio de almacenamiento de programa. El máximo es 32256 bytes.
Las variables Globales usan 239 bytes (11%) de la memoria dinámica, dejando 1809 bytes para las variables locales. El máximo es 2048 bytes.

updated sketch + updated lib:

El Sketch usa 7282 bytes (22%) del espacio de almacenamiento de programa. El máximo es 32256 bytes.
Las variables Globales usan 258 bytes (12%) de la memoria dinámica, dejando 1790 bytes para las variables locales. El máximo es 2048 bytes.

as we can see, with the updated lib, sketch size are not affected if streaming functions are not used, but when it does it increases flash in about 2.3kb!... the updated lib has these 2 lines commented

//#define XXH_NO_STREAM
//#define XXH_NO_STDLIB

i also disabed the XXH_NO_STDLIB flag because it states "XXH*_createState() will always fail and return NULL"... the only difference between 2 sketches is i added the function by @easyaspi314 and a couple Serial.print() instructions.... btw easyaspi you got a little typo here: uint32_t hash = XXH32_digest(s);

more questions (3): does streaming version algos reserve some ram space for internal temporary buffers?.. how much?, could it work on the fly without those buffers, or shinking them at least?

i think i will update the arduino lib to add this capability soon, warning about system resources usage

thanks again

Cyan4973 commented 2 years ago

You can get away without using <stdlib> if you allocate the XXH32_state_t yourself, typically statically. In which case, you won't need XXH32_createState(), and therefore no malloc() nor free().

does streaming version algos reserve some ram space for internal temporary buffers?

Everything the streaming version needs is in the state. This includes a very small buffer, which is already counted within the state. There is no need for any additional buffer.

atesin commented 2 years ago

You can get away without using <stdlib>

that exceeds my tiny C++ knowledge... which benefit could we get in enabling/disabling <stdlib> in a platfom like atmega328p (arduino uno soc)?

how can we implement this in the best optimal way in a wrapper class for example?... i am trying to write a helper class to do this that works only with char arrays, and has only 1 private member XXH32_state_t, 1 constructor and 2 methods

private:
  XXH32_state_t stream;
public:
  Xxh32stream();
  static void Xxh32stream::ingest(char*);
  static char* Xxh32stream::digest(char*);

the constructor also resets the state, the digest also frees it, and is reentrant and returns the same supplied buffer pointer (just hex string output)...

i was tinking, as i had seen in some advanced libs, maybe we could benefit ingest method (update state) by extending Print class that has overloaded metods for many data types, and just "print" (update) input data to buffer state_t, without having to re-write methods (re-inventing the wheel) for many data types by ourselves... like this


Everything the streaming version needs is in the state.

if i understood right the state is actually (commit b1a61df, line 1129)

struct XXH32_state_s {
   XXH32_hash_t total_len_32; /*!< Total length hashed, modulo 2^32 */
   XXH32_hash_t large_len;    /*!< Whether the hash is >= 16 (handles @ref total_len_32 overflow) */
   XXH32_hash_t v[4];         /*!< Accumulator lanes */
   XXH32_hash_t mem32[4];     /*!< Internal buffer for partial reads. Treated as unsigned char[16]. */
   XXH32_hash_t memsize;      /*!< Amount of data in @ref mem32 */
   XXH32_hash_t reserved;     /*!< Reserved field. Do not read nor write to it. */
};   /* typedef'd to XXH32_state_t */

(i think the buffer you mean is mem32[4], right?)... i think XXH32_hash_t are actually uint32_t (4 bytes)... so if we have 4x hash_t and 2x arrays of [4], then we have 12 hash_t, which are 48 bytes, right?... funny thing is with streaming enabled and a couple Serial.print()'s sketch grows in more than 2 kilobytes :/ ... idk, maybe is for the new included libraries/functions, idk

i like this hash algo... is tiny, fast and efficient, and well supported

Cyan4973 commented 2 years ago

which benefit could we get in enabling/disabling in a platfom like atmega328p (arduino uno soc)?

This is for freestanding environments. In which case, <stdlib> is either not available, or when provided is statically linked into the application. In which case, not using it results in binary size savings.

But this may not be your case. In most environments, there is a libc runtime. Using it or not doesn't change the fact that it's already there, so there's much less to gain from not using it. Linking & symbols mostly.

if i understood right the state is actually (commit b1a61df, line 1129)

right

(i think the buffer you mean is mem32[4], right?)

right

funny thing is with streaming enabled and a couple Serial.print()'s sketch grows in more than 2 kilobytes :/

Serial.print() itself might add non negligible dependencies, depending if they are already used elsewhere or not. It's hard to say externally.

easyaspi314 commented 2 years ago

What about using -DXXH_SIZEOPT=2

atesin commented 2 years ago

ehh... sorry by answer later, i am running out of time by now ... i did some tests according your advices and could be able to run xxh32 in streaming mode in arduino not consuming too much memory, about the amount we talked, just when stream class is instanced when i have time i will update the lib for arduino... thanks