fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License

Three potential additions to stdlib #262

Open wclodius2 opened 3 years ago

wclodius2 commented 3 years ago

I now find myself with some free time that I would like to put into extending stdlib. I have some preliminary work on three topics that might be useful to users of stdlib: error codes, hash tables/unordered maps, and the Unicode database. I would appreciate some guidance as to which topic people would find most useful and whether the interest in any topic is high enough to justify inclusion in stdlib.

For error codes the idea is to standardize how stdlib identifies and distinguishes error conditions. I have a list of 100+ error code names, each with a unique integer value. The names currently follow a pattern language in which a suffix indicates the general source of the error: _failure, in which the outside world is not as expected; _error, in which the programmer likely made a mistake; and _fault, in which the processor is likely at fault. This pattern language makes the names a little unwieldy so I am considering dropping it. I have also defined a subroutine that takes an error code as an argument and stops processing with an error stop whose stop code is an error-code-dependent string. The resulting codes could be made part of stdlib_error.f90 and the subroutine part of an overloaded error_stop. FWIW this approach was inspired by a [posting by Richard A. O'Keefe on the Erlang mailing list](http://erlang.org/pipermail/erlang-questions/2015-March/083608.html).

The hash tables/unordered maps are composed of two or three modules: a module of hash functions, a module of hash tables using key/value pairs to implement an unordered map, and perhaps a module of hash tables using keys to implement an unordered set. The hash functions would have to be non-cryptographic to be legal in the US. Ideally there should be several different functions so the function can be changed in the event of excessive collisions. My idea is to use 64 bit hashes as reasonable for large data sets, though I can be persuaded to use 32 bit hashes. I have tentatively identified the Fowler-Noll-Vo (FNV) hash, the Murmur hash, and the Spooky hash as 64 bit public domain hash functions of interest. Let me know if you believe any of these hashes are cryptographic, or if you know of "better" hash functions. The hashes will be a transliteration to signed arithmetic, assuming two's complement arithmetic and that overflow detection can be turned off. The hash table will store the key and the value as (transferred) eight bit integer arrays. The tables can come in two implementations: chaining and open tables. They will use power of two table sizes, which assumes that the hash functions do an excellent job of randomizing bits.

The Unicode consortium maintains a database of character properties for the UCS character set. This database is the basis of the Python module unicodedata. This database is necessary for assigning code points to categories (e.g., number, letter, and punctuation), converting strings to one of the four normalization forms, and case folding. I consider this data necessary for the implementation of a UCS based varying string module. The initial implementation would be a module with procedures that use 32 bit integer scalars and arrays to represent UCS code points and strings. Several of the Unicode categories ("Bidi", "General Category", "Decomposition", and "Numerical") would be represented by enumerations in category specific types whose only components are eight bit integers. The text files in the UCD and Unihan directories that make up the database take up about 70.5 MBytes, so supporting this database requires a very significant download. The rank one arrays implementing this database take up a comparable amount of runtime storage.

everythingfunctional commented 3 years ago

I personally would vote for hash tables.

jvdp1 commented 3 years ago

I often need hash tables in my work, so I am in favor of hash tables ;) I often use a lookup3-based function.

Note that standardizing the error codes in stdlib would be useful too.

wclodius2 commented 3 years ago

@jvdp1 I have experimented with lookup3 and could include it. Which version do you use, the single 32 bit integer result or the dual 32 bit integer result?

jvdp1 commented 3 years ago

@jvdp1 I have experimented with lookup3 and could include it. Which version do you use, the single 32 bit integer result or the dual 32 bit integer result?

I use a simplified version of the single 32 bit integer function. Is it relevant to implement lookup3 if Spooky hash is available? Starting with the 3 hash functions you proposed would already be nice IMO.

wclodius2 commented 3 years ago

I guess the question is whether it would be useful to have a 32 bit specific hash in addition to the 64 bit hashes? A 32 bit hash:

  1. Is probably sufficient for the vast majority of hash table usages, i.e. tables with significantly less than one million entries;
  2. Will reduce memory traffic in hash comparisons and computations;
  3. Will slightly reduce memory requirements, though the requirements will likely be dominated by the keys and associated data.
jvdp1 commented 3 years ago

I think it could be good to start with 64 bit hashes. 32 bit hash functions may be implemented later if there is interest, or if they are straightforward to implement.

wclodius2 commented 3 years ago

Generally a thirty two bit hash is simpler than a sixty four bit one. FWIW I want to include a salt (an arbitrary integer of the appropriate kind that changes the mapping of the hash to potentially reduce hash bucket collisions) in the hash function algorithms. I can see two ways of doing it:

  1. Associate with each algorithm a derived type with different implementations (int8 arrays, int16 arrays, character strings, etc) as type bound procedures and the salt as its sole component; or
  2. Have the different implementations be stand alone procedures, with the salt as one argument and a component of the hash table, with the table initialization and remapping setting the salt and mapping of entries using the salt.

My initial implementation used the first, but now I am inclined to the second. Any preferences?

jvdp1 commented 3 years ago

Generally a thirty two bit hash is simpler than a sixty four bit one. FWIW I want to include a salt (an arbitrary integer of the appropriate kind that changes the mapping of the hash to potentially reduce hash bucket collisions) in the hash function algorithms. I can see two ways of doing it:

  1. Associate with each algorithm a derived type with different implementations (int8 arrays, int16 arrays, character strings, etc) as type bound procedures and the salt as its sole component; or
  2. Have the different implementations be stand alone procedures, with the salt as one argument and a component of the hash table, with the table initialization and remapping setting the salt and mapping of entries using the salt.

My initial implementation used the first, but now I am inclined to the second. Any preferences?

I would prefer the second one. But the first one sounds also good!

milancurcic commented 3 years ago

I prefer the second implementation. Thank you for taking the lead, @wclodius2!

wclodius2 commented 3 years ago

With @milancurcic weighing in we now have three people interested in hash functions/tables. Maybe that's enough for inclusion in stdlib. If so the first thing we should do is agree on an API for the hash "functions."

Currently I have two modules for hash functions, stdlib_32_bit_hash_functions.f90 and stdlib_64_bit_hash_functions.f90, for 32 bit and 64 bit hashes respectively. All of the algorithms are implemented in Fortran (I believe F2008), transliterated to signed arithmetic under the assumption that the two's complement arithmetic used by all modern Fortran processors will, at the bit level, behave the same as unsigned arithmetic for multiplication, XOR, rotation, and addition. Currently the procedures are functions, as that felt more natural to me, but that is awkward if incremental hashing is desired. Conversion to subroutines for incremental hashing is straightforward, if tedious. All of the algorithms are in the public domain, but Austin Appleby's Murmur hashes require that all implementations be in the public domain. I take that to mean that the code, if distributed with stdlib, can be MIT licensed, but must also be available independent of stdlib as public domain source code; we should probably check this. All algorithms come in five versions, one each for integer vectors of kinds INT8, INT16, INT32, and INT64, and one for default character strings.

stdlib_32_bit_hash_functions.f90 currently implements 4 hash algorithms: a 32 bit FNV-1 algorithm of Fowler, Noll and Vo that implements a multiply and XOR algorithm; a 32 bit FNV-1a algorithm that reverses the order of the multiply and XOR; the 32 bit Murmur Hash 3 of Austin Appleby that multiplies the current block of input, rotates the product, and XORs with the current hash; and the lookup 3 hash of Bob Jenkins, that uses additions, subtractions, rotations, and xors. FNV-1 and FNV-1a use no salt. Murmur Hash 3 uses a 32 bit integer for its salt that can also be used for incremental hashing. Lookup 3 uses two 32 bit integers for its salt that can also be used for incremental hashing if the option to return two 32 bit integers as its hash is used.

stdlib_64_bit_hash_functions.f90 currently implements 4 hash algorithms: a 64 bit FNV-1 algorithm of Fowler, Noll and Vo that implements a multiply and XOR algorithm; a 64 bit FNV-1a algorithm that reverses the order of the multiply and XOR; the 64 bit Murmur Hash 2 of Austin Appleby that multiplies the current block of input, rotates the product, and XORs with the current hash; and the 64 bit Spooky hash of Bob Jenkins, that uses additions, subtractions, rotations, and XORs. FNV-1 and FNV-1a use no salt. Murmur Hash 2 uses a 64 bit integer for its salt that can also be used for incremental hashing. Spooky uses two 64 bit integers for its salt that can also be used for incremental hashing if the option to return two 64 bit integers as its hash is used.

The functions have been tested using a code that generates 2**30 random bytes, and processes them in blocks of 16, 64, 256, and 1024 bytes. Compilation was with ifort 17 as it was giving better error messages than gfortran. For ifort on a MacBook Pro, 2.3 GHz Quad-Core Intel Core i5, with 8 GB 2133 MHz LPDDR3 memory, the hashes of Bob Jenkins were generally the fastest, followed by the hashes of Austin Appleby, followed by the FNV hashes. The 64 bit hashes could be processed in larger chunks than the 32 bit hashes, so the 64 bit hashes were generally faster than the equivalent 32 bit hash. The times in seconds for the different algorithms on the different block sizes are shown in the tables below, where the smaller the time the faster the processing.

32 bit hashes:

| Algorithm | Key size (bytes) | Number of keys | Time (s) |
|-----------|------------------|----------------|----------|
| FNV-1     | 1024             | 1048576        | 8.054    |
| FNV-1     | 256              | 4194304        | 8.056    |
| FNV-1     | 64               | 16777216       | 8.080    |
| FNV-1     | 16               | 67108864       | 8.397    |
| FNV-1a    | 1024             | 1048576        | 8.975    |
| FNV-1a    | 256              | 4194304        | 8.953    |
| FNV-1a    | 64               | 16777216       | 9.090    |
| FNV-1a    | 16               | 67108864       | 9.240    |
| Murmur3   | 1024             | 1048576        | 3.875    |
| Murmur3   | 256              | 4194304        | 3.962    |
| Murmur3   | 64               | 16777216       | 4.057    |
| Murmur3   | 16               | 67108864       | 4.428    |
| Lookup3   | 1024             | 1048576        | 3.558    |
| Lookup3   | 256              | 4194304        | 3.573    |
| Lookup3   | 64               | 16777216       | 3.728    |
| Lookup3   | 16               | 67108864       | 4.156    |

64 bit hashes:

| Algorithm | Key size (bytes) | Number of keys | Time (s) |
|-----------|------------------|----------------|----------|
| FNV-1     | 1024             | 1048576        | 6.002    |
| FNV-1     | 256              | 4194304        | 6.007    |
| FNV-1     | 64               | 16777216       | 6.092    |
| FNV-1     | 16               | 67108864       | 6.399    |
| FNV-1a    | 1024             | 1048576        | 6.508    |
| FNV-1a    | 256              | 4194304        | 6.564    |
| FNV-1a    | 64               | 16777216       | 6.625    |
| FNV-1a    | 16               | 67108864       | 6.933    |
| Murmur2   | 1024             | 1048576        | 1.605    |
| Murmur2   | 256              | 4194304        | 1.625    |
| Murmur2   | 64               | 16777216       | 1.723    |
| Murmur2   | 16               | 67108864       | 2.113    |
| Spooky    | 1024             | 1048576        | 0.370    |
| Spooky    | 256              | 4194304        | 0.631    |
| Spooky    | 64               | 16777216       | 1.321    |
| Spooky    | 16               | 67108864       | 4.512    |

jvdp1 commented 3 years ago

@wclodius2 Thanks for all these details and implementations. Times look good IMO.

Regarding the API, I would implement them as functions too. I don't see the issue with functions and incremental hashing.

wclodius2 commented 3 years ago

@jvdp1 thanks for the interest.

The timings may be misleading as all of the tests are of multiples of 16 bytes. If the number of bytes in the key is not a multiple of four for the 32 bit hashes, or eight for the 64 bit hashes then Austin Appleby and Bob Jenkins treat the trailing bytes using a switch statement that by itself introduces overhead. Unfortunately, the Fortran equivalent of the switch statement, a computed go to, is deprecated, so I replaced it by a select case statement whose cases are all go to statements to the fall through code of the switch statement. This can in principle be optimized to the equivalent of the switch statement, but I don't know if all compilers can optimize the consecutive double jumps to their equivalent single jump. Therefore I expect that the overhead may be larger for my code.

With regard to functions versus subroutines, I let myself be misled by Bob Jenkins' comments that he recommended the two word 32 bit hashes of Lookup3, and the two word 64 bit hashes of Spooky Hash and based my code on the void function versions of his hashes that input and returned the values in two pointers to unsigned integers. The code would be more natural if I based it on his integer returning functions. The rewrite would be relatively small and would make the API for the hash tables slightly simpler so I plan to implement it over the next week.

jvdp1 commented 3 years ago

Thank you @wclodius2 for the details. I also used select case statement to "translate" the switch statement in the Lookup3 functions. I don't see how to avoid it.

Re: functions vs subroutines, I think that subroutines are ok when returning 2 integers.

wclodius2 commented 3 years ago

I have some new numbers for the runtimes of the various algorithms that include small size keys.

32 bit hashes:

| Algorithm | Key size (bytes) | Number of keys | Time (s) |
|-----------|------------------|----------------|----------|
| FNV-1     | 1                | 1073741824     | 12.947   |
| FNV-1     | 2                | 536870912      | 10.383   |
| FNV-1     | 4                | 268435456      | 9.102    |
| FNV-1     | 8                | 134217728      | 8.425    |
| FNV-1     | 16               | 67108864       | 8.286    |
| FNV-1     | 64               | 16777216       | 7.853    |
| FNV-1     | 256              | 4194304        | 7.792    |
| FNV-1     | 1024             | 1048576        | 7.763    |
| FNV-1a    | 1                | 1073741824     | 13.565   |
| FNV-1a    | 2                | 536870912      | 11.181   |
| FNV-1a    | 4                | 268435456      | 9.948    |
| FNV-1a    | 8                | 134217728      | 9.226    |
| FNV-1a    | 16               | 67108864       | 8.959    |
| FNV-1a    | 64               | 16777216       | 8.725    |
| FNV-1a    | 256              | 4194304        | 8.660    |
| FNV-1a    | 1024             | 1048576        | 8.935    |
| Murmur3   | 1                | 1073741824     | 19.123   |
| Murmur3   | 2                | 536870912      | 13.627   |
| Murmur3   | 4                | 268435456      | 5.931    |
| Murmur3   | 8                | 134217728      | 4.985    |
| Murmur3   | 16               | 67108864       | 4.281    |
| Murmur3   | 64               | 16777216       | 3.883    |
| Murmur3   | 256              | 4194304        | 3.771    |
| Murmur3   | 1024             | 1048576        | 3.750    |
| Lookup3   | 1                | 1073741824     | 18.478   |
| Lookup3   | 2                | 536870912      | 9.323    |
| Lookup3   | 4                | 268435456      | 5.377    |
| Lookup3   | 8                | 134217728      | 4.312    |
| Lookup3   | 16               | 67108864       | 3.983    |
| Lookup3   | 64               | 16777216       | 3.551    |
| Lookup3   | 256              | 4194304        | 3.438    |
| Lookup3   | 1024             | 1048576        | 3.410    |

64 bit hashes:

| Algorithm | Key size (bytes) | Number of keys | Time (s) |
|-----------|------------------|----------------|----------|
| FNV-1     | 1                | 1073741824     | 12.426   |
| FNV-1     | 2                | 536870912      | 9.122    |
| FNV-1     | 4                | 268435456      | 7.525    |
| FNV-1     | 8                | 134217728      | 7.014    |
| FNV-1     | 16               | 67108864       | 6.211    |
| FNV-1     | 64               | 16777216       | 5.888    |
| FNV-1     | 256              | 4194304        | 5.831    |
| FNV-1     | 1024             | 1048576        | 5.788    |
| FNV-1a    | 1                | 1073741824     | 12.860   |
| FNV-1a    | 2                | 536870912      | 9.669    |
| FNV-1a    | 4                | 268435456      | 8.130    |
| FNV-1a    | 8                | 134217728      | 7.268    |
| FNV-1a    | 16               | 67108864       | 6.753    |
| FNV-1a    | 64               | 16777216       | 6.434    |
| FNV-1a    | 256              | 4194304        | 6.555    |
| FNV-1a    | 1024             | 1048576        | 6.699    |
| Murmur2   | 1                | 1073741824     | 14.914   |
| Murmur2   | 2                | 536870912      | 10.898   |
| Murmur2   | 4                | 268435456      | 8.541    |
| Murmur2   | 8                | 134217728      | 2.550    |
| Murmur2   | 16               | 67108864       | 2.040    |
| Murmur2   | 64               | 16777216       | 1.673    |
| Murmur2   | 256              | 4194304        | 1.572    |
| Murmur2   | 1024             | 1048576        | 1.540    |
| Spooky    | 1                | 1073741824     | 25.808   |
| Spooky    | 2                | 536870912      | 16.345   |
| Spooky    | 4                | 268435456      | 8.723    |
| Spooky    | 8                | 134217728      | 3.415    |
| Spooky    | 16               | 67108864       | 2.413    |
| Spooky    | 64               | 16777216       | 0.750    |
| Spooky    | 256              | 4194304        | 0.424    |
| Spooky    | 1024             | 1048576        | 0.294    |

For the API of the 32 bit hash functions I propose the following:

interface hashname_32_bits
    function int8_hashname_wo_seed(key)
        import int8, int32
        integer(int32) :: int8_hashname_wo_seed
        integer(int8), intent(in) :: key(:)
    end function int8_hashname_wo_seed
    function int16_hashname_wo_seed(key)
        import int16, int32
        integer(int32) :: int16_hashname_wo_seed
        integer(int16), intent(in) :: key(:)
    end function int16_hashname_wo_seed
    ...
    function int8_hashname_w_seed(key, seed)
        import int8, int32
        integer(int32) :: int8_hashname_w_seed
        integer(int8), intent(in) :: key(:)
        integer(int32), intent(in) :: seed !  perhaps pass by value
    end function int8_hashname_w_seed
    ...
end interface hashname_32_bits

with something similar for 64 bit hash functions.

I suspect the primary use of the hash functions will be for hash tables with typically 4-16 element keys, probably weighted to the low end of that range, typically drawn from the ASCII character set. For UTF-8 character sets, e.g. Linux and Macs, these will probably be 8-16 bytes, while for UTF-16, e.g., Windows platforms, this might translate to 8-32 byte keys. For these applications 128 bits are overkill. The 32 bit architectures are fading fast, so we should emphasize algorithms tuned for 64 bit architectures. The 32 bit hashes perform best for 4 byte keys, while the 64 bit hashes perform best for 16 and 32 byte keys. For the size range of primary interest the FNV hashes perform worse than the other 32 bit hashes, while for 64 bits the FNV hashes are better for four byte keys but worse for anything 8 bytes or larger. The FNV hashes are also generally more prone to collisions than the other hashes and do not fare as well in the SMHasher tests or other tests. The FNV hashes also do not support a seed argument. I therefore propose to replace the FNV hashes.

The most detailed analysis of a large variety of hash functions is that of Jim Apple and Reini Urban, which uses an extended version of SMHasher. This analysis finds the following hashes to be the best performing with no detected problems:

In addition, waterhash, a 32 bit hash inspired by wyhash, claims to also pass the SMHasher tests.

The first question that comes to mind in incorporating translated code is whether the associated license allows incorporation into MIT licensed code. The following hashes are obviously compatible as they use the MIT license: Farmhash, [Fasthash](https://code.google.com/archive/p/fast-hash/), and Mir. The hashes that use the unlicense also appear to be compatible: water hash and wyhash. The BSD 2-clause license requirement to list the copyrights seems to me to be incompatible with the MIT License, so pengyhash and xxh3 appear to be excluded. The mum license also requires copyright publication, but is clearer that modifications are allowed, so I think it is MIT compatible. Halftime hash has no explicit license associated with it, so I think it must be excluded. The zlib license of t1ha2 reads to me as compatible with the MIT license. Spooky doesn't have an explicit license, but I have put in a lot of work on it, so I am going to send enquiries to Bob Jenkins. Mx3 uses the CC0 (Creative Commons) license with "no rights reserved", which seems to me to be very MIT compatible.

Another determining factor for me is practicality of implementation. I am not an expert in C++, so I have less confidence in accurately translating longer codes. As a rough rule of thumb I think I can handle codes originally less than 500 Source Lines of Code (SLOC), and am unwilling to tackle codes longer than 750 SLOC. The results of the above analysis are summarized in the table below.

| Hash          | License       | MIT compatible | SLOC    | Acceptable complexity | Library acceptable |
|---------------|---------------|----------------|---------|-----------------------|--------------------|
| Farmhash      | MIT license   | Yes            | ~11,650 | No                    | No                 |
| Fasthash      | MIT license   | Yes            | ~80     | Yes                   | Yes                |
| Halftime hash | ?             | No?            | 951     | No                    | No                 |
| Mir Hash      | MIT license   | Yes            | 82      | Yes                   | Yes                |
| Mum hash      | Public domain | Yes?           | 380     | Yes                   | Yes?               |
| Mx3           | CC0           | Yes            | 63      | Yes                   | Yes                |
| Pengy hash    | BSD 2-Clause  | No?            | 29      | Yes                   | No?                |
| t1ha2         | zlib          | Yes?           | ~1500   | No                    | No                 |
| Spooky        | Public Domain | ?              | 400?    | Yes                   | ?                  |
| waterhash     | unlicense     | Yes            | 50      | Yes                   | Yes                |
| wyhash        | unlicense     | Yes            | 142     | Yes                   | Yes                |
| xxh3          | BSD 2-Clause  | No?            | 5020    | No                    | No                 |

The reported peak processing speeds in MiB/sec (representative of processing large keys) for the acceptable hashes are:

| Hash       | MiB/sec                       |
|------------|-------------------------------|
| Fasthash   | 4737.61                       |
| Mir Hash   | 5413.73                       |
| Mum hash   | 7134.56                       |
| Mx3        | 6146.02                       |
| Spooky     | 9747.13                       |
| water hash | 8421.17 (different processor) |
| wyhash     | 14789.93                      |

Obviously wyhash is of particular interest for processing large keys. For processing small keys we want a code with high performance and negligible startup cost. The startup cost tends to increase with the overall size of the code so for small keys we want high performance with relatively small size. That suggests that one of Mir hash, Mx3, or water hash would be best for small keys. FWIW waterhash is a 32 bit hash, while Mir hash and Mx3 are 64 bit hashes.

jvdp1 commented 3 years ago

I have some new numbers for the runtimes of the various algorithms that include small size keys.

Impressive work @wclodius2. Lots of different implementations

interface hashname_32_bits
    function int8_hashname_wo_seed(key)
        import int8, int32
        integer(int32) :: int8_hashname_wo_seed
        integer(int8), intent(in) :: key(:)
    end function int8_hashname_wo_seed
    ...
    function int8_hashname_w_seed(key, seed)
        import int8, int32
        integer(int32) :: int8_hashname_w_seed
        integer(int8), intent(in) :: key(:)
        integer(int32), intent(in) :: seed !  perhaps pass by value
    end function int8_hashname_w_seed
    ...
end interface hashname_32_bits

Why is the seed not simply an optional, intent(in) argument?

... The FNV also do not support a seed argument. I therefore propose to replace the FNV hashes.

This is fine for me.

The hashes that use the unlicense also appear to be compatible: water hash and wyhash.

They actually seem to be in the public domain, so these hashes should be compatible with the MIT license.

The mum license also requires copyright publication, but is clearer that modifications are allowed, so I think it is MIT compatible.

The copyright text in mum.h is the same as the one for the MIT license (if I checked it correctly). So it is ok for me.

The zlib license of t1ha2 reads to me as compatible with the MIT license.

I agree!

Spooky doesn't have an explicit license, but I have put in a lot of work on it so I am going to send enquiries to Bob Jenkins.

The files mention that it is in public domain. Therefore, I don't see any issue.

lookup3.c even contains the following lines:

 It's in the public domain.  It has no warranty.
....

By Bob Jenkins, 2006.  bob_jenkins@burtleburtle.net.  You may use this
code any way you wish, private, educational, or commercial.  It's free.

Mx3 uses the CC0 (creative commons) license with "no rights reserved" that seems to me to be very MIT compatible.

CC0 is similar to "public domain" (following wikipedia). Therefore, it should be compatible with MIT license.

| Hash          | License       | MIT compatible | SLOC    | Acceptable complexity | Library acceptable |
|---------------|---------------|----------------|---------|-----------------------|--------------------|
| Farmhash      | MIT license   | Yes            | ~11,650 | No                    | No                 |
| Fasthash      | MIT license   | Yes            | ~80     | Yes                   | Yes                |
| Halftime hash | ?             | No?            | 951     | No                    | No                 |
| Mir Hash      | MIT license   | Yes            | 82      | Yes                   | Yes                |
| Mum hash      | Public domain | Yes?           | 380     | Yes                   | Yes?               |
| Mx3           | CC0           | Yes            | 63      | Yes                   | Yes                |
| Pengy hash    | BSD 2-Clause  | No?            | 29      | Yes                   | No?                |
| t1ha2         | zlib          | Yes?           | ~1500   | No                    | No                 |
| Spooky        | Public Domain | ?              | 400?    | Yes                   | ?                  |
| waterhash     | unlicense     | Yes            | 50      | Yes                   | Yes                |
| wyhash        | unlicense     | Yes            | 142     | Yes                   | Yes                |
| xxh3          | BSD 2-Clause  | No?            | 5020    | No                    | No                 |

I don't see lookup3 in the above table. Is it on purpose?

The reported peak processing speed in MiB/sec (representative of processing large keys) for the acceptable hashes are

| Hash       | MiB/sec                       |
|------------|-------------------------------|
| Fasthash   | 4737.61                       |
| Mir Hash   | 5413.73                       |
| Mum hash   | 7134.56                       |
| Mx3        | 6146.02                       |
| Spooky     | 9747.13                       |
| water hash | 8421.17 (different processor) |
| wyhash     | 14789.93                      |

Obviously wyhash is of particular interest for processing large keys. For processing small keys we want a code with high performance and negligible startup cost. The startup cost tends to increase with the overall size of the code so for small keys we want high performance with relatively small size. That suggests that one of Mir hash, Mx3, or water hash would be best for small keys. FWIW waterhash is a 32 bit hash, while Mir hash and Mx3 are 64 bit hashes.

Interesting analysis! Users will have a lot of choices!

wclodius2 commented 3 years ago

On Feb 26, 2021, at 11:51 AM, Jeremie Vandenplas notifications@github.com wrote:

Why is the seed not simply an optional, intent(in) argument?

The optional attribute results in an initial test and branch and so should have slightly poorer performance for small keys.

By Bob Jenkins, 2006. bob_jenkins@burtleburtle.net. You may use this code any way you wish, private, educational, or commercial. It's free. Mx3 uses the CC0 (creative commons) license with "no rights reserved" that seems to me to be very MIT compatible. CC0 is similar to "public domain" (following wikipedia ). Therefore, it should be compatible with MIT license. Hash License MIT compatible SLOC Acceptable Complexity Library Acceptable Farmhash MIT license Yes ~11,650 No No Fasthash MIT license Yes ~80 Yes Yes Halftime hash ? No? 951 No No Mir Hash MIT license Yes 82 Yes Yes Mum hash Public domain Yes? 380 Yes Yes? Mx3 CC0 Yes 63 Yes Yes Pengy hash BSD 2-Clause No? 29 Yes No? t1ha2 zlib Yes? ~1500 No No Spooky Public Domain ? 400? Yes ? waterhash unlicense Yes 50 Yes Yes wyhash unlicense Yes 142 Yes Yes xxh3 BSD 2-Clause No? 5020 No No I don't see lookup3 in the above table. Is it on purpose?

lookup3 does not pass all the hash quality tests of SMHasher. Neither do MURMUR2 and MURMUR3. I propose to replace them in my current codes.


What I propose to do:

  1. Replace Murmur2 and Murmur3 with wyhash.
  2. Replace FNV-1 and FNV-1a in 32 bit hashes with water hash.
  3. Replace FNV-1 and FNV-1a in 64 bit hashes with either Mir hash or Mx3.
  4. Replace lookup3 with the 32 bit version of Spooky hash.


jvdp1 commented 3 years ago

lookup3 does not pass all the hash quality tests of SMHasher. Neither do MURMUR2 and MURMUR3. I propose to replace them in my current codes.

I didn't know that (actually I never checked it). Thank you for the information.

What I propose to do:

  1. Replace Murmur2 and Murmur3 with wyhash.
  2. Replace FNV-1 and FNV-1a in 32 bit hashes with water hash.
  3. Replace FNV-1 and FNV-1a in 64 bit hashes with either Mir hash or Mx3.
  4. Replace lookup3 with the 32 bit version of Spooky hash.

It sounds like a good and robust plan.

wclodius2 commented 3 years ago

After checking the source code for the algorithms of interest, I found that all of the recent C/C++ codes use a seed argument. I therefore propose to require a seed argument and use the following effective API:

interface hashname_32_bits
    function int8_hashname(key, seed)
        import int8, int32
        integer(int32) :: int8_hashname
        integer(int8), intent(in) :: key(:)
        integer(int32), intent(in) :: seed
    end function int8_hashname
    function int16_hashname(key, seed)
        import int16, int32
        integer(int32) :: int16_hashname
        integer(int16), intent(in) :: key(:)
        integer(int32), intent(in) :: seed
    end function int16_hashname
    ...
    function char_hashname(key, seed)
        import int32
        integer(int32) :: char_hashname
        character(*), intent(in) :: key
        integer(int32), intent(in) :: seed
    end function char_hashname
end interface hashname_32_bits
jvdp1 commented 3 years ago

I therefore propose to require a seed argument and use the effective API

It sounds good to me. Can these functions be pure? It is probably better to discuss this when the PR is submitted.

wclodius2 commented 3 years ago

The ones I am considering are all pure. FWIW the 32 bit hashers all seem to use a 64 bit seed.

wclodius2 commented 3 years ago

FWIW I am considering having a header, similar to the following, in each pertinent file:

!!------------------------------------------------------------------------------
!! WATER_HASH is a translation to Fortran 2008 of the waterhash algorithm
!! of Tommy Ettinger. Tommy Ettinger's original C++ code, waterhash.h, is
!! available at the URL: https://github.com/tommyettinger/waterhash under the
!! unlicense, https://github.com/tommyettinger/waterhash/blob/master/LICENSE.
!! "waterhash is a variant on Wang Yi's wyhash, with 32 bit output,
!! using at most 64 bit arithmetic. wyhash is available at the URL:
!! https://github.com/wangyi-fudan/wyhash also under the unlicense:
!! https://github.com/wangyi-fudan/wyhash/blob/master/LICENSE.
!! Original Author: Wang Yi godspeed_china@yeah.net
!! Waterhash Variant Author: Tommy Ettinger tommy.ettinger@gmail.com
!!
!! The unlicense reads as follows:
!! This is free and unencumbered software released into the public domain.
!!
!! Anyone is free to copy, modify, publish, use, compile, sell, or
!! distribute this software, either in source code form or as a compiled
!! binary, for any purpose, commercial or non-commercial, and by any
!! means.
!!
!! In jurisdictions that recognize copyright laws, the author or authors
!! of this software dedicate any and all copyright interest in the
!! software to the public domain. We make this dedication for the benefit
!! of the public at large and to the detriment of our heirs and
!! successors. We intend this dedication to be an overt act of
!! relinquishment in perpetuity of all present and future rights to this
!! software under copyright law.
!!
!! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
!! EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
!! MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
!! IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
!! OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
!! ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
!! OTHER DEALINGS IN THE SOFTWARE.
!!
!! For more information, please refer to http://unlicense.org
!!
!! WATER_HASH is distributed as part of the stdlib_32_bit_hash_functions.f90
!! module and its stdlib_32_bit_water_hashes.f90 submodule with the Fortran
!! Standard Library at URL: https://github.com/fortran-lang/stdlib.
!! The Fortran Standard Library, including this code, is distributed under the
!! MIT License as described in the LICENSE file distributed with the library.
!! WATER_HASH differs from waterhash.h not only in its use of Fortran,
!! but also in its use of signed two's complement arithmetic in contrast to
!! the unsigned arithmetic of Ettinger and Wang Yi, and in making some of the
!! uses of TRANSFER endian dependent, in an attempt to make the quality of
!! the hash endian independent. The use of signed arithmetic may change with
!! the planned introduction of the unsigned BITS datatype in what is currently
!! known as Fortran 202X.
!!
!! To be useful this code must be processed by a processor that implements two
!! Fortran 2008 extensions to Fortran 2003: submodules, and 64 bit (INT64)
!! integers. The processor must also use two's complement integers
!! (all Fortran 95+ processors use two's complement arithmetic) with
!! wrap around overflow at runtime and for BOZ constants. The latest releases
!! of the following processors are known to implement the required Fortran
!! 2008 extensions and default to runtime wrap around overflow: FLANG,
!! gfortran, ifort, and NAG Fortran. Older versions of gfortran will require
!! the compiler flag, -fno-range-check, to ensure wrap around semantics
!! for BOZ constants, and only versions of the NAG compiler starting with
!! version 17, have implemented submodules. The latest releases of Cray
!! Fortran and IBM Fortran are known to implement the Fortran 2008 extensions,
!! but whether they also implement wrap around overflow is unknown.
!!
!! This implementation has only been tested on little endian processors. It
!! will generate different hashes on big endian processors, but they are
!! believed to be of comparable quality to those generated for little endian
!! processors.
!!
!! No version of this hash is suitable as a cryptographic hash.
!!------------------------------------------------------------------------------

jvdp1 commented 3 years ago

@wclodius2 It sounds good to me, and fair towards the original authors.

wclodius2 commented 3 years ago

Richter et al, "A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing" is a good overview of the tradeoffs between hash function methods and hash table methods. M. Thorup, "High Speed Hashing for Integers and Strings" is a good overview of some efficient, simple hash functions, unfortunately with a different API from the better-known hash functions.

wclodius2 commented 3 years ago

I now have a number of different hash functions, most with different APIs. To narrow down the number of hash functions, I decided to look at what I should be using for a hash table. My current implementations are inspired by a simple C based table of David Chase. I decided that this was probably not as well optimized as we would like for the standard library, and that I should look at some modern implementations of hash tables, particularly ones that give better performance on 64 bit processors. I then found the benchmark by Martin Ankerl, Hashmaps Benchmarks - Overview, in which he examines 20 different C++ hash table implementations with five different hashing functions. All of the ones he recommends make too much use of templates (and often assembly) to translate simply to Fortran 2008. He also recommends the YouTube lecture CppCon 2017: Matt Kulukundis “Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step”. There is also a more limited, slightly older evaluation of hash table implementations, Revisiting hash table performance. I also found some guides on hash table techniques: Advanced techniques to implement fast hash tables, and Deletion from hash tables without tombstones. My conclusions from the above and Richter et al for a Fortran Standard Library hash table:

  1. Use 64 bit hashes: their calculation and use on 64 bit machines are cheaper than for 32 bit hashes, and 32 bit machines are disappearing rapidly.
  2. Consider using a secondary hash function to map hash values to bucket indices.
  3. Consider using as secondary hash functions Martin Ankerl's hash functions, hash_int and hash, from his [robin_hood.h](https://github.com/martinus/robin-hood-hashing/blob/master/src/include/robin_hood.h) hash table code.
  4. Use an open table, as they are consistently more efficient than chaining tables for small keys.
  5. Consider using Robin Hood probing to resolve collisions in the open table, as it is used by all but the most sophisticated high performance tables.
  6. Provide two categories of hash tables: hash_map (for key-value pairs) and hash_set (for keys without values).
  7. Allow two different types of keys in each category: INT8 arrays as stand-ins for integer arrays and character strings, and simple INT64 scalars as stand-ins for any scalar type key.
  8. Consider a new way of dealing with deleted entries.
  9. Examine other C implementations for ideas, e.g. khash.h.
awvwgk commented 3 years ago

I added a waterhash implementation in #452; testing against the C version, my Fortran port seems to work correctly for all cases I have checked so far.

wclodius2 commented 2 years ago

I have uploaded a markdown document describing my proposed API for hash functions in the Fortran Standard Library at hash_functions.md. I have largely followed the guidelines of Reini Urban, providing very simple functions (FNV-1 and FNV-1a) for use in typical tables, and functions with no known statistical problems and reasonable performance, particularly on large keys (nmhash32, nmhash32x, waterhash, pengy hash, and spooky hash), for more sophisticated applications.

FWIW Reini Urban lists the fastest hashes that have no known statistical problems. Listed in order of decreasing performance, with my comments on their potential implementation:

wclodius2 commented 2 years ago

I have now discovered that the version of Spooky being tested in SMHasher is version 1, while the version I have implemented is V2, which, apparently, does not pass the SMHasher tests. I am surprised at this, as the described differences in implementation are such that I would expect V2 to be more likely to pass the tests than V1. Given how small the differences between the versions are, I think it will be straightforward to "update" the code to V1.

wclodius2 commented 2 years ago

I have "updated" my version of SpookyHash to version 1. I have also updated hash_functions.md to:

  1. Reflect the changes in spooky hash.
  2. Reformat the examples so that indentation is consistent.
  3. Change the discussion of the tests to reflect changes made to the tests to reduce their runtime.

I am now considering uploading my source codes. Can I use Sebastian Ehlert's (@awvwgk) waterhash PR as a guide for developing an Fpm project for them?

jvdp1 commented 2 years ago

I am now considering uploading my source codes. Can I use Sebastian Ehlert's (@awvwgk) waterhash PR as a guide for developing an Fpm project for them?

@wclodius2 I am not sure I understand your question correctly. Once your code is integrated within stdlib, it will automatically be available through fpm.

wclodius2 commented 2 years ago

I am not certain I understand my question. I think I wrote it when I was tired.

wclodius2 commented 2 years ago

FWIW I have just released a proposed hash map API on the website with my proposed hash function API. The API defines three modules: stdlib_32_bit_key_data_wrapper, stdlib_chaining_hash_map and stdlib_open_hash_map.

The stdlib_32_bit_key_data_wrapper module does the following:

- defines an abstract interface for hash functions;
- provides wrappers for some hash functions so they match that interface;
- defines a key datatype, key_type, whose contents are currently not private, but which can be made opaque with setters and getters that take INT8 vectors and character strings as arguments;
- defines another datatype, other_type, for data that supplements the key, whose contents are currently not private, but which can likewise be made opaque with setters and getters that take INT8 vectors and character strings as arguments.

The module stdlib_chaining_hash_map implements a datatype, chaining_hash_map_type, for a simple separate chaining hash map. The API provides most of the functionality @awvwgk requests except it doesn't provide a procedure for returning a list of all keys, but that procedure is easy to implement.

The module stdlib_open_hash_map implements a datatype, open_hash_map_type, for a simple open addressing hash map with linear probing for insertion. The API doesn't provide for returning a list of all keys, but that is easy to implement, and it doesn't provide a procedure for deleting elements. It is possible to provide an element deletion procedure for an open addressing hash map, but it is tricky to implement and its runtime cost is significantly higher than for a chaining hash map.

The APIs for the two data types are very similar. The init_map procedure for open_hash_map_type has an additional load_factor argument missing from the chaining_hash_map_type, and the open_hash_map_type also has an additional load_factor inquiry function. The chaining_hash_map_type, in turn, has a remove_entry subroutine that open_hash_map_type lacks.

The two APIs also have some inquiry functions on the structure and history of the hash maps that could be eliminated. In particular, I think the total_depth function is of little use to casual users.

wclodius2 commented 2 years ago

FWIW I have just changed the implementation of key_type and other_type to be opaque with getters and setters and have updated hash_maps.md accordingly.

jvdp1 commented 2 years ago

@wclodius2 it seems you are quite advanced in the code and specs. Would it be a good time to open a PR? If you plan to do so, I suggest that you open 2-3 different PRs (e.g., one for only the hash functions, one for the hash maps, ..., if possible of course). Otherwise, based on your description, I am afraid that a single PR could be too long to review efficiently.

awvwgk commented 2 years ago

What is the motivation behind other_type? Would this data also be used to identify the entry in the map, or is that determined only by the key? Additional data for a key would be part of the value instead, wouldn't it?

wclodius2 commented 2 years ago

The hash map codes should not be put into a PR independent of the hash functions. The hash map and hash function codes should either be put in a single PR, or the hash functions put into a PR and the hash map codes put into a PR only after the hash function PR is accepted. The codes are almost ready for release as a PR, but before I do so I want to do at least the following:

  1. Create a test code that checks the validity of the hash functions by comparing their outputs to the outputs of the C equivalents. While for the shorter codes I feel that inspection is sufficient, the Spooky hash is complicated enough that I want to test it, and as long as I am testing that, it will be easy to test the other codes as well. Unfortunately I am not experienced with C interoperability.
  2. Add a remove_entry subroutine to the stdlib_open_hash_map module. I have finally figured out how to remove an entry from an open addressing hash map.

In the meantime I would appreciate any comments anyone may have on the proposed APIs in the markdown documents hash_functions.md and hash_maps.md.

jvdp1 commented 2 years ago

The hash map and hash function codes should either be put in a single PR, or the hash functions put into a PR and the hash map codes put into a PR only after the hash function PR is accepted.

An alternative is to open 2 PR, one (hash functions) ready for reviewing and one (hash map) as a draft with an explicit mention that it is dependent on the PR of the hash functions.

I am afraid that if both are submitted in one PR, it will take too much energy to review it, and will slow down the process.

The codes are almost ready for release as a PR, but before I do so I want to do at least the following:

If a PR is opened, some of us could most likely help you.

Unfortunately I am not experienced with C interoperability.

What is needed for this? Would it be possible to avoid calling the C equivalent?

In the meantime I would appreciate any comments anyone may have on the proposed APIs in the markdown documents hash_functions.md and hash_maps.md.

It would be easier for people to comment on them if a PR were opened. Anyway, I will read them now.

wclodius2 commented 2 years ago

The identity of an entry in the map is determined only by the key. The name other_data could be changed, but it is intended to represent all ancillary data associated with the key; if you want to call it the "value" of the key, you can. The user would need to provide a procedure to map the other data to and from either an INT8 vector or a default character string.

wclodius2 commented 2 years ago

I should note that the API provides the ability to change the other data associated with a key.

wclodius2 commented 2 years ago

@jvdp1 I will submit the hash functions for a PR in about two weeks. I will wait until that PR has run its course before submitting the hash maps for a PR.

I want to learn to use C interoperability, so I need to sit down for a couple of hours with that part of the Fortran standard and a couple of Fortran texts. After that I need to:

  1. Download the original C/C++ codes.
  2. Set up CMakeLists so that cmake generates a makefile to compile the codes into a library.
  3. Write a code that calls both my hash functions in the Fortran library and the hash functions in the C library, applies them to the same keys, and compares the results to ensure that they yield identical hashes.
  4. Set up CMakeLists so that cmake links the libraries and the test codes.
  5. Run the test code to verify that my implementations are working.

If I can work about 4 hours a day I think I can do that in about two weeks.

jvdp1 commented 2 years ago

Nice plan @wclodius2 . I can't wait to see it ;)

wclodius2 commented 2 years ago

Reini Urban has now decided that Spooky Version 2 is better than version 1. I'll wait a bit for more details, but plan to revert to my Version 2 code.

jvdp1 commented 2 years ago

Reini Urban has now decided that Spooky Version 2 is better than version 1. I'll wait a bit for more details, but plan to revert to my Version 2 code.

I am not able to say which one is the best. However, my strategy would be to provide both, and let users choose the one they want to use.

wclodius2 commented 2 years ago

No, I think we should only provide the better of the two, and right now I think that is SpookyV2.

My status at the moment:

I have created validation tests (comparisons with the outputs of the original C/C++ codes) for the more complicated 64 bit hashes, pengy hash and SpookyV2, and for the more complicated 32 bit hashes, water hash, nmhash32, and nmhash32x. Pengy hash, SpookyV2, and water hash are passing all their tests, handling INT8 arrays of length 0 to 512. Both nmhash32 and nmhash32x are failing the simplest test, a zero length key array. They undoubtedly fail more than that, but I am focusing on that case for the moment. Note that the implementation of nmhash.h recently changed to V0.2, and that was the only source code I had available to compare with, so I had to reimplement the Fortran version.

For nmhash32 this test case has simple-to-follow logic. It reads the size of the key, does a simple assignment to two variables, and then calls a function to mix the bits of the two variables, returning the result. The C function uses macros to define different implementations of the mixing function depending on what, if any, vector instructions are available. The Fortran code implements the scalar version, so I reset the pertinent macro variable so that only the scalar version is compiled. No change. Currently I see nothing else that could cause problems. Things I might not be seeing:

I am going to set the nmhash32 code aside for a couple of days so I can look at it afresh. In the meantime, I think the code as a whole is useful as it is. I am going to check out the standard library tomorrow, create a branch, and incorporate the hash functions and documentation in the branch. I will then look at the nmhash32 procedures, and if I can fix them, great; but in any case I will upload the branch for a PR, and in the worst case I might be looking for suggestions for fixes to nmhash32 based on others' review of the code.

wclodius2 commented 2 years ago

I have fixed nmhash32 and nmhash32x so that they pass the validation tests, and have submitted the resulting code and documentation for a PR.

wclodius2 commented 2 years ago

The hash functions PR appears to be approaching acceptance. As a result I am thinking of getting my hash maps work ready for a PR. I currently have two hash maps working, one using open hashing, the other using chained hashing. They have the same API, differing only in the name of the hash map type. It is straightforward but tedious to convert both of them to be descendants of the same abstract type. I am slightly concerned that that change would introduce a level of indirection with a negative performance impact. Does anyone have a strong opinion on such a change?

jvdp1 commented 2 years ago

I only see this message now, sorry. I would be in favour of keeping them separate.