fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License
1.03k stars 163 forks source link

Abstract base class for map types #479

Open awvwgk opened 3 years ago

awvwgk commented 3 years ago

Description This issue should track a potential addition of a map class to stdlib. I think it would be preferable to first define an abstract base class which covers most of our needs. Here are some points I found important when working on TOML Fortran:

Examples

Abstract base class used in TOML Fortran to define maps ```f90 type, abstract :: map_class contains !> Find a value based on its key procedure(find), deferred :: find !> Push back a value to the structure procedure(push_back), deferred :: push_back !> Get list of all keys in the structure procedure(get_keys), deferred :: get_keys !> Delete a value at a given key procedure(delete), deferred :: delete !> Destroy the data structure procedure(destroy), deferred :: destroy end type abstract interface !> Find a value based on its key subroutine find(self, key, ptr) import :: map_class, value_type !> Instance of the structure class(map_class), intent(inout), target :: self !> Key to the value character(len=*), intent(in) :: key !> Pointer to the stored value at given key type(value_type), pointer, intent(out) :: ptr end subroutine find !> Push back a value to the structure subroutine push_back(self, val) import :: map_class, value_type !> Instance of the structure class(map_class), intent(inout), target :: self !> Value to be stored type(value_type), allocatable, intent(inout) :: val end subroutine push_back !> Get list of all keys in the structure subroutine get_keys(self, list) import :: map_class, string_type !> Instance of the structure class(map_class), intent(inout), target :: self !> List of all keys type(string_type), allocatable, intent(out) :: list(:) end subroutine get_keys !> Delete a value at a given key subroutine delete(self, key) import :: map_class, value_type !> Instance of the structure class(map_class), intent(inout), target :: self !> Key to the value character(len=*), intent(in) :: key end subroutine delete !> Deconstructor for data structure subroutine destroy(self) import :: map_class !> Instance of the structure class(map_class), intent(inout), target :: self end subroutine destroy end interface ```
milancurcic commented 2 years ago

Is map the same as dictionary (e.g. Python's dict) or do they differ in some ways?

In Python I work with this a lot. In Fortran, not really. But this is one of the features that I think, even if not presently used a lot, would open up many possibilities for development of high level libraries. For example, databases (SQL-style or otherwise), various file formats, parsing HTML, SVG, and other kinds of XML, HTTP clients and servers.

Fortran-native approach is the way to go.

An alternative, perhaps for its own package rather than stdlib, would be Fortran bindings for Redis, which provides high performance in-memory dicts and lists.

arjenmarkus commented 2 years ago

I have worked on a Fortran interface to Redis, but I ran into odd problems that I simply did not understand. I would have to look up my notes, but if there is interest I can revive it ;).

Op wo 25 aug. 2021 om 17:00 schreef Milan Curcic @.***>:

Is map the same as dictionary (e.g. Python's dict) or do they differ in some ways?

In Python I work with this a lot. In Fortran, not really. But this is one of the features that I think, even if not presently used a lot, would open up many possibilities for development of high level libraries. For example, databases (SQL-style or otherwise), various file formats, parsing HTML, SVG, and other kinds of XML, HTTP clients and servers.

Fortran-native approach is the way to go.

An alternative, perhaps for its own package rather than stdlib, would be Fortran bindings for Redis https://redis.io/, which provides high performance in-memory dicts and lists.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/479#issuecomment-905582297, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR7BPOKEAPHPCNWKYEDT6UAQFANCNFSM5BIOAAUQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

awvwgk commented 2 years ago

I was thinking C++ ordered map, but Python dictionary is the same thing.

Having an abstract base class would allow to implement the map type with any backend, stdlib would only provide the base class and pure Fortran implementations. Other (fpm) projects could extend the base class and provide their own optimized implementation.

The advantage of a single well-designed abstract base class would be that projects like TOML Fortran don't have to implement their own, but you could simply load a TOML document into a stdlib map type and use it with any other library which supports stdlib's maps. Designed correctly a library could also load data directly into an optimized map implementation.

FObermaier commented 2 years ago

For the record, are you aware of fhash?

wclodius2 commented 2 years ago

FWIW I have just released a proposed hash map API on the website with my proposed hash function API. The API defines three modules: stdlib_32_bit_key_data_wrapper, stdlib_chaining_hash_map and stdlib_open_hash_map.

The stdlib_32_bit_key_data_wrapper module does the following: defines an abstract interface for hash functions, provides a wrapper for some hash functions so they match that interface; defines a key datatype, key_type whose contents are currently not private, but can be made abstract with setters and getters that take INT8 vectors and character strings as arguments; defines another datatype, other_type, for data that supplements the key, whose contents are currently not private, but can be made abstract with setters and getters that take INT8 vectors and character strings as arguments.

The module stdlib_chaining_hash_map implements a datatype, chaining_hash_map_type, for a simple separate chaining hash map. The API provides most of the functionality @awvwgk requests except it doesn't provide a procedure for returning a list of all keys, but that procedure is easy to implement.

The module stdlib_open_hash_map implements a datatype, open_hash_map_type, for a simple open addressing hash map with linear insertion. The API provides most of the functionality @awvwgk requests except it doesn't provide for returning a list of all keys, but that is easy to implement, and it doesn't provide a procedure for deleting elements. It is possible to provide an element deletion procedure for an open addressing hash map, but it is tricky to implement and the runtime cost is significantly higher than for a chaining hash map.

The API's for the two data types are very similar. The init_map procedure for open_hash_map_type has an additional load_factor argument missing from the chaining_hash_map_type. The open_hash_map_type also has an additional load_factor inquiry function. The chaining_hash_map_type also has the remove_entry subroutine that open_hash_map_type lacks.

The two APIs also have some inquiry functions on the structure and history of the hash maps that could be eliminated. In particular I think the total_depth function is less useful to casual users.

wclodius2 commented 2 years ago

FWIW I have just changed the implementation of key_type and other_type to be opaque with getters and setters and have updated hash_maps.md accordingly.

LKedward commented 2 years ago

For the record, are you aware of fhash?

There's also my own implementation of the same name: LKedward/fhash. (My focus for that project was mostly on providing a nice simple API without preprocessing - the underlying implementation is nothing special.)

awvwgk commented 2 years ago

Would it make sense to introduce an abstract base class which both implementation inherit from? Except for the delete feature which is absent in the open addressing implementation the APIs seem pretty much similar.

wclodius2 commented 2 years ago

@FObermaier the flash you refer to is based on the gcc hash table and, as a result, is probably covered by the GNU LGPL license, which I believe is not compatible with the MIT license used by the Fortran Standard Library.

wclodius2 commented 2 years ago

@awvwgk I have tried to keep the APIs for the hash tables very similar. The init procedures differ, but that is because I provide additional features through optional arguments that most users would not use. I should probably reduce it to four arguments: map specifying the hash table to be initialized, hasher specifying the hash function to be used by the table, and the optional slots_bits specifying the initial size of the table, and status for reporting errors with the initialization. This drops the max_bits and load_factor arguments. If load_factor is a constant then I probably don't need the load_factor inquiry function for the open_hash_map_type. I also have figured out how to implement the remove_entry procedure for the open_hash_map_type and plan to implement it. With those changes the APIs would be identical, and giving them a common base class would be straight forward, but tedious.

FObermaier commented 2 years ago

the fhash you refer to is based on the gcc hash table and, as a result, is probably covered by the GNU LGPL license, which I believe is not compatible with the MIT license used by the Fortran Standard Library.

No, it is not based on it, it is an implemention of the GCC hashmap structure in Fortran. Not a port of the GCC code. I think that is fine and it is MIT licensed, too.

LecrisUT commented 1 year ago

Any news on this?

jvdp1 commented 1 year ago

@LecrisUT Hash maps have been implemented in stdlib. See here for more detail. Is it what you are looking for?

LecrisUT commented 1 year ago

The documentation is immense and hard to navigate. Is there a rosetta stone of how to do the python equivalent of:

dict = {}
dict.__contains__("key")
dict["key"] = 2
dict.pop("key")
jvdp1 commented 1 year ago

@LecrisUT I agree with you that the specs are quite difficult to follow and is not written for users. Unfortunately, I never found the time to write a tutorial. Anyway, I just wrote a small example for you (see below). You can easily compile it with fpm (I used fpm v0.9.0 with the dependency stdlib="*"). Is such an example useful for you?

program main
 use stdlib_kinds, only: int8
 use stdlib_hashmap_wrappers, only: fnv_1_hasher, key_type, other_type, set, get
 use stdlib_hashmaps, only: chaining_hashmap_type, default_bits
 implicit none
 integer, parameter :: nentries = 50

 type dummy_type
  integer :: i
  real :: myval(4)
 end type dummy_type

 integer :: i
 logical :: conflict, exists
 type(dummy_type) :: dummy

 type(key_type) :: key
 type(other_type) :: other
 type(chaining_hashmap_type) :: map
 class(*), allocatable :: data

 call map%init(fnv_1_hasher)

 !Storing data
 do i = 1, nentries
  dummy%i = i
  call random_number(dummy%myval)
  call set(key, [transfer("dummy", 1_int8, len("dummy")), transfer(i, 1_int8)])
  call set(other, dummy)
  call map%map_entry(key, other, conflict)
  if(conflict) error stop "Unable to map entry because of a key conflict"
 enddo

 write(*, '(a)')'Querying table info...'
 write(*, '(a,t40,i0)')'Number of buckets allocated: ',map%num_slots()
 write(*, '(a,t40,i0)')'Number of key-value pairs stored: ',map%entries()
 write(*, '(a,t40,i0)')'The worst case bucket depth is ',map%total_depth()

 !Retrieving data
 do i = 1, nentries
  call set(key, [transfer("dummy", 1_int8, len("dummy")), transfer(i, 1_int8)])
  call map%get_other_data(key, other, exists)
  if(.not.exists)write(*,'(a)')'Warning: missing key other'

  call get(other, data)

  select type (data)
  type is (dummy_type)
   print *, 'Other data % i = ', data%i
   print *, 'Other data % myval = ', data%myval
  class default
   print *, 'Invalid data type in other'
  end select
 enddo

end program main
LecrisUT commented 1 year ago

So to get this straight:

That is quite a lot of boilerplate, but at least it manages to get me started. Will there be a simpler interface some day?

jvdp1 commented 1 year ago

I think you got the idea.

So to get this straight:

  • other_type = value_type
  • you have to manually create values for the objects key and other using the set() function so that they can be used by the dictionary
    • what's the [transfer ...] doing?

set only supports int8 arrays or character. Therefore, transfer is used to copy the bitwise representation of the original key to the int8 array. The user can write her/his own code for her/his specific variable.

Note that in my example, my key is composed of both character and integer.

  • why not at have set as a method of key_type/other_type?

Actually the set is not required. For example, the following lines could be replaced by:

 call set(key, [transfer("dummy", 1_int8, len("dummy")), transfer(i, 1_int8)])
 call set(other, dummy)
 call map%map_entry(key, other, conflict)

by (with key aliasing key_type and other aliasing other_type):

  call map%map_entry(key([transfer("dummy", 1_int8, len("dummy")), transfer(i, 1_int8)]), other(dummy), conflict)
  • map_entry == operator[]
  • and then just reverse the operation for get

That is quite a lot of boilerplate, but at least it manages to get me started. Will there be a simpler interface some day?

I guess it is. In my opinion, it gives a lot of (too much) flexibility to the users. I don't think there are plans to simplify the interfaces. However, wrappers could be added, and we are open to feedbacks from users to improve it.

LecrisUT commented 1 year ago

Well, one thing I couldn't find is how to store and retrieve pointers instead of values

jvdp1 commented 1 year ago

Well, one thing I couldn't find is how to store and retrieve pointers instead of values

Do you mean something like that:

 input = 'aaaarrrr'

 output => input

 call map%map_entry(key_type([transfer("pointer", 1_int8, len("pointer"))]), other_type(output), conflict)

 call map%get_other_data(key_type([transfer("pointer", 1_int8, len("pointer"))]), other, exists)

 call get(other, data)

 select type (data)
 type is (character(*))
  print *, 'character = ', data
 class default
  print *, 'Invalid data type in other'
 end select
LecrisUT commented 1 year ago

I think it is when using non-intrisic data types so that it can only use class(*), pointer which cannot upcast to class(my_type), pointer and then it complains that it can't find an implementation for get(). Probably I can just overload get() to dynamically cast the pointer.

Fortran language really needs templating to cut down all of the boilerplate.

jvdp1 commented 1 year ago

I think it is when using non-intrisic data types so that it can only use class(*), pointer which cannot upcast to class(my_type), pointer and then it complains that it can't find an implementation for get(). Probably I can just overload get() to dynamically cast the pointer.

I think overloading it is indeed the right way.

Fortran language really needs templating to cut down all of the boilerplate.

I agree with you.

LecrisUT commented 1 year ago

Took me a bit of digging, but I found an issue with the previous example:

! This one only sets transfers "d"
call set(key, [transfer("dummy", 1_int8)])
! Should be
call set(key, [transfer("dummy", 1_int8, len("dummy")])
jvdp1 commented 1 year ago

@LecrisUT Good catch! Thank you. I'll edit my posts. Sorry for the troubles!