fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License
1.07k stars 166 forks source link

List of strings (implementation ideas) #322

Open ivan-pi opened 3 years ago

ivan-pi commented 3 years ago

I thought it might be best to create a new issue to discuss alternative implementations for the list of strings proposed in #268. As @arjenmarkus wrote there:

This proposal considers the features that would be useful for this derived type, a list of strings. It does not prescribe the methods for implementing the type. Given the ease by which arrays are handled in Fortran, it is possible to use allocatable arrays, but perhaps a linked list may prove to be more efficient, at least in certain scenarios of use. Therefore the proposal concentrates on the usage.

I thought it might be helpful to expand on this topic using some analogies with the C++ library. In C++, the standard template library provides the special container std::string as a sequence of characters supporting fast random access, and fast insert/delete at back of the container.

In Fortran we can reach similar behavior/usage using the character(len=:), allocatable type:

character(len=:), allocatable :: str

! Automatic allocation upon initialization
str = "Hello"

! "Insertion" at the back
str = str//", world!"    ! "Hello, world!"

! Fast random access
str(8:8) = "W"           ! "Hello, World!"

! "Deletion"
str = str(1:len(str)-1)  ! "Hello, World"

There are of course also very large differences between the C++ std::string and the Fortran allocatable character sequences in terms of memory management and other things. Some of the differences can be inferred from the numerous member functions of std::string for access, modification, and string operations.

Moving one step further to a "lists of strings" in C++ there are several options. For "fixed-size" lists one has:

// Built-in (fixed) size array
std::string colour[4] = { "Blue", "Red",
                              "Orange", "Yellow" };

// Array of strings (safer, and easier to use than the built-in array)
std::array<std::string, 4> colour { "Blue", "Red", "Orange",
                                     "Yellow" };

For dynamic lists of strings the most commonly used option in C++ is

// Vector of strings (the "go to" container)
std::vector<std::string> colour {"Blue", "Red", "Orange"};
// Strings can be added at any time with push_back
colour.push_back("Yellow")

However, the STL library also contains other specialized containers such as: std::deque, std::list, and std::forward_list. Quoting from the textbook C++ Primer:

The idea of having different containers is they offer different performance trade-offs relative to

  • The costs to add or delete elements to the container
  • The costs to perform non-sequential access to elements of the container

The performance aspects of different sequential containers are summarized neatly in this table copied out of the same book:

Container Description
vector Flexible-size array. Supports fast random access.
Inserting or deleting elements other than at the back may be slow.
deque Double-ended queue. Supports fast random access.
Fast insert/delete at front or back.
list Doubly linked list. Supports only bidirectional sequential access.
Fast insert/delete at any point in the list.
forward_list Singly linked list. Supports only sequential access in one direction.
Fast insert/delete at any point in the list.
array Fixed-size array. Supports fast random access.
Cannot add or remove elements.
string A specialized container, similar to vector, that contains characters.
Fast random access. Fast insert/delete at back.

The beauty of the C++ containers is they share many common operations. If you can avoid random access, and instead use only iterators to access the elements you can switch easily between some containers. A table of the supported operations can be found here.

I believe the current implementation in #311 is most similar to the std::vector and uses contiguous storage for the underlying string elements. The comments from @esterjo (https://github.com/fortran-lang/stdlib/issues/268#issuecomment-748220606 and https://github.com/fortran-lang/stdlib/issues/241#issuecomment-747184656) already address the fact that other implementations might be more suitable in some usage cases in order to avoid memory fragmentation or support other operations.

Given that there is not much precedence for such string containers in Fortran, I find it hard to judge which implementation is most suitable for typical Fortran usage cases. Getting some feedback from the community could help guide further (experimental) implementation efforts.

arjenmarkus commented 3 years ago

If we want alternative implementations, tuned to various usage scenarios, then a common API is definitely desirable. You would simply replace the type name by the one suitable for the intended use - quite possibly even by using the rename feature of a module ;) (and for the real, lazy programmers among us: use an intermediate module to even hide that renaming, so that there is only one location in the source code that determines which type will be actually used).

That said, I wonder how we can measure the consequences of the various implementations: performance is one thing and while not all that easy to get right, it is definitely measurable. But memory fragmentation is, I guess, a very different thing. I have no idea how you can measure that - or what a useful measure is.

We might make it part of the GSoC containers project or even a project in its own right!

Op ma 15 feb. 2021 om 21:04 schreef Ivan Pribec notifications@github.com:

I thought it might be best to create a new issue to discuss alternative implementations for the list of strings proposed in #268 https://github.com/fortran-lang/stdlib/issues/268. As @arjenmarkus https://github.com/arjenmarkus wrote there:

This proposal considers the features that would be useful for this derived type, a list of strings. It does not prescribe the methods for implementing the type. Given the ease by which arrays are handled in Fortran, it is possible to use allocatable arrays, but perhaps a linked list may prove to be more efficient, at least in certain scenarios of use. Therefore the proposal concentrates on the usage.

I thought it might be helpful to expand on this topic using some analogies with the C++ library. In C++, the standard template library provides the special container std::string as a sequence of characters supporting fast random access, and fast insert/delete at back of the container.

In Fortran we can reach similar behavior/usage using the character(len=:), allocatable type:

character(len=:), allocatable :: str

! Automatic allocation upon initialization str = "Hello"

! "Insertion" at the back str = str//", world!" ! "Hello, world!"

! Fast random access str(8:8) = "W" ! "Hello, World!"

! "Deletion" str = str(1:len(str)-1) ! "Hello, World"

There are of course also very large differences between the C++ std::string and the Fortran allocatable character sequences in terms of memory management and other things. Some of the differences can be inferred from the numerous member functions( http://www.cplusplus.com/reference/string/string/) of std::string for access, modification, and string operations.

Moving one step further to a "lists of strings" in C++ there are several options. For "fixed-size" lists one has:

// Built-in (fixed) size array std::string colour[4] = { "Blue", "Red", "Orange", "Yellow" }; // Array of strings (safer, and easier to use than the built-in array) std::array<std::string, 4> colour { "Blue", "Red", "Orange", "Yellow" };

For dynamic lists of strings the most commonly used option in C++ is

// Vector of strings (the "go to" container) std::vector colour {"Blue", "Red", "Orange"};// Strings can be added at any time with push_back colour.push_back("Yellow")

However, the STL library also contains other specialized containers such as: std::deque, std::list, and std::forward_list. Quoting from the textbook C++ Primer:

The idea of having different containers is they offer different performance trade-offs relative to

  • The costs to add or delete elements to the container
  • The costs to perform non-sequential access to elements of the container

The performance aspects of different sequential containers are summarized neatly in this table copied out of the same book: Container Description vector Flexible-size array. Supports fast random access. Inserting or deleting elements other than at the back may be slow. deque Double-ended queue. Supports fast random access. Fast insert/delete at front or back. list Doubly linked list. Supports only bidirectional sequential access. Fast insert/delete at any point in the list. forward_list Singly linked list. Supports only sequential access in one direction. Fast insert/delete at any point in the list. array Fixed-size array. Supports fast random access. Cannot add or remove elements. string A specialized container, similar to vector, that contains characters. Fast random access. Fast insert/delete at back.

The beauty of the C++ containers is they share many common operations. If you can avoid random access, and instead use only iterators to access the elements you can switch easily between some containers. A table of the supported operations can be found here http://www.cplusplus.com/reference/stl/.

I believe the current implementation in #311 https://github.com/fortran-lang/stdlib/pull/311 is most similar to the std::vector and uses contiguous storage for the underlying string elements. The comments from @esterjo https://github.com/esterjo (#268 (comment) https://github.com/fortran-lang/stdlib/issues/268#issuecomment-748220606 and #241 (comment) https://github.com/fortran-lang/stdlib/issues/241#issuecomment-747184656) already address the fact that other implementations might be more suitable in some usage cases in order to avoid memory fragmentation or support other operations.

Given that there is not much precedence for such string containers in Fortran, I find it hard to judge which implementation is most suitable for typical Fortran usage cases. Getting some feedback from the community could help guide further (experimental) implementation efforts.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/322, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR74MYKXW4MLOW2GSNLS7F46TANCNFSM4XVHVWXA .

arjenmarkus commented 3 years ago

Something I forgot to brig up here: I recently wrote a small module that reads in an entire text file into memory and then splits it up in individual lines. The memory used consists of one block of memory and pointers to the start and stop of the lines. My intention is to expand this, so that it can be used to get data from CSV files and the like. It relies on a different memory layout than the stringlist_type, because I know in advance how much memory is required.

Op di 16 feb. 2021 om 08:33 schreef Arjen Markus <arjen.markus895@gmail.com

:

If we want alternative implementations, tuned to various usage scenarios, then a common API is definitely desirable. You would simply replace the type name by the one suitable for the intended use - quite possibly even by using the rename feature of a module ;) (and for the real, lazy programmers among us: use an intermediate module to even hide that renaming, so that there is only one location in the source code that determines which type will be actually used).

That said, I wonder how we can measure the consequences of the various implementations: performance is one thing and while not all that easy to get right, it is definitely measurable. But memory fragmentation is, I guess, a very different thing. I have no idea how you can measure that - or what a useful measure is.

We might make it part of the GSoC containers project or even a project in its own right!

Op ma 15 feb. 2021 om 21:04 schreef Ivan Pribec <notifications@github.com

:

I thought it might be best to create a new issue to discuss alternative implementations for the list of strings proposed in #268 https://github.com/fortran-lang/stdlib/issues/268. As @arjenmarkus https://github.com/arjenmarkus wrote there:

This proposal considers the features that would be useful for this derived type, a list of strings. It does not prescribe the methods for implementing the type. Given the ease by which arrays are handled in Fortran, it is possible to use allocatable arrays, but perhaps a linked list may prove to be more efficient, at least in certain scenarios of use. Therefore the proposal concentrates on the usage.

I thought it might be helpful to expand on this topic using some analogies with the C++ library. In C++, the standard template library provides the special container std::string as a sequence of characters supporting fast random access, and fast insert/delete at back of the container.

In Fortran we can reach similar behavior/usage using the character(len=:), allocatable type:

character(len=:), allocatable :: str

! Automatic allocation upon initialization str = "Hello"

! "Insertion" at the back str = str//", world!" ! "Hello, world!"

! Fast random access str(8:8) = "W" ! "Hello, World!"

! "Deletion" str = str(1:len(str)-1) ! "Hello, World"

There are of course also very large differences between the C++ std::string and the Fortran allocatable character sequences in terms of memory management and other things. Some of the differences can be inferred from the numerous member functions( http://www.cplusplus.com/reference/string/string/) of std::string for access, modification, and string operations.

Moving one step further to a "lists of strings" in C++ there are several options. For "fixed-size" lists one has:

// Built-in (fixed) size array std::string colour[4] = { "Blue", "Red", "Orange", "Yellow" }; // Array of strings (safer, and easier to use than the built-in array) std::array<std::string, 4> colour { "Blue", "Red", "Orange", "Yellow" };

For dynamic lists of strings the most commonly used option in C++ is

// Vector of strings (the "go to" container) std::vector colour {"Blue", "Red", "Orange"};// Strings can be added at any time with push_back colour.push_back("Yellow")

However, the STL library also contains other specialized containers such as: std::deque, std::list, and std::forward_list. Quoting from the textbook C++ Primer:

The idea of having different containers is they offer different performance trade-offs relative to

  • The costs to add or delete elements to the container
  • The costs to perform non-sequential access to elements of the container

The performance aspects of different sequential containers are summarized neatly in this table copied out of the same book: Container Description vector Flexible-size array. Supports fast random access. Inserting or deleting elements other than at the back may be slow. deque Double-ended queue. Supports fast random access. Fast insert/delete at front or back. list Doubly linked list. Supports only bidirectional sequential access. Fast insert/delete at any point in the list. forward_list Singly linked list. Supports only sequential access in one direction. Fast insert/delete at any point in the list. array Fixed-size array. Supports fast random access. Cannot add or remove elements. string A specialized container, similar to vector, that contains characters. Fast random access. Fast insert/delete at back.

The beauty of the C++ containers is they share many common operations. If you can avoid random access, and instead use only iterators to access the elements you can switch easily between some containers. A table of the supported operations can be found here http://www.cplusplus.com/reference/stl/.

I believe the current implementation in #311 https://github.com/fortran-lang/stdlib/pull/311 is most similar to the std::vector and uses contiguous storage for the underlying string elements. The comments from @esterjo https://github.com/esterjo (#268 (comment) https://github.com/fortran-lang/stdlib/issues/268#issuecomment-748220606 and #241 (comment) https://github.com/fortran-lang/stdlib/issues/241#issuecomment-747184656) already address the fact that other implementations might be more suitable in some usage cases in order to avoid memory fragmentation or support other operations.

Given that there is not much precedence for such string containers in Fortran, I find it hard to judge which implementation is most suitable for typical Fortran usage cases. Getting some feedback from the community could help guide further (experimental) implementation efforts.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/322, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR74MYKXW4MLOW2GSNLS7F46TANCNFSM4XVHVWXA .