Linked list - Githubissues

milancurcic commented 4 years ago

Problem

Linked list is one of the essential data structures beside an array. It allows you to add, insert, or remove elements in constant time, without re-allocating the whole structure.

Fortran doesn't have a linked list. There are 3rd party libraries, but no obvious go-to solution. Fortran stdlib should have a linked list. I would use it.

Examples

FLIBS by @arjenmarkus
fortran-list by @LadaF
flist by @jacobwilliams
PolyCon by @cmacmackin
Petaca by @nncarlson
Modern Fortran Explained by MRC has an implementation in Appendix C
Many others?

What kind of data can the linked list hold?

There's various levels of capability we could pursue:

Single type: Basically just like an array, but allows insertion in constant time;
Elements can be of any intrinsic type in a single list;
Can take intrinsic type and user-defined derived types (is this even possible in current Fortran?)

API

I don't know, something like this?


use stdlib_experimental_collections, only :: List
type(List) :: a = List()

call a % append(42)
call a % append(3.141)
call a % append('text')
print *, a % get(2) ! prints 3.141
call a % remove(3) ! a is now List([42, 3.141])
call a % insert(2, 'hello') ! a is now List([42, 'hello', 3.141])

a = List([1, 2, 3]) ! instantiate a list from an array

tclune commented 2 years ago

@misael-diaz You may also have inadvertently helped to identify a bug in my set implementation. I tried the analogous O(N^2) clean up process and it crashes my code for the larger cases. Have not yet tracked down just what is going wrong though. There is an extensive set of use cases, and my real world cases are actually modest in size (metadata for an Earth system model.)

misael-diaz commented 2 years ago

@tclune glad to be of some help and thanks again for sharing the specs. I am using Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz, GCC 10.3.0, and no optimizations enabled. I shall try the O(N) implementation which you've suggested and estimate the speed-up. I am sure that you will be able to patch your code soon. Bests

tclune commented 2 years ago

May not be a bug in the container itself after all. Same code works fine with NAG compiler. If it's a compiler bug, it might be real fun to fix given that it does not happen for smaller containers.

certik commented 2 years ago

For 10^6 elements, my IntegerSet container builds in 2 seconds and destroys in 0.21 seconds. If I insert the elements in order, the construction time reduces to ~1 second.

The following code seems equivalent (to construct in order) and has the following performance on my Apple M1:

$ ./a.out 
 construction time:    9.65399947E-03
 deconstruction time:    9.49999958E-05

Code:

program main
use, intrinsic :: iso_fortran_env, only: i8=>int64, dp=>real64
implicit none

real(dp), allocatable :: s(:)
integer :: i, j, n
real(dp) :: x

integer(i8) :: c0, c1, cr

n = 10**6
allocate(s(n))

call system_clock(c0, cr)
do i = 1, N
    call random_number(x)
    j = 1 + floor(N*x)
    s(i) = i ! Was this meant to be "j"?
end do
call system_clock(c1)

print*,'construction time: ', real(c1-c0)/cr

call system_clock(c0)
deallocate(s)
call system_clock(c1)
print*,'deconstruction time: ', real(c1-c0)/cr
end program main

To insert in reverse order (as the code above), just apply the following patch:

--- a.f90   2021-11-12 15:43:48.000000000 -0700
+++ b.f90   2021-11-12 15:48:22.000000000 -0700
@@ -15,7 +15,7 @@
 do i = 1, N
     call random_number(x)
     j = 1 + floor(N*x)
-    s(i) = i ! Was this meant to be "j"?
+    s(N-i+1) = i ! Was this meant to be "j"?
 end do
 call system_clock(c1)

and this runs in:

$ ./a.out
 construction time:    9.34099965E-03
 deconstruction time:    9.79999968E-05

@tclune, do you have some benchmark where a list is the optimal choice for performance reasons?

tclune commented 2 years ago

Not sure the array case is very relevant here - it has different properties. Even my Set example was only relevant in that it has a similar complexity for deletion. Insertion for Set is quite different than Insertion for List.

I do not readily know of a case where List would be the optimal solution, but surely such exists out there. It would have to be an example where insertions happen in the middle and that there is already a pointer into the relevant position in the list. (Otherwise the search for the location becomes the expensive step.) A synthetic example would be where all values < X must go before all values > X, and that the next value must be placed at the boundary between those two. So you are constantly inserting in the "middle" of the list but not having to search more than one hop in either direction. An Array or Vector would require lots of copies of the latter half of entries as data accumulates. A set would not maintain the desired ordering. But two vectors would accomplish the same very efficiently, if you are allowed to interpret the order of one as reversed.

misael-diaz commented 2 years ago

Thank you @tclune and @certik for your instructive comments. I have been able to implement the O(N) list destructor without incurring on memory leaks. The runtime has improved remarkably and now it only takes a fraction of a second to generate and destroy the list.

head: 0
tail: 65535
==16043== 
==16043== HEAP SUMMARY:
==16043==     in use at exit: 0 bytes in 0 blocks
==16043==   total heap usage: 196,616 allocs, 196,616 frees, 2,360,416 bytes allocated
==16043== 
==16043== All heap blocks were freed -- no leaks are possible
==16043== 
==16043== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

real    0m0.818s
user 0m0.762s
sys 0m0.056s

certik commented 2 years ago

@tclune yes, so far every example where a linked list would perform well that I figured out can always be implemented using arrays/vectors more efficiently.

milancurcic commented 2 years ago

@certik, I think to make the comparison equivalent, you need to grow the array in the loop, like a list or set is doing. So:

...
allocate(s(0))

call system_clock(c0, cr)
do i = 1, N
    s = [s, real(i, dp)]
end do
call system_clock(c1)
...

which is very slow. Of course, you can alleviate by using a vector, but at that point you're introducing a new structure.

One use case that comes to mind for structures with fast insertion like lists is adaptive refining and coarsening of unstructured meshes.

certik commented 2 years ago

Yes, my point would be that you don't want to do that. If you don't know the size of the loop, then one can double the allocation and make a copy, like std::vector::push_back() does. That's actually very efficient.

Do you know if a Fortran compiler is allowed to implement s = [s, real(i, dp)] by doubling the allocation and copying data? If so, then it would be as efficient as std::vector.

One use case that comes to mind for structures with fast insertion like lists is adaptive refining and coarsening of unstructured meshes.

Indeed, there are several ways to implement those. But I don't think you would use the list data structure as discussed above, I think typically you would design something specific for your mesh.

milancurcic commented 2 years ago

Do you know if a Fortran compiler is allowed to implement s = [s, real(i, dp)] by doubling the allocation and copying data? If so, then it would be as efficient as std::vector.

Under the hood I don't see why not. As long as for the user the array behaves as per the standard, I think the compiler should be free to do any kind of smart thing to make it efficient (like a vector for example).

gronki commented 2 years ago

As a side note I would like to mention that I have strong objections whether s=[s, new] is a good coding pattern for appending an element to an array in Fortran. It looks nice and simple which is tempting but it relies on default assignment and behaves differently when the assignment operator is overloaded. Which introduces like of consistency. So speaking of implementation, I would argue that there are good reasons for this pattern to be discouraged and replaced by a procedure that behaves consistently regardless of what type "s" is.

Dominik

sob., 13 lis 2021, 19:42 użytkownik Ondřej Čertík @.***> napisał:

Yes, my point would be that you don't want to do that. If you don't know the size of the loop, then one can double the allocation and make a copy, like std::vector::push_back() does. That's actually very efficient.

Do you know if a Fortran compiler is allowed to implement s = [s, real(i, dp)] by doubling the allocation and copying data? If so, then it would be as efficient as std::vector.

One use case that comes to mind for structures with fast insertion like lists is adaptive refining and coarsening of unstructured meshes.

Indeed, there are several ways to implement those. But I don't think you would use the list data structure as discussed above, I think typically you would design something specific for your mesh.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/68#issuecomment-968115953, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC4NA3P2TAFTRBCN6RPH3ZLUL2WPVANCNFSM4KCFV36A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

tclune commented 2 years ago

I suspect no Fortran compilers do this. For multidimensional cases, the risk of oversubscribing memory is too great, and even in the 1D case, the compiler should not risk overallocation unless it is confident that the memory won't be needed for something else. Maybe there are special cases where the compiler can analyze the final outcome of the loop, but ... then its really just reimplementing your loop.

certik commented 2 years ago

It looks like a dedicated method push_back (in stdlib!) might be the way to go. The compiler can then optimize it like std::vector works by doubling the allocation if needed.

tclune commented 2 years ago

Not sure about "compiler can optimize". The algorithm is baked into the implementation of STL vectors. The vector interface does not really give the compiler any hint of how the container will be used.

Also note that there are other mechanisms for growing a vector. One can specify a minimum size even after the container is partially filled. This might be useful if you know you are memory constrained and don't want to risk the default doubling.

certik commented 2 years ago

Not sure about "compiler can optimize". The algorithm is baked into the implementation of STL vectors. The vector interface does not really give the compiler any hint of how the container will be used.

If push_back is an stdlib routine, then compilers can understand the meaning of it, which is they can optimize the use of it by reallocating the array.

KoldMen commented 2 years ago

I have been programming in Fortran since 1981. So I guess I know a thing or two about Fortran. I do a lot of high performance computing. I use Fortran for computations and write my GUI in VB, C#, Java, and had used Pascal and QBasic in their hey days for the front end.

I use a lot of graphs and data structures in these languages. Lists are the more commonly used than arrays. One can keep adding to a list without predefining its length. This is important for system programmers. I have never missed lists in my computational engines, except when trying to use stacks. I have my own implementation of a stack where I create a new array and copy the old one when the stack size increases beyond the allocated size. I know all about inefficiency. But then try replacing a stack with an array in any algorithm.

Should Fortran have lists, the answer is YES. Should it be super efficient ? The answer is NO.

Allow the end users like me make the choice. Not all programmers would be writing weather forecasting routines. Those who write will never use a list.

List is a basic feature of modern languages. If you don't want to implement, then you are essentially limiting Fortran to a special use, namely Computations. I am okay with this. But then don't cry that the new generation is not using Fortran.

Add modern features like lists or die.

The whole arguments above seem to miss the point that we stopped teaching Fortran formally in my University and elsewhere more than 20 years ago. We teach Python to freshmen, and they love it. An avid programmer friend from NIST advised me to shift to Python since Python has so many libraries. And here you are, arguing about implementing a simple list. So funny.

Anyway, thanks for trying to build the stdlib, even if it is 30 years too late.

certik commented 2 years ago

@KoldMen I agree with your post. One nuance is that above we are arguing about the API of a list data structure in stdlib built using Fortran. Notice that in Python the list is part of the language itself, it is not implemented in Python. In LFortran we have actually recently added List as a first class feature into the intermediate representation, and it will get very efficient implementation in the backends. We have not exposed it in the frontend (syntax) yet, but one option is to simply recognize a list from stdlib, and transform it into the List node in the intermediate representation (instead of using the Fortran implementation from stdlib), and thus it would get excellent performance with LFortran. Other compiles could either do the same, or they could just use the actual implementation from stdlib (slower, but it would work).

We also have dictionaries, tuples and sets in LFortran. So once stdlib has an implementation of those, we can hook them in in the frontend.

awvwgk commented 2 years ago

Now that we have hash maps in stdlib, it would be nice to also have lists for arbitrary types. This would allow IO libraries to directly export the data they are reading in a stdlib compatible format.

beddalumia commented 2 years ago

@milancurcic

Me too, but how do you do it? I thought that appending to an array always re-allocates on heap, e.g.: integer :: i integer, allocatable :: a(:) a = [integer ::] do i = 1, 100000000 a = [a, i] ! re-allocates a on every append end do It's okay, for small-to-moderate arrays, but for very large ones, isn't it crippling?

and

... allocate(s(0)) call system_clock(c0, cr) do i = 1, N s = [s, real(i, dp)] end do call system_clock(c1) ... which is very slow. Of course, you can alleviate by using a vector, but at that point you're introducing a new structure.

Sorry for going slightly off topic (although I see I'm not the only one pushing for arrays instead of lists, for the case of homogeneous collections), but I parsed (admittedly fast) the whole thread and surprisingly did not find a mention to F2003 intrinsic move_alloc which makes for a way faster alternative to the badly performant "automatic reallocation" you mentioned in these two messages (which remarkably stay respectively on top and quite on the tail of the discussion).

I actually did not know about automatic allocations with array constructors, but have implemented in the past a compressed-sparse-row grower using move_alloc and found it quite efficient. So now, reading here, I got surprised about your viewpoint and went on the discourse to find a nice post which compares systematically the two approaches (and something more in the comments), so you might want to check it out.
Incidentally I think move_alloc should probably be the core functionality to implement whatever variation of the std::vector::push_back() strategy that @certik is proposing (one could go more flexible than just always doubling the size...). The discourse thread also briefly mentions this and points to an interesting implementation in FLIBS.

Aside of all of this, I also agree that the user should have a choice and adding linked lists in stdlib may be beneficial for many people, just wanted to go a little off-track and point out this alternative to the readers (like me 🙃). But maybe I would prefer an inhomogeneous one, following @awvwgk's idea of simplifying IO operations (very much like Matlab does with its cell arrays, that are returned whenever an IO intrinsic operates on inhomogeneous files). The idea would then be quite "classic": homogenous collections go into (dynamic) arrays, inhomogeneous ones into lists.

gronki commented 1 year ago

Dear Ondrej,

In LFortran we have actually recently added List as a first class feature into the intermediate representation, and it will get very efficient implementation in the backends. We have not exposed it in the frontend (syntax) yet, but one option is to simply recognize a list from stdlib, and transform it into the List node in the intermediate representation (instead of using the Fortran implementation from stdlib), and thus it would get excellent performance with LFortran. Other compiles could either do the same, or they could just use the actual implementation from stdlib (slower, but it would work).

We also have dictionaries, tuples and sets in LFortran. So once stdlib has an implementation of those, we can hook them in in the frontend.

Simply wow. Hats off. This is the kind of change that this language needs.

certik commented 1 year ago

Thank you @gronki!

The LPython compiler (which shares the middle end and backends with LFortran) has lists and on our initial preliminary benchmarks it seems faster than Clang/GCC's std::vector (with all optimizations on) for most people:

The same with dictionaries against std::unordered_map:

https://github.com/lcompilers/lpython/issues/1000

So LFortran already has this capability; all that is needed is to expose it via some syntax.

Our experience with LPython is to use regular Python syntax for our fast features and provide a CPython implementation. That seems to work really well. In the same way, as indicated above, we can expose these nice LFortran features via regular (existing) Fortran syntax and provide a Fortran implementation. LFortran would recognize it and use List.

That still leaves the door open to also create an extension of the Fortran language, we can still do that.

@milancurcic, @everythingfunctional if you want to move this forward, let's get a usable List implementation into stdlib, and then we can teach LFortran about it. Everything is ready from my side, we just need the "syntax" in stdlib.

I recommend not to use linked list, but rather store the length, capacity and the data, just like Python, or std::vector does it. LFortran also does it that way.

everythingfunctional commented 1 year ago

That's a neat way to do it. I can implement a list for stdlib. It shouldn't be hard. However, what should the API look like? I'd lean towards something like

type :: list
  private
  ! store values and whatever else makes sense
contains
  procedure :: item
end type

interface list
  pure module function from_array(values)
    real, intent(in) :: values(:)
    type(list) :: from_array
  end function

  pure module function of_size(size, init, default)
    integer, intent(in) :: size
    real, intent(in), optional :: init !! fill all elements with this value if provided
    real, intent(in), optional :: default !! expand size of list and provide this value if asked for non-existent element
    type(list) :: of_size
  end function
end interface

interface
  module function item(self, position)
    class(list), intent(inout) :: self !! allows for expansion of size if necessary
    integer, intent(in) :: position
    real, pointer :: item !! pointer allows to appear on lhs of assignment
  end function
end interface

milancurcic commented 1 year ago

There's an implementation in #491. It needs some work (mainly docs and tests) to get it through the finish line. I suggest discussing and adjusting (if needed) the API there.

everythingfunctional commented 1 year ago

There's an implementation in #491. It needs some work (mainly docs and tests) to get it through the finish line. I suggest discussing and adjusting (if needed) the API there.

That is I suppose a place to start the conversation, but I think it has issues, such as:

It's a linked list, and the interface seems (to me) to imply that. And @certik indicated we probably shouldn't do that.
It's unlimited polymorphic. My suspicion is that we would rather have lists of a specific type to avoid the overhead and clunky usage of class(*) things.

@certik , what is the API for the list that LFortran uses?

arjenmarkus commented 1 year ago

During the GSoC project we have seen that the implementation is actually quite fast. But of course, there may be better ways to achieve the functionality.

Op di 18 okt. 2022 om 03:33 schreef Brad Richardson < @.***>:

There's an implementation in #491 https://github.com/fortran-lang/stdlib/pull/491. It needs some work (mainly docs and tests) to get it through the finish line. I suggest discussing and adjusting (if needed) the API there.

That is I suppose a place to start the conversation, but I think it has issues, such as:

It's a linked list, and the interface seems (to me) to imply that. And @certik https://github.com/certik indicated we probably shouldn't do that.

It's unlimited polymorphic. My suspicion is that we would rather have lists of a specific type to avoid the overhead and clunky usage of class(*) things.

@certik https://github.com/certik , what is the API for the list that LFortran uses?

— Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/68#issuecomment-1281695112, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR7OCG2GSDEQSKRATTLWDX457ANCNFSM4KCFV36A . You are receiving this because you were mentioned.Message ID: @.***>

milancurcic commented 1 year ago

I think there are a few different conversations here (lists vs. vectors and API vs. implementation). Some comments:

This is a thread for linked lists. I think we should start a new thread for a vector to keep efforts more focused.
There seems to be more appetite (around Fortran-lang) for vectors than for lists. For me as well, vectors are more useful than lists.
There has been almost no interest so far in what an API for a list would look like. For example, see #463 and there were several Discourse posts by Chetan with little to no feedback. Maybe there's more interest now.
I also prefer a single intrinsic typed list over class(*). It makes working with a list much easier, no select type stuff as already pointed out by Brad, and I have to date not had a need in Fortran to mix types in a list (and very rarely in Python).

certik commented 1 year ago

Let's keep the discussion here to have it in one place, it's all related.

In LPython we don't allow mixing types in a list, all items have to be of the same type. Consequently also in LFortran.
We can allow mixing types in stdlib, and then LFortran would only optimize lists of the same type, and if you mix types, it would not be optimized; I would lean to not allow mixing of types.
the term "list" in my mind visualizes Python list like a = [1, 2, 3] and the operations that you can do on it, like "append" and "extent" and concatenation a = [1, 2, 3] + [10, 11]. However, Python underneath does not use linked list, but rather implements the list using a size, capacity and a pointer. So I would recommend using the term "list" for this datastructure. If we want to specify a particular implementation, we can have both List and LinkedList. Using the term "vector" for this I think is confusing, because vector in this sense is the same as an array and Fortran already has arrays. So I would stick with Python's terminology, which is "list" (cheap appending) and "array" (NumPy array, for our purposes equivalent to Fortran arrays; expensive appending). (In my mind, the C++ implementation of the list is "std::vector". There is also std::list which I think is a linked list and I don't think I've ever used it, as it is very slow. I recommend using Python, not C++ terminology.)
We have to figure out the API of a List, that is the main task.
We can implement any API into LFortran's frontend; the intermediate representation uses the following API: it has statement nodes and expression nodes and a type List which has a type of the element, such as List[f64] (using LPython's syntax), in ASR notation (here is the full ASR so that you can see the context of the nodes below: https://github.com/lfortran/lfortran/blob/383831d7efb9b5b1bc00c74e928c8997b7470a95/src/libasr/ASR.asdl):
```
stmt
| ListAppend(expr a, expr ele)
| ListInsert(expr a, expr pos, expr ele)
| ListRemove(expr a, expr ele)
| ListClear(expr a)
```

expr | ListConstant(expr* args, ttype type) | ListLen(expr arg, ttype type, expr? value) | ListConcat(expr left, expr right, ttype type, expr? value) | ListItem(expr a, expr pos, ttype type, expr? value) | ListSection(expr a, array_index section, ttype type, expr? value) | ListPop(expr a, expr? index, ttype type, expr? value)

ttype | List(ttype type)

milancurcic commented 1 year ago

OK, your terminology is confusing to me, but we can pick any that most people here agree on. For this conversation, we can just call them "linked list" and "list", as you suggest.

Based on your description, as I understand it, your list is the same as a dynamic array. All elements of the same type, cheap indexing and prepending/appending, slow inserting.

The difference is not just an implementation detail. They're different data structures. I initially opened this thread to ask whether there's a desire for a structure with fast insertion (a linked list). I was motivated by the fact that there are many home-cooked implementations on the internet of Fortran linked lists, so it's possible that the community would benefit from a stdlib implementation. For me personally, your list is a more useful structure.

Considering the option of having both structures (linked list and list), do you think it would be a good design if they had the same (to the extent possible) API?

awvwgk commented 1 year ago

There is also need for a list holding class(*) values, for example to implement support for #671 using stdlib. If only one type is allowed user will need to manually instantiate a list using:

type :: wrapped_class
  class(*), allocatable :: raw
end type wrapped_class

And deal with the class dispatch themselves. In this case stdlib could directly provide such a wrapper class.

tclune commented 1 year ago

Related note: In the initial release of generics in Fortran, one will not be able to write CLASS(T) where T is a deferred type. So one cannot directly use a single template that works for both TYPE(T) and CLASS(T). Well, a user could implement their own wrapper and then either swear up and down in packing/unpacking at each interface, or write their own wrappers.

And you cannot even write an "outer" template to provide the polymorhic case, because you cannot even define the wrapper type within the template. Again - there is no way to write CLASS(T).

And no, I'm not happy about this, as well over half of my uses of containers are polymorphic. The majority of the subgroup prefers to wait for the initial set of features to be implemented so that we can more accurately plan subsequent extensions. I.e., so that we worry about the things that matter most after experience is gained.

certik commented 1 year ago

We can discuss the terminology at our next Fortran meeting. :)

@tclune, we have our former GSoC student (@ansharlubis) working on the generics in LFortran. If you know about anybody else who can help, please let me know. While we are at this: there does not seem to be many good examples of the syntax. Here is our issue for this: https://github.com/lfortran/lfortran/issues/929 where I linked all examples that I was able to find so far. If you create more examples, that would be a huge help.

tclune commented 1 year ago

@certik The only examples that properly use the syntax as it appears in the formal syntax paper are: linear_algebra and fundamental_requirements. (These are in the last link you provide in the reference above.)

Until now, the syntax has been a (slowly) moving target, and maintaining the examples takes time. If/when the syntax paper is approved, we can start producing "final" examples and deleting the ones with obsolete syntax.

certik commented 1 year ago

I see, thanks @tclune!

beddalumia commented 1 year ago

About terminology

It may be worth noting that Wikipedia explicitly states that a list can be implemented either as a linked list or a dynamic array (plus a generic statement about the former being more frequent for lisp dialects, where the terms "list" and "linked list" are often conflated.)

https://en.m.wikipedia.org/wiki/List_(abstract_data_type)

To be fair the first picture you see is a depiction of a linked list, so I totally get that it can be misleading. I've also being taught in bachelor that a "list" is a "linked list".

In general, I wouldn't want to give too much importance to Wikipedia per sé, but I like that is more or less language agnostic, and we do not necessarily want to follow conventions from either the C++, or python community or anything else, rather use terms that would be familiar for people with different backgrounds and, even more importantly, would not result confusing after googling them. The C++ case gives an important lesson in my opinion, as

to me it violates the principle of least surprise with std::vector (no one that is not aware of C++ specific lore would ever imagine what it is, from the name... if googling works now it's just for how much C++ containers are pervasive at this point)
given the diffusion and extensive usage nowadays, it's basically impossible to amend the name, so yes, it would be important to avoid choosing a misleading/debatable name from the start.

Relevant quote from stack overflow:

It's called a vector because Alex Stepanov, the designer of the Standard Template Library, was looking for a name to distinguish it from built-in arrays. He admits now that he made a mistake, because mathematics already uses the term 'vector' for a fixed-length sequence of numbers. C++11 compounds this mistake by introducing a class 'array' that behaves similarly to a mathematical vector. Alex's lesson: be very careful every time you name something.

original link with reference and discussion

beddalumia commented 1 year ago

Considering the option of having both structures (linked list and list), do you think it would be a good design if they had the same (to the extent possible) API?

I think this is exactly what you'll want to do, it would make easy for client code to switch from one to the other, according to different needs / requirements changes / benchmarks.

I was motivated by the fact that there are many home-cooked implementations on the internet of Fortran linked lists, so it's possible that the community would benefit from a stdlib implementation. For me personally, your list is a more useful structure.

I may just comment that I know that in the codebases that I'm working on during the PhD there is currently a homebrew implementation of a dynamic array (for sparse rows...) Older people told me that that was initially implemented as a linked list and used in many performance critical parts of the code. As they eventually scaled up the requirements, crafted a massively parallel algorithm for the iterative solver and started profiling the MPI performance they immediately spotted these linked lists as the main bottleneck and changed to dynamic arrays to give the very same "list functionality" but with acceptable performance.

beddalumia commented 1 year ago

As for class(*) containers, what about keeping it separated from lists, also in naming convention?

They would be inherently inferior for performance, but provide a whole new level of flexibility... this should suffice to qualify them as a totally separated thing, targeting very different use cases. Matlab calls such things[^1] "cells". How do you feel about such a name? Alternatives?

[^1]: not really sure about the underlying implementation, but the API is somewhat the same: a collection of heterogeneous things, generally slower than a builtin array (remember that in matlab all arrays are dynamic, so satisfy Ondrej's definition of list), and with quite limited support to be passed around. E.g. you cannot do plot(x, y) if x and y are cells, nor you can do linear algebra, etc. They provide a suitable cell2mat method to convert homogenous cells to arrays, when needed.

milancurcic commented 1 year ago

Since the terminology varies between communities. I suggest being explicit and descriptive such that the name reflects the properties of the structure. Perhaps those would be:

Linked List (type linked_list)
Dynamic Array (type dynamic_array)

ivan-pi commented 1 year ago

@milancurcic, is the difference that linked_list can hold items of different types, while dynamic_array contains items of the same type? Fortran allocatable arrays are dynamically allocated in the sense they can be resized. The dynamic_array on the other hand allows inserting/appending/removing (Wiki - Dynamic Array)?

I know a few algorithms which require a dynamic_array, and I implement them currently with an allocatable array and a smart reallocation procedure (https://github.com/fortran-lang/stdlib/issues/598). Recently, there was a query about a dynamic array at Fortran Discourse: https://fortran-lang.discourse.group/t/what-package-would-you-recommend-for-typed-lists/4359. In that case, the user only needed the append operation.

certik commented 1 year ago

For what is worth, the first time I heard the term dynamic array is from Milan above, and I wouldn't know what it means without this thread, since Fortran arrays are also "dynamic" (in my mind). But I don't have a better term for it right now (besides just List), so we can use DynamicArray, to mean the "size+capacity+pointer" implementation (like std::vector).

Thanks @bellomia for the comments! Yes, in LFortran's ASR (intermediate representation) we use the term List, and indeed it is independent of how it is actually implemented. So in order to be able to represent both linked list and size+capacity+pointer implementation, we could just add a flag to the type, such as "implementation=LinkedList | DynamicArray". So the API could indeed be exactly the same for both.

We can use "linked_list" and "dynamic_array" for implementations and then somehow alias "list" to mean "dynamic_array", as I think "list" is easier for people to understand what it means. Just "list" and "array".

milancurcic commented 1 year ago

Great, yes, I'm not so concerned about the final terminology in stdlib because we can arrive there through consensus in issues and PRs; but more that we are on the same page in this thread when we discuss these structures so that we know what each name refers to.

milancurcic commented 1 year ago

@ivan-pi As I understand it, the time complexity (fast insertion vs. fast indexing) and the type of elements are orthogonal.

A linked list has fast insertion and growing and shrinking but slow indexing. A list/vector has fast indexing and fast (amortized) growing and shrinking but slow insertion.

A linked list with class(*) elements allows you to mix types of elements and insert user-derived type instances as elements. But consuming data from the list is awkward because you need all the select type boilerplate. This is the kind of list that @ChetanKarwa worked on during his GSoC project.

But a regular list/vector can, I think, also be implemented with class(*) to allow the same flexibility with types.

And either structure with fixed (intrinsic) types is easy to consume elements from (no select type boilerplate needed) because the types are known at compile-time. And it's probably easier for a compiler to optimize because all elements have the same size in memory.

So, I think that choosing the specific structures to implement requires deciding along both dimensions, time complexity of operations, and type-flexibility vs. ease of use.

fortran-lang / stdlib

Linked list #68

Problem

Examples

What kind of data can the linked list hold?

API

About terminology