fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License
1.06k stars 164 forks source link

Linked list #68

Open milancurcic opened 4 years ago

milancurcic commented 4 years ago

Problem

Linked list is one of the essential data structures beside an array. It allows you to add, insert, or remove elements in constant time, without re-allocating the whole structure.

Fortran doesn't have a linked list. There are 3rd party libraries, but no obvious go-to solution. Fortran stdlib should have a linked list. I would use it.

Examples

What kind of data can the linked list hold?

There's various levels of capability we could pursue:

  1. Single type: Basically just like an array, but allows insertion in constant time;
  2. Elements can be of any intrinsic type in a single list;
  3. Can take intrinsic type and user-defined derived types (is this even possible in current Fortran?)

API

I don't know, something like this?


use stdlib_experimental_collections, only :: List
type(List) :: a = List()

call a % append(42)
call a % append(3.141)
call a % append('text')
print *, a % get(2) ! prints 3.141
call a % remove(3) ! a is now List([42, 3.141])
call a % insert(2, 'hello') ! a is now List([42, 'hello', 3.141])

a = List([1, 2, 3]) ! instantiate a list from an array
certik commented 4 years ago

C++ has std::list for this. (I added Petaca to your Examples above.)

I would mention that I personally have never had a need for a std::list in C++, nor any linked list implementation in Fortran, because linked list is very slow (to create, traverse, destrogy, ...) compared to just a regular array or std::vector. The only operation that might be faster is insertion or deletion of individual items in the middle. In my use cases, I typically need to add elements to the end, in which case array works great.

But since there are at least 6 different people who reimplemented this already in Fortran and given that C++ has it too in their standard library, I would say that this would be a good candidate to include in stdlib, so that if people want to use it, they can. So +1 from me.

milancurcic commented 4 years ago

In my use cases, I typically need to add elements to the end, in which case array works great.

Me too, but how do you do it? I thought that appending to an array always re-allocates on heap, e.g.:

integer :: i
integer, allocatable :: a(:)
a = [integer ::]
do i = 1, 100000000
  a = [a, i] ! re-allocates a on every append
end do

It's okay, for small-to-moderate arrays, but for very large ones, isn't it crippling?

certik commented 4 years ago

The canonical way is to pre-allocate the array and then append to it, like this:

integer :: i
integer, allocatable :: a(:)
allocate(a(100000000))
do i = 1, 1000
    a(i) = i
end do

Then you use your actual application to figure out what the maximum size of the array is (100000000 in this example), and then you can either keep a as is (only use the first 1000 elements, as in this example), or you can copy it to a smaller array. A real world example is e.g. here: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/sparse.f90#L111, the Bj_ array is pre-allocated to the maximum size first (determined from the sparse arrays), and then downsized before returning to the user: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/sparse.f90#L127. This is typically still much faster than a linked list implementation. If you don't know the size ahead of time, then you can set some maximum at compile time and fail the program if you go over it (real world example: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/basis.f90#L230) --- many times this is fine, as you can recompile the code easily. But sometimes that's not appropriate, so then you can also do what std::vector does --- it doubles the allocation every time you reach it, and copies the data. Here is a fast implementation of that that I use in LFortran (that's in C++, but one can do something similar in Fortran also): https://gitlab.com/lfortran/lfortran/blob/57d3b8077d884f0ff3945ad3a86b2da920e4b6b3/src/lfortran/parser/parser_stype.h#L22. All of these are fast options.

But as I said, it's good to have linked list in stdlib, if people prefer that, so that they do not need to reimplement it.

rweed commented 4 years ago

First I think we need to define which types of linked list we need. I prefer a circular double-linked list as the basic type since its the type I use most in FEM codes etc. I also think we would need a single-link list to implement stacks and queues. Also do we need some form of reference counting. As to @milancurcic question as to current Fortran support list that can contain both intrinsic and user defined types, yes it can. I've implemented both a circular list class and a single link class using unlimited polymorphic variables. They works but are not pretty and will probably have poor perfomance when compared to a type specific list generated by pre-processing/templating methods ala the Fortran Template Library approach.

everythingfunctional commented 4 years ago

Generic linked-list, or really any generic data structure, is really cumbersome with the current Fortran capabilities. They work, but you end up having to use a select type block every time you want to access the data. So for convenience you'd end up with some wrapper class or library, at which point you might as well have re-implemented for your specific use case. Until we get fully parameterized types or template capabilities I don't think these are a great idea.

victorsndvg commented 4 years ago

I think the supported data types should be wrapped with containers in order to be extendible. In the main issue (https://github.com/fortran-lang/stdlib/issues/1) containers are mentioned, but I don't see any other specific issue.

I think FPL (https://github.com/victorsndvg/FPL) contains a smart implementation strategy for supporting native data types and allow to extend to other user defined data types. It contains lists, hash tables, etc. All of them depend on containers (aka wrappers) in order to manage different data types.

I agree that with this kind of data types you don't get performance, but amazing flexibility. This kind of data types (usually) are not for computation purposes.

Edit:

ivan-pi commented 4 years ago

I think many of the projects in the list of popular projects contain linked list implementations. Perhaps it would be good to do a grep over all of those repositories to get a feeling for linked list usage in production codes (e.g. whether they use generic lists supporting multiple kinds or only specific ones for the intrinsic kinds and potentially derived types).

nshaffer commented 4 years ago

I agree with @everythingfunctional on this issue. There's a ton of up-front labor in implementing fully polymorphic containers, and I'm not convinced that they're that much more useful than having generic (but homogeneous) containers. That is, I don't think it's worthwhile to support, say, linked lists where each element is of arbitrary type.

The more common use case I find is to need a linked list of int32, say. Or a binary tree of class(my_derived_t). These can be implemented without select type all over the place. There's still the labor of implementing all the intrinsic types, but this can be templated.

Letting users make containers of derived types is tricker. The common solution is to provide an abstract base class that users need to extend in order to have containers of derived types. I think that solution kind of sucks, but I have an alternate idea... Just ship source code templates that implement each container for class(__T__) and then let users run sed s/__T__/mytype/g on it to produce derived type containers on demand. (This will be slightly more involved for, e.g., mapping types, but just slightly).

I confess I have not thought through if there is some great pitfall to this approach besides being slightly "icky" from a distribution p.o.v.

nncarlson commented 4 years ago

I'm in agreement with @nshaffer here. I've done the linked-list-of class(*) variables in my own library which I use as a backend for some very specific things where such generality is needed. Otherwise it is incredibly clunky to use with all the select type and isn't an acceptable for general use, imo.

Someone else seemed to suggest that perhaps performance shouldn't be a concern here. I think it would be a big mistake to ignore performance. Linked lists come with their intrinsic performance overhead that most would be aware of, but any implementation that significantly added to that I would find unacceptable to include in a standard library.

I think the best solution beyond intrinsic types, which could all have very performant implementations, would be, as @nshaffer suggested, to provide a literal template that a user could adapt for their particular case. In fact that's more or less what I do myself.

zbeekman commented 4 years ago

A note on performance:

1) Yes linked lists have some overhead compared with arrays 2) They perform well for sorted data, and in instances where you're always manipulating one end of the list or the other, e.g., stacks, but, in general are NOT constant time lookup for random access read or insertion unless you're always operating on data "nearby" 3) They are a building block component for hash tables which are in general constant time insertion and lookup. 4) In some cases where storage needs vary greatly and dynamically in complex ways pre-allocating a huge array may not be feasible and you may want/need to use a linked list

I think there is merit to providing classic data structures and algorithms. I would add hash tables to this list as well as binary-trees, octrees, K-D trees, and a number of others. Obviously they are not useful to all users and applications but having a decent implementation is worthwhile.

I agree that right now the select type combinatorial explosion makes unlimited polymorphics nearly useless, and very awkward. In my opinion better generic programming should be the highest priority for the next major standard revision.

certik commented 4 years ago

@zbeekman Generic programming will not make it to the next standard revision -- simply because there is no proposal that is ready. I think the latest most developed idea is pursued at https://github.com/j3-fortran/fortran_proposals/issues/125, and we need everybody's help to help transform the idea into a solid proposal. Once we have a proposal that is community backed, I'll be happy to bring it to the committee and try to get it into the next standard.

zbeekman commented 4 years ago

I know @rouson is working with Magne who leads the Bergen Language Design Lab and also @tclune on generics. They have something here but I don't know how up to date it is with their current efforts. Hopefully they can combine efforts and we can get something in, we'll see.

certik commented 4 years ago

Yes, the issue https://github.com/j3-fortran/fortran_proposals/issues/125 is the latest based on our discussion with Magne at the last meeting. Anyway, let's move the discussion about this there, I just wanted to point this out, that we need help.

tclune commented 4 years ago

In the mean time I have a project https://github.com/Goddard-Fortran-Ecosystem/gFTL which provides (by far less elegant means) a generic container system. Currently it supports Vector and Map (ala C++ STL), but also has Set which is used under the hood.

gFTL uses the C preprocessor and requires explicit instantiation, but is still a real game changer for doing some common operations within Fortran. I have a separate project gFTL-shared that provides common instantiations.

But I do look forward to the day that this could be done much more elegantly through a proper generic facility. (And yes, I realize that other preprocessors could do what I have done more elegantly than the C preprocessor, but ... cpp is already integrated into the build systems for the other projects I work with.

gronki commented 4 years ago

I agree here with @zbeekman that linked lists are essential and I think the approach to preallocate array is very ineffective (cause then you have to check for overflow and re-allocate it etc). I also sadly agree that this is undoable in the current Fortran. Gotta wait for generics (or hopefully an intrinsic highly-optimized types for lists and dicts).

LadaF commented 4 years ago

Thanks for mentioning my little example (should have been updated a long time ago). I do agree that select type is a big drawback and I use tye parametric version (using cpp macros) whereever possible. I think a linked list is a useful structure in many areas, and so are also binary trees and other. Especially hn doing more CS stuff, as opposed to just scienific computation.

And many thanks for linking the current work on a proposal. I have looked at Java interfaces as a possible alternative to multiple inheritance in normal dynamic-dispatch polymorphism a long time ago. I did not realize the closeness to Haskell type-classes and I did not realize it could be useful for compile-time parallelism. I ill have to take more time to study it. I am still worried whether it will be optimizable to be as efficient as are C++ templates.

BTW, Ondřej @certik I happen to be a member of the MFF XC skiing club you used to be in some years ago :) I know your fortran90.org and LFortran projects but I did not know you were in J3.

Dne so 4. 1. 2020 0:40 uživatel Dominik Gronkiewicz < notifications@github.com> napsal:

I agree here with @zbeekman https://github.com/zbeekman that linked lists are essential and I think the approach to preallocate array is very ineffective (cause then you have to check for overflow and re-allocate it etc). I also sadly agree that this is undoable in the current Fortran. Gotta wait for generics (or hopefully an intrinsic highly-optimized types for lists and dicts).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/68?email_source=notifications&email_token=AAFSIEJBJR6YAQX2UWDNTW3Q37EGNA5CNFSM4KCFV36KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEICKR7A#issuecomment-570730748, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFSIEMGUOBUGC5JHT2EKLTQ37EGNANCNFSM4KCFV36A .

certik commented 4 years ago

@LadaF nice to meet you! Small world. You should put your name and photo at your GitHub profile if you can.

ngs333 commented 3 years ago

The FTL ( https://github.com/SCM-NV/ftl Fortran Template Library) is a substantial library that is worth looking at.

rouson commented 3 years ago

Linked lists are a prime example of an anti-pattern, i.e., "a common response to a recurring problem that is usually ineffective and risks being highly counterproductive." In order of preference, I recommend

  1. Document clearly why linked lists are not supported for reasons related to code performance, correctness, complexity, and generality. Refer people to C++ inventor Bjarne Stroustrup's explanation of why to avoid linked lists.
  2. Use recursive allocatable components, a Fortran 2008 feature designed to support singly-linked lists that allow only unidirectional traversal. This at least resolves the correctness and complexity issues. Resolving the generality issue will require generic programming in Fortran 202Y. There's probably not much that can resolve the performance issues except warning the user about operations to avoid in performance-critical sections of code.
  3. If bidirectional traversal is required, consider whether arrays and indirect addressing could be used instead of pointers.
  4. Use pointers only as a last resort and only if the pointers are encapsulated in a way that precludes memory leaks and dangling pointers. Doing so likely requires reference counting, which is tricky to automate and get right.

If 1 is untenable, start with 2 to reduce development time and maintenance hassles.

ngs333 commented 3 years ago

Every data structure (linked list, array, binary tree, heap, priority queue, kd-trees, etc ) has a set of operations along with its computational complexity for each data operation (Cormen et all, Introduction to algorithms). Each has its place in larger algorithms (e.g. ray queries, computational geomety, etc) and programs. Many of them have a place in larger patterns. Having said that, there are some that others are build on (e.g. linked list; heap; priority queue). On top of that, there are some that are so simple that if you can't easily implement it in a language, then the language has serious problems and don't bother even trying the more sophisticated ones. I think the liked list is a good if not natural place to start.

ivan-pi commented 3 years ago

Great advice @rouson! It would fit well into a Fortran book similar to the Effective C++ series by Scott Meyers.

I never realized one can make a forward list using allocatable components. Recently, I was writing some code for operating with polygons. I decided to implement a polygon as a linked list of edges, half-knowing I might come to regret it, but still thought it will be a fun challenge. Two-weeks later I already regret it. The pointers make it difficult to write getter functions which are pure, the copy behavior is convoluted, and so is finalization.

Instead I might try to implement the polygon with recursive allocatable components:

type :: edge
  real, dimension(2) :: start, end
  type(edge), allocatable :: next
end type

type :: polygon
  integer :: nedges
  type(edge) :: first_edge
end type

(I am aware that this could also be solved with a simple array of vertex coordinates, and the solution above duplicates some information.)

Btw, your link in the 4th bullet point is broken.

rouson commented 3 years ago

@ivan-pi thanks! I just fixed the link in my 4th bullet point.

arjenmarkus commented 3 years ago

Well, I agree with the critique on the seemingly obvious implementation of linked lists. But the API together with indirect addressing might be worth investigating: what is the performance difference between a linked list based on recursive derived types with pointers/allocatables and linked lists based on arrays with indirect addressing. Plus the difficulty of correct implementation and "proving" the correctness. It seems a nice (little?) fun project to me :). And an efficient and effective implementation could be used as a building block more complex structures.

Op wo 3 mrt. 2021 om 23:59 schreef Damian Rouson notifications@github.com:

@ivan-pi https://github.com/ivan-pi thanks! I just fixed the link in my 4th bullet point.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/68#issuecomment-790132976, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR3O5S77525ATC4NYV3TB25LLANCNFSM4KCFV36A .

ChetanKarwa commented 3 years ago

Hello everyone, I am looking forward to solving this issue. I wish to contribute under GSOC 2021. I wish to start my work from now forth but I don't see a final conclusion on what kind of linked list is required here.

Mentors assigned for this project in GSOC are @arjenmarkus and @milancurcic. So, could any one of you finalize a clear picture of what kind of linked list is required?

A homogenous generic linked list (a linked list that contains only a single type of data i.e. decided by the user, this is similar to the list given in standard library in C++) [1,2,3,4,5,20]

or A heterogeneous generic linked list (A linked list that can contain any kind of data at its node be it an integer, character array, double or any other data type. this is similar to the list used in python) [1,'Hello',3.14,20]

arjenmarkus commented 3 years ago

Welcome to the project :).

The questions you raise are actually part of the project. The design questions need to be elaborated:

ChetanKarwa commented 3 years ago

Ok. The following are my views regarding the issue here.

LadaF commented 3 years ago

It is certainly possible to have a linked list or an array of container types that contains a class(*) item. The user of that list then will be responsible to know what types there might be inside and setup the appropriate select type type guards.

ChetanKarwa commented 3 years ago

I know that it is possible for one to make such a linked list but how can we expect our user to remember or to know what type there might be inside the node. User will have to maintain different data that describes the type carried by that node.

LadaF commented 3 years ago

No, you do not have to remember what data is in that node. You just have to know what kind of data you expect to encounter in your list. That is usually a reasonable assumption.

Dne ne 21. 3. 2021 20:52 uživatel ChetanKarwa @.***> napsal:

I know that it is possible for one to make such a linked list but how can we expect our user to remember or to know what type there might be inside the node. User will have to maintain different data that describes the type carried by that node.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/68#issuecomment-803649353, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFSIEMI23AZDKDSLDISNFLTEZFAVANCNFSM4KCFV36A .

ChetanKarwa commented 3 years ago

See I feel there is some misunderstanding here, I am also talking about the type of data stored in a list with mixed data types. For example: [ "someString", 45, 3.33, true, "ok"] So considering this kind of list user cannot remember what type of data is stored at what node because for example at the 2nd index its an integer(45) type and when we remove the first node the data-type at the 2nd index will be double (3.33). final list: [45, 3.33, true, "ok"]

LadaF commented 3 years ago

But, I repeat, you do not have to remember what kind of data is in which node of the list. You just need to know what data might be there and include it in the select type type guards when accessing the data in the list nodes.

The list itself may just return a class(*) pointer or something similar.

select type (res)
  type is(integer)
     ...
  type is (real)
     ...
  type is (character(*))
    ...
  type is (my_type_defined_elsewhere)
    ...

and so on

arjenmarkus commented 3 years ago

Indeed, via such a construction you can store and retrieve data of any type. Various alternatives are possible, though it might mean that we need to resort to a subroutine style of interface instead of purely a function style to have the compiler resolve the correct version of the retrieval routine. Although, consider this:

x = list%get(..)

assumes that the return type of the get method is always the same (so that conflicts with the heterogeneous character), but we may want to make it return a derived type and provide user-defined assignments to transfer the actual data.

Op zo 21 mrt. 2021 om 21:18 schreef Vladimír Fuka @.***

:

But, I repeat, you do not have to remember what kind of data is in which node of the list. You just need to know what data might be there and include it in the select type type guards when accessing the data in the list nodes.

The list itself may just return a class(*) pointer or something similar.

select type (res) type is(integer) ... type is (real) ... type is (character(*)) ... type is (my_type_defined_elsewhere) ...

and so on

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/68#issuecomment-803653129, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR3XFBLNSJGDVYLNRWLTEZICPANCNFSM4KCFV36A .

ChetanKarwa commented 3 years ago

Okay now I get it we can use a polymorphic data type as a return type and the user can use "select type" on it to know what data type it is. So I would like to go forth and implement a simple heterogeneous doubly linked-list module for this project.

arjenmarkus commented 3 years ago

You can start with a simple data type that stores an arbitrary item and allows the user to retrieve it in a type-safe way. This will separate the concerns:

Op di 23 mrt. 2021 om 05:00 schreef ChetanKarwa @.***>:

Okay now I get it we can use a polymorphic data type as a return type and the user can use "select type" on it to know what data type it is. So I would like to go forth and implement a simple heterogeneous doubly linked-list module for this project.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/68#issuecomment-804598677, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR5MC3EZSWCCBBZWRHDTFAG7JANCNFSM4KCFV36A .

ChetanKarwa commented 3 years ago

I have implemented a rough sketch of the Heterogenous Doubly linked list. I have added APIs like append, get, remove. For now, I have added files to my GitHub repository. link Please check it out and guide me a little on how-to move forth from this point.

arjenmarkus commented 3 years ago

Nice start - I will have a look and come back to you about it,

Op di 23 mrt. 2021 om 20:56 schreef ChetanKarwa @.***>:

I have implemented a rough sketch of the Heterogenous Doubly linked list. I have added APIs like append, get, remove. For now, I have added files to my GitHub repository. link https://github.com/ChetanKarwa/Fortran-Linked-List Please check it out and guide me a little on how-to move forth from this point.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/68#issuecomment-805193636, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR4XVFFVITZ75YSSXFLTFDW5BANCNFSM4KCFV36A .

rouson commented 3 years ago

I realize linked lists are popular and convenient, but I hope the issues around correctness and performance are being kept in mind. It would be nice if at least one of the two requisite pointer components is switched to allocatable to prevent even the possibility of a dangling pointer or memory leak. Other than an abstract, aesthetic notion of symmetry, is there any reason for both to be pointers?

A decade ago, when I worked on the Trilinos project, which already surpassed a million lines of code back then, contributors were forbidden from using raw pointers and this was a C++ project so that was a pretty radical position at the time. Contributors were required to use a reference-counted pointer template class instead — effectively what these days would be called a smart pointer. Fortran’s allocatable variables are our smart pointers. Whenever they can be used, I recommend using them over pointers in every case. There are so many subtle ways to get things wrong in hard-to-debug ways with pointers that I’ll avoid using a library that has pointers under the hood if I can possibly avoid it. I recently abandoned a library that was exhibiting strange behaviors that I couldn’t diagnose and was using pointers. You can’t just think about whether the code works now. Think about what happens when a naive new developer comes into a project, not knowing the best practices to keep the code safe from runtime errors. I basically won’t believe the correctness of any non-trivial code that uses pointers if that code doesn’t encapsulate the pointers in a way that guarantees automated reference-counting and if the project doesn’t enforce a policy that only the encapsulated form of reference-counted pointer ever be used in the project.

Someday I would love to see a calculation of the percentage of interpretation requests submitted to the Fortran committee related to pointers versus and a comparison to the amount of text in the standard related to pointers. I suspect the confusion pointers subtly introduce is disproportionately large relative to the parts of the standard that relate to pointers.

everythingfunctional commented 3 years ago

A singly linked list can be implemented using allocatables in Fortran. I'm with @rouson that a doubly linked list should only add the prev component to a node and the tail component as pointers, not have everything be pointers. That greatly reduces the possibility of memory leaks.

Second, all your components should be private. I don't think your node type needs to be public either.

Next, your list type should really only have the following public type-bound procedures:

You shouldn't have get or insert, as random access with a linked list is slow. If you need random access you shouldn't be using a linked list in the first place. You should switch to a vector, or some auto-balancing tree. In fact, it's been demonstrated that even in-order traversal of a linked list is often slower than other data structures.

Finally, I've had sufficient issues with compilers still being buggy around class(*) and causing memory leaks and segfaults. I'd define an abstract type, storable or something, with no components or TBPs, that anything to be stored in the list must extend from. That's always seemed to be much more reliable in my experience (but unfortunately means you can't store intrinsic types).

P.S. The visitor should be an abstract type with a single deferred procedure, visit with the following interface. See the visitor pattern.

abstract interface
  subroutine visit(self, item)
    class(visitor), intent(inout) :: self
    class(storable), intent(inout) :: item
  end subroutine
end interface
milancurcic commented 3 years ago

@ChetanKarwa Thanks and great start with the implementation. As you can probably tell, it will be a longer journey to nail down the requirements and implementation details, so don't expect this to be suitable for your application's patch requirement.

Instead, look for some simple issue to fix, perhaps add tests where there are incomplete, or improve docs, to submit a small and simple PR for your patch requirement.

milancurcic commented 3 years ago

You shouldn't have get or insert, as random access with a linked list is slow.

I don't think that's a good argument. An operation being slow doesn't make it not useful. I'd often prefer a slow operation than to carry two structures in memory.

rouson commented 3 years ago

Has a singly-linked list been developed? If not, how about starting with that. Once that is rock-solid, I wonder if any of the code could be reused in writing the doubly-linked list.

misael-diaz commented 2 years ago

I guess this will sound crazy but how about implementing a linked-list class that encapsulates a linked-list written in C? For me it would be very cool to see how would you do it and if it performs better than its implementation in pure FORTRAN.

I ask because I have been experimenting with that idea lately. To my surprise a singly linked-list written in C storing 65536 32-bit integers takes about 8 seconds to be destroyed in contrast to its FORTRAN counterpart which takes several minutes. Maybe my implementation in FORTRAN was no good and my conclusion is misguided. Nevertheless, I would like to know your thoughts.

certik commented 2 years ago

I still don't understand when you would use a linked list. Deallocating 65K integers should be immediate, not take 8s. Using a 1D array would achieve that.

tclune commented 2 years ago

@misael-diaz Encapsulating a C implementation is a reasonable approach for lists of (most) intrinsic types. However, it does not help much with linked-lists whose elements are non C-interoperable derived types. You can C_LOC and such to get around such obstacles, but then your wrapper becomes a bit more involved.

Note that linked-lists are almost never the ideal container for real world purposes. There's a nice video somewhere of Stroustrup discussing this at length. Often a simple Vector (ala STL terminology) will outperform in real use cases despite technically having a higher insertion/deletion cost.

If I get a chance, I'll try to time the clear() method on the Set container in my template library. Not quite a linked-list, as it uses a tree, but it would have a similar number of nodes to destroy. I'm fairly certain that it will be < 1 second, though definitely still slower than a Vector.

misael-diaz commented 2 years ago

@tclune thank you for the nice explanation.

Some potential users of the library like myself might be content if the linked-list supports at least intrinsic types. In numerical simulations it's not uncommon to use integers to identify objects and I suppose that in some real-life applications as well. Thus, such a basic linked-list might be of some use.

I did watch Bjarne Stroustrup's lecture on linked-lists but unfortunately there were no graphs comparing the performance of a vector and a linked-list. I am sure you are familiar with A Koenig's and B Moo's Accelerated C++ textbook. In page 87 they show a nice table that compares the runtime performance of a list and a vector as the number of records in a data file increases. For large data sets the list outperforms the vector whereas for smaller data sets they perform more or less similarly. I suppose that for even smaller data sets the vector would be the ideal data structure for that particular application.

Users of the library might want to try out different data structures to see which one yields the best performance. Besides performance, the users of might also choose a data structure that reduces the development time of the application that they are trying to develop so that they can move on with other pressing tasks.

It is my understanding that a tree is a data structure with rapid search capabilities so I suppose that it will outperform a singly linked-list, especially when destroying the elements that comprise it. Unless there's a better way I am not aware of, destroying a singly linked list requires traversing the list all the way to the last element, destroy it, and start all over again from the beginning of the list till the last element, destroy it, and repeat the process until there are no more elements left to destroy. There's a lot of overhead in doing that so it's likely that destroying the elements of a singly linked-list will show up as a bottleneck when profiling an application that uses it.

Again, thank you for taking the time to reply. I still would like to see how much time it takes for the tree you mentioned to clear its elements.

I am looking forward to use the facilities provided by the FORTRAN standard library in my work in the near future.

Bests regards

tclune commented 2 years ago

The set (tree) implementation still has to do pretty much the same thing, just not in the same order. It has to traverse down to each leaf node for the deletion and then traverse back up. The total number of nodes is N, the total number of deletions is N and the total number of steps up/down within the tree is O(N).

I've attached my small reproducer below. For 10^6 elements, my IntegerSet container builds in 2 seconds and destroys in 0.21 seconds. (Note that this is > 10x larger than the case you cite.) If I insert the elements in order, the construction time reduces to ~1 second.

program main
   use, intrinsic :: iso_fortran_env, only: INT64, REAL64
   use gFTL2_IntegerSet
   implicit none

   type(IntegerSet) :: s
   integer :: i, j, n
   real(kind=REAL64) :: x

   integer(kind=INT64) :: c0, c1, cr

   n = 10**6
   s = IntegerSet()

   call system_clock(c0, cr)
   do i = 1, N
      call random_number(x)
      j = 1 + floor(N*x)
      call s%insert(i)
   end do
   call system_clock(c1)

   print*,'construction time: ', real(c1-c0)/cr

   call system_clock(c0)
   call s%clear()
   call system_clock(c1)
   print*,'deconstruction time: ', real(c1-c0)/cr

end program main
tclune commented 2 years ago

It would be fairly straightforward to add List (STL name for doubly-linked list) to my package. Just a bit tedious though, so not something I'm likely to get to in the near future.

tclune commented 2 years ago

I'm sorry I misread your email above. You should be able to delete a doubly-linked list in O(N) steps. Your description suggests that your implementation is O(N^2). You can start at either end. Deleting a single entry involves first connecting the predecessor and successor nodes, and then deleting the current node. You then traverse in either order just once. Slightly special logic for the first/last nodes which lack a predecessor/successor, of course.

misael-diaz commented 2 years ago

@tclune thank you for your quick response, for sharing the runtime, and for point out inefficiencies in my implementation. I hope to try out the approach which you have described soon. It would be great to know the CPU specs, the compiler and the optimization flags (if any) that you used to compile the code.

tclune commented 2 years ago

2.4 GHz 8-Core Intel Core i9 (Recent vintage MacBook Pro)