Open rapiz1 opened 4 years ago
FYI: @bradcray @krishnadey30 @e-kayrakli @cassella
How should the source be organized?
The Sort
module could be a good inspiration. AFAIK it has different submodules for different sorting algorithms. There have been some recent changes, so I may be wrong.
How should tests be organized?
Not really a big deal, but the scheme you describe makes sense to me.
The name of vector is still not settled. I just use it here for convenience.
I think this hints at the main discussion points that we shouldn't be missing.
I think enum is a good idea and we can have listImpl.linked
and listImpl.vector
, although the name "vector" has been scrutinized as being an overloaded term in HPC. I don't have a better alternative to suggest at the moment.
Cool, @Rapiz1!
I agree with Engin that submodules for the different implementations seems attractive from an organizational standpoint. I think I'd rather see "all routines for linked list" in one place and then "all routines for vector" in another rather than a more fine-grained interleaving of them. And I think your approach of using a record for the implementation of each will support that well. I think the big question is whether we put the submodules into the same source file or lean on the new capability to store submodules in a subdirectory, putting each implementation into its own file.
I think your proposed test organization and philosophy also seem reasonable. If the simple COMPOPTS to select between implementations approach didn't work well for some reason, we could consider investing more effort into a custom test script that knows how to select between .good files that follow a naming scheme as in the distribution robustness testing suite, but that would be nice to avoid if possible (and it seems like it should be).
I like an enum for selecting between the implementations because it supports a fixed number of choices without much opportunity for abuse, where I agree that naming is the hard part.
As alternatives to vector
I might suggest buffer
or dense
or array
. And then the current List implementation could be called something like multibuffer
or multiarray
(multidense
doesn't work quite as well, though it could be acceptable).
Also, as I've said before, I don't think this complete architecture / refactoring needs to be completed before any vector-oriented list code is merged to master. It would be fine with me if Vector showed up as a separate module (by some name) as an initial step to lock some code in and then we worked on refactoring the various implementations to all live under List
.
Tagging @dlongnecke-cray on this discussion since he wrote the original list code.
Also, as I've said before, I don't think this complete architecture / refactoring needs to be completed before any vector-oriented list code is merged to master. It would be fine with me if Vector showed up as a separate module (by some name) as an initial step to lock some code in and then we worked on refactoring the various implementations to all live under List.
I'm not sure what the status of the constrained generics effort is now, but if they're in any way feasible at all I'd want to hold out on any extensive modifications to container API until after we've explored interfaces as an alternative.
Which is to say I agree with you in wanting to see incremental merges...
I disagree that we should wait on supporting multiple implementations of list until constrained generics come on-line. While I agree that having constrained generics would be helpful, I don't see any practical benefit to waiting to provide users with multiple list implementations (and don't think retrofitting constrained generics into a multi-implementation list will be much worse than any other pre-existing module code).
As alternatives to vector I might suggest buffer or dense or array. And then the current List implementation could be called something like multibuffer or multiarray (multidense doesn't work quite as well, though it could be acceptable).
I like the idea of buffer
vs multibuffer
because I think it's a very good and clear hint of the internal. array
is kind of generic and doesn't give me thoughts other than List
is somehow relevant to Chapel Array. As for dense
, I'm not familiar with its meaning in programming so I don't have opinions.
I think the big question is whether we put the submodules into the same source file or lean on the new capability to store submodules in a subdirectory, putting each implementation into its own file.
Before creating this issue, I checked out the existing standard modules and found that they're all held in their own files. So I proposed a single file to respect that. If that's not an issue, I prefer putting in different files. Adding up all the code, it could be up to ~3000 lines which could make trouble for browsing and editing. Also, different implementations are independent so it makes sense to separate them.
I checked out the existing standard modules and found that they're all held in their own files.
We only added the ability to split submodules into their own files at the very end of the last release cycle, and haven't created new standard modules that contain submodules since then (that I'm aware of), so I think current practice just reflects how new the feature is and wouldn't let existing practice steer you away from splitting across files if that's your preference.
Some other terms that occurred to me is for the "vector-style" list is packed
or contiguous
. I also wanted to clarify that I'm open to using vector
as an enum value if people found that preferable.
I personally love the terms buffer
and multibuffer
myself. I worry that array
would cause confusion (as well as vector
, but for different reasons).
By looking at the discussion here, I think we are pretty much converging on listImpl.buffer
and listImpl.multibuffer
. I am OK with that idea.
I think I liked the term vector
mostly because of precedence in C++, but there if we agree on something that describes the implementation better I am all for it.
On the question of having a single list
type that is parameterized vs. having separate types that do that, I have a preference for separate types, generally speaking.
I guess for me it depends on whether or not the parameterized variants are fundamentally similar and needed in the standard library (single-buffer vs multi-buffer is more plausible for that) or if they are part of an infinite universe of types (in which case separate types seems better).
I think the test I would probably use for parameterizing vs. separate types is what the documentation looks like. I think that the documentation should cover what is O(1) and what is O(n) and what is parallel safe and what isn't (or what is invalidating and what isn't). I think this documentation will be more understandable for separate types if there are significant differences. (In other words - if we choose to parameterize - will the documentation cover each variant for each function? That wouldn't be so great.)
Following Michael's comment, I suggest also thinking whether these variants have the same interface, i.e. offer the same set of methods. At least if the set of methods is almost the same, OK if each has a couple of methods not available with other variants.
Also, obviously there needs to be a default i.e. the variant parameter can be omitted.
Vector has already been merged to test/library/draft/
in https://github.com/chapel-lang/chapel/pull/16048
I think the big question is whether we put the submodules into the same source file or lean on the new capability to store submodules in a subdirectory, putting each implementation into its own file.
@bradcray I prefer putting submodules in subdirectories. Large files (in this case, maybe 3000+ lines) are hard to edit and review.
One remaining question is how should we arrange the documentation.
A simple and effective method is just to follow the submodule hierarchy and generate docs. This gives out different pages for different implementations. The benefit is that it's more flexible and we have more control over each page. The disadvantage is that it may be tedious for having different pages whose content is kind of overlapping. And a user have to refer to different pages for different structures.
Another method is that putting all on one page. Under each method, listing which implementation supports and which doesn't. Also document the time complexity and so on. The benefit is that it may be easier to refer to and compare all implementations at one time if there's only one page. But it could also be confusing if descriptions aren't listed in an appropriate way.
Personally, I prefer the first method. As for the second method, I don't have a clear idea about how should the description about each method look like. And I think it could be confusing from a user's perspective.
@e-kayrakli @bradcray
As a consumer of the documentation, I think I'd prefer for it to be on one page as that seems easiest to digest and to compare between the different options. Or at least, to have some sort of unified summary at the top of the List page that listed the options and contrasted between them to help me decide which I wanted (maybe by giving asymptotic running times for their operations; or other characteristics that varied between them?)
In the first approach, what kind of information would appear on each submodule page? For example, it doesn't seem as though a list.insert()
or list.pop()
method would need or want to be documented separately for each submodule, but once for the list as a whole.
I agree with the general sentiment that seeing the differences between implementations can help us decide whether types should be separate or be parametrized.
@Rapiz1 -- can you create a short list/table of similarities and differences between your vector implementation and the current List? The things to pay attention to are:
foo
can behave completely identically, unless the list elements are owned non-nilable
etc.As far as I remember, the differences are small. And ideally I like the idea of having less data structures (and putting everything in a single type with parametrization), however, if the differences are just too many, we can separate the types, too.
@e-kayrakli I also listed UnrolledLinkedList here as @bradcray has said it could be part of the list.
UnrolledLinkedList doesn't support sort
Affected: insert
, remove
.
Vector and List have the same complexity. O(N)
Assuming that UnrolledLinkedList has B as the capacity for each linked node, then it's O(N/B+B). B is a constant factor specified by the users. To be strict, it shouldn't appear in big-O notation. It's important in practice so I wrote it here.
Affected: append
, pop
. extend
contains multiple calls to append
.
Vector, List: O(lgN) UnrolledLinkedList: O(B)
Affected: this
, last
Vector, List: O(1). But Vector is slightly faster at a constant factor level. UnrolledLinkedList: O(N/B + B)
No difference
UnrolledLinkedList and Vector don't support types that can't be default initialized.
I personally think that we can handle these differences under the same documentation page. This can be either having some sort of "introduction to implementation types" section in the header comment of List
docs, or individual notes in functions that have differences. Or both.
At least in the short term where we consider merging current List
and your Vector
, the only difference seems to be that Vector doesn't support types that can't be default inited, which can be a simple note/warning in the List docs header. So, we can punt the discussion about unrolled linkedlist for now.
I had a chat with @mppf yesterday and I am very much on the fence as to whether we should parametrize the type, or should have separate types for separate implementations.
For parametrization:
For separation:
parSafe
, for example. But an implementation of a container can be sufficiently different than others when parallel-safety is concerned, and very similar otherwise. How are we going to handle that?Constrained generics can help to some extent where we can have smaller interfaces and different data structures can implement multiple interfaces to make the functionality clearer. But even than, I think we should decide what our data structure support to look like, where the main question that we should answer is whether or not to separate implementations (buffer
, multibuffer
, treap
, skiplist
, unrolled link list
etc) from concepts (list
, set
, map
)
Eventually we'll have different enough implementations of container X that we'd want to separate them.
Is it guaranteed that the number of implementations of list will continue to grow unbounded? I'd like to think that 3-4 would be sufficient, but maybe that's naive...
Once I have 10 list implementations (say... not that I ever hope to have that many), what is the pain point that would cause me (as a developer) to start to want to separate them into distinct types?
I know your question is probably for @e-kayrakli and I hope he responds separately with his viewpoint.
For me there are 2 expected pain points for the param strategy if there are many implementations:
list
in order to add a new list
variant. (Because, in general, users will not be willing or able to modify the standard library). Additionally, supposing somebody wants to contribute a new container variation (say, a B-tree map
). Do we want them to have to write it somehow modules/standard/Map.chpl
in the first place? If not, they (and any users of the type) will have to change their code when the type is migrated into a parameterized map
type.where
clauses or param conditionals for all of the separate proc append
(or whatever). Making list
use forwarding
to wrap separate types that are chosen based on a param
argument to the initializer would help, but then we are separating them anyway (in implementation if not in user view).Is it guaranteed that the number of implementations of list will continue to grow unbounded?
list
is relatively simple and it is more plausible that a limited number of implementations will be sufficient. However when you include parallelism/concurrency strategy there seems to be much more room for a larger number. Immutable lists? Bounded lists that block? Copy-on-write lists? Lists using compare-and-swap instead of locks?
It might be that this discussion really is only focused on list
as the issue says, but I think rather we are making a precedent for ourselves with the other collection types. In my view, the param approach seems particularly problematic for map
and set
, where there are quite a few more variations. There are lots of the research papers on different maps and on different strategies for supporting parallelism. Just for the top-level approach - there are at the very least hashtables, comparison trees, and radix trees. Within each of those categories there are quite a few variations that might be important and requested by users.
For example, in Java, the number of implementations of map might at first look manageable (3 from the table say here ) but that doesn't include concurrent data structures and there end up being 2 more map variations. But then there are actually 31 implementations of map in the Java standard library but many of these are specialized for particular key/value types or cases where the map is contributing to the feature set of another type. You can also see this third-party collections library adds immutable maps, range maps, and bimaps.
I agree with Michael's points above (while still being on the fence about the general direction)
Once I have 10 list implementations (say... not that I ever hope to have that many), what is the pain point that would cause me (as a developer) to start to want to separate them into distinct types?
We are currently looking at only a handful of implementations and thinking that they are similar enough to be under the same type. But it is likely that as the number of different implementations increase, so will the differences. And as we have more implementations, the question of whether those implementations should be under an existing type will get more and more difficult to answer. In other words, having only a few examples of these scenarios doesn't feel like it is enough to set a precedent that maybe difficult to follow down the road.
Please do not call it vector
. It certainly is not that.
Split from https://github.com/chapel-lang/chapel/issues/15668 When discussing the interface of
Vector
, @bradcray proposes a new style of initialization ofList
instead of addingVector
as a module.This proves to be feasible and I have made a prototype with all the guidance from my mentors. Implementation: https://github.com/Rapiz1/gsoc2020-chapel-data-structures/blob/wrapper/Vector/src/ListNG.chpl Example: https://github.com/Rapiz1/gsoc2020-chapel-data-structures/blob/wrapper/Vector/example/listng.chpl
I would like to propose it more formally and look for more feedback. Also, some other design decisions have to be made before merging into master:
How should tests be organized? Most of the tests should work for both Vector and List. For these tests, I think a param determining
impl
can be added to existing tests and usecompopts
to run them twice. For implementation specified tests, they should be held in different directories. The hierarchy will look like this:How should the source be organized? Put all of the code in
List.chpl
. Make submodules namedimplList
,implVector
Naming The wrapper uses enum to select an implementation. So what could be good names? Possible Candidates:
listImpl.linked
andlistImpl.vector
listImpl.buffer
andlistImpl.multibuffer
listImpl.array
andlistImpl.multiarray
listImpl.dense
andlistImpl.multidense