chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.8k stars 421 forks source link

Unified interface for `List` and so called `Vector` #15913

Open rapiz1 opened 4 years ago

rapiz1 commented 4 years ago

Split from https://github.com/chapel-lang/chapel/issues/15668 When discussing the interface of Vector, @bradcray proposes a new style of initialization of List instead of adding Vector as a module.

List(impl.linked) vs. List(impl.vector) vs. List(impl.multivector)

This proves to be feasible and I have made a prototype with all the guidance from my mentors. Implementation: https://github.com/Rapiz1/gsoc2020-chapel-data-structures/blob/wrapper/Vector/src/ListNG.chpl Example: https://github.com/Rapiz1/gsoc2020-chapel-data-structures/blob/wrapper/Vector/example/listng.chpl

I would like to propose it more formally and look for more feedback. Also, some other design decisions have to be made before merging into master:

rapiz1 commented 4 years ago

FYI: @bradcray @krishnadey30 @e-kayrakli @cassella

e-kayrakli commented 4 years ago

How should the source be organized?

The Sort module could be a good inspiration. AFAIK it has different submodules for different sorting algorithms. There have been some recent changes, so I may be wrong.

How should tests be organized?

Not really a big deal, but the scheme you describe makes sense to me.

The name of vector is still not settled. I just use it here for convenience.

I think this hints at the main discussion points that we shouldn't be missing.

I think enum is a good idea and we can have listImpl.linked and listImpl.vector, although the name "vector" has been scrutinized as being an overloaded term in HPC. I don't have a better alternative to suggest at the moment.

bradcray commented 4 years ago

Cool, @Rapiz1!

I agree with Engin that submodules for the different implementations seems attractive from an organizational standpoint. I think I'd rather see "all routines for linked list" in one place and then "all routines for vector" in another rather than a more fine-grained interleaving of them. And I think your approach of using a record for the implementation of each will support that well. I think the big question is whether we put the submodules into the same source file or lean on the new capability to store submodules in a subdirectory, putting each implementation into its own file.

I think your proposed test organization and philosophy also seem reasonable. If the simple COMPOPTS to select between implementations approach didn't work well for some reason, we could consider investing more effort into a custom test script that knows how to select between .good files that follow a naming scheme as in the distribution robustness testing suite, but that would be nice to avoid if possible (and it seems like it should be).

I like an enum for selecting between the implementations because it supports a fixed number of choices without much opportunity for abuse, where I agree that naming is the hard part.

As alternatives to vector I might suggest buffer or dense or array. And then the current List implementation could be called something like multibuffer or multiarray (multidense doesn't work quite as well, though it could be acceptable).

Also, as I've said before, I don't think this complete architecture / refactoring needs to be completed before any vector-oriented list code is merged to master. It would be fine with me if Vector showed up as a separate module (by some name) as an initial step to lock some code in and then we worked on refactoring the various implementations to all live under List.

Tagging @dlongnecke-cray on this discussion since he wrote the original list code.

dlongnecke-cray commented 4 years ago

Also, as I've said before, I don't think this complete architecture / refactoring needs to be completed before any vector-oriented list code is merged to master. It would be fine with me if Vector showed up as a separate module (by some name) as an initial step to lock some code in and then we worked on refactoring the various implementations to all live under List.

I'm not sure what the status of the constrained generics effort is now, but if they're in any way feasible at all I'd want to hold out on any extensive modifications to container API until after we've explored interfaces as an alternative.

Which is to say I agree with you in wanting to see incremental merges...

bradcray commented 4 years ago

I disagree that we should wait on supporting multiple implementations of list until constrained generics come on-line. While I agree that having constrained generics would be helpful, I don't see any practical benefit to waiting to provide users with multiple list implementations (and don't think retrofitting constrained generics into a multi-implementation list will be much worse than any other pre-existing module code).

rapiz1 commented 4 years ago

As alternatives to vector I might suggest buffer or dense or array. And then the current List implementation could be called something like multibuffer or multiarray (multidense doesn't work quite as well, though it could be acceptable).

I like the idea of buffer vs multibuffer because I think it's a very good and clear hint of the internal. array is kind of generic and doesn't give me thoughts other than List is somehow relevant to Chapel Array. As for dense, I'm not familiar with its meaning in programming so I don't have opinions.

I think the big question is whether we put the submodules into the same source file or lean on the new capability to store submodules in a subdirectory, putting each implementation into its own file.

Before creating this issue, I checked out the existing standard modules and found that they're all held in their own files. So I proposed a single file to respect that. If that's not an issue, I prefer putting in different files. Adding up all the code, it could be up to ~3000 lines which could make trouble for browsing and editing. Also, different implementations are independent so it makes sense to separate them.

bradcray commented 4 years ago

I checked out the existing standard modules and found that they're all held in their own files.

We only added the ability to split submodules into their own files at the very end of the last release cycle, and haven't created new standard modules that contain submodules since then (that I'm aware of), so I think current practice just reflects how new the feature is and wouldn't let existing practice steer you away from splitting across files if that's your preference.

Some other terms that occurred to me is for the "vector-style" list is packed or contiguous. I also wanted to clarify that I'm open to using vector as an enum value if people found that preferable.

dlongnecke-cray commented 4 years ago

I personally love the terms buffer and multibuffer myself. I worry that array would cause confusion (as well as vector, but for different reasons).

e-kayrakli commented 4 years ago

By looking at the discussion here, I think we are pretty much converging on listImpl.buffer and listImpl.multibuffer. I am OK with that idea.

I think I liked the term vector mostly because of precedence in C++, but there if we agree on something that describes the implementation better I am all for it.

mppf commented 4 years ago

On the question of having a single list type that is parameterized vs. having separate types that do that, I have a preference for separate types, generally speaking.

I guess for me it depends on whether or not the parameterized variants are fundamentally similar and needed in the standard library (single-buffer vs multi-buffer is more plausible for that) or if they are part of an infinite universe of types (in which case separate types seems better).

I think the test I would probably use for parameterizing vs. separate types is what the documentation looks like. I think that the documentation should cover what is O(1) and what is O(n) and what is parallel safe and what isn't (or what is invalidating and what isn't). I think this documentation will be more understandable for separate types if there are significant differences. (In other words - if we choose to parameterize - will the documentation cover each variant for each function? That wouldn't be so great.)

vasslitvinov commented 4 years ago

Following Michael's comment, I suggest also thinking whether these variants have the same interface, i.e. offer the same set of methods. At least if the set of methods is almost the same, OK if each has a couple of methods not available with other variants.

Also, obviously there needs to be a default i.e. the variant parameter can be omitted.

rapiz1 commented 4 years ago

Vector has already been merged to test/library/draft/ in https://github.com/chapel-lang/chapel/pull/16048

I think the big question is whether we put the submodules into the same source file or lean on the new capability to store submodules in a subdirectory, putting each implementation into its own file.

@bradcray I prefer putting submodules in subdirectories. Large files (in this case, maybe 3000+ lines) are hard to edit and review.

rapiz1 commented 4 years ago

One remaining question is how should we arrange the documentation.

A simple and effective method is just to follow the submodule hierarchy and generate docs. This gives out different pages for different implementations. The benefit is that it's more flexible and we have more control over each page. The disadvantage is that it may be tedious for having different pages whose content is kind of overlapping. And a user have to refer to different pages for different structures.

Another method is that putting all on one page. Under each method, listing which implementation supports and which doesn't. Also document the time complexity and so on. The benefit is that it may be easier to refer to and compare all implementations at one time if there's only one page. But it could also be confusing if descriptions aren't listed in an appropriate way.

Personally, I prefer the first method. As for the second method, I don't have a clear idea about how should the description about each method look like. And I think it could be confusing from a user's perspective.

@e-kayrakli @bradcray

bradcray commented 4 years ago

As a consumer of the documentation, I think I'd prefer for it to be on one page as that seems easiest to digest and to compare between the different options. Or at least, to have some sort of unified summary at the top of the List page that listed the options and contrasted between them to help me decide which I wanted (maybe by giving asymptotic running times for their operations; or other characteristics that varied between them?)

In the first approach, what kind of information would appear on each submodule page? For example, it doesn't seem as though a list.insert() or list.pop() method would need or want to be documented separately for each submodule, but once for the list as a whole.

e-kayrakli commented 4 years ago

I agree with the general sentiment that seeing the differences between implementations can help us decide whether types should be separate or be parametrized.

@Rapiz1 -- can you create a short list/table of similarities and differences between your vector implementation and the current List? The things to pay attention to are:

As far as I remember, the differences are small. And ideally I like the idea of having less data structures (and putting everything in a single type with parametrization), however, if the differences are just too many, we can separate the types, too.

rapiz1 commented 4 years ago

@e-kayrakli I also listed UnrolledLinkedList here as @bradcray has said it could be part of the list.

Supported/unsupported methods

UnrolledLinkedList doesn't support sort

Methods that have different complexities

Insertion/Deletion

Affected: insert, remove.

Vector and List have the same complexity. O(N)

Assuming that UnrolledLinkedList has B as the capacity for each linked node, then it's O(N/B+B). B is a constant factor specified by the users. To be strict, it shouldn't appear in big-O notation. It's important in practice so I wrote it here.

Append/Pop

Affected: append, pop. extend contains multiple calls to append.

Vector, List: O(lgN) UnrolledLinkedList: O(B)

Indexing

Affected: this, last

Vector, List: O(1). But Vector is slightly faster at a constant factor level. UnrolledLinkedList: O(N/B + B)

Methods that behave differently

No difference

Supported/unsupported types

UnrolledLinkedList and Vector don't support types that can't be default initialized.

e-kayrakli commented 4 years ago

I personally think that we can handle these differences under the same documentation page. This can be either having some sort of "introduction to implementation types" section in the header comment of List docs, or individual notes in functions that have differences. Or both.

At least in the short term where we consider merging current List and your Vector, the only difference seems to be that Vector doesn't support types that can't be default inited, which can be a simple note/warning in the List docs header. So, we can punt the discussion about unrolled linkedlist for now.

e-kayrakli commented 4 years ago

I had a chat with @mppf yesterday and I am very much on the fence as to whether we should parametrize the type, or should have separate types for separate implementations.

For parametrization:

For separation:

Constrained generics can help to some extent where we can have smaller interfaces and different data structures can implement multiple interfaces to make the functionality clearer. But even than, I think we should decide what our data structure support to look like, where the main question that we should answer is whether or not to separate implementations (buffer, multibuffer, treap, skiplist, unrolled link list etc) from concepts (list, set, map)

bradcray commented 4 years ago

Eventually we'll have different enough implementations of container X that we'd want to separate them.

Is it guaranteed that the number of implementations of list will continue to grow unbounded? I'd like to think that 3-4 would be sufficient, but maybe that's naive...

Once I have 10 list implementations (say... not that I ever hope to have that many), what is the pain point that would cause me (as a developer) to start to want to separate them into distinct types?

mppf commented 4 years ago

I know your question is probably for @e-kayrakli and I hope he responds separately with his viewpoint.

For me there are 2 expected pain points for the param strategy if there are many implementations:

  1. it is necessary to modify the standard list in order to add a new list variant. (Because, in general, users will not be willing or able to modify the standard library). Additionally, supposing somebody wants to contribute a new container variation (say, a B-tree map). Do we want them to have to write it somehow modules/standard/Map.chpl in the first place? If not, they (and any users of the type) will have to change their code when the type is migrated into a parameterized map type.
  2. if the number of implementations is large, it will become awkward to write where clauses or param conditionals for all of the separate proc append (or whatever). Making list use forwarding to wrap separate types that are chosen based on a param argument to the initializer would help, but then we are separating them anyway (in implementation if not in user view).

Is it guaranteed that the number of implementations of list will continue to grow unbounded?

list is relatively simple and it is more plausible that a limited number of implementations will be sufficient. However when you include parallelism/concurrency strategy there seems to be much more room for a larger number. Immutable lists? Bounded lists that block? Copy-on-write lists? Lists using compare-and-swap instead of locks?

It might be that this discussion really is only focused on list as the issue says, but I think rather we are making a precedent for ourselves with the other collection types. In my view, the param approach seems particularly problematic for map and set, where there are quite a few more variations. There are lots of the research papers on different maps and on different strategies for supporting parallelism. Just for the top-level approach - there are at the very least hashtables, comparison trees, and radix trees. Within each of those categories there are quite a few variations that might be important and requested by users.

For example, in Java, the number of implementations of map might at first look manageable (3 from the table say here ) but that doesn't include concurrent data structures and there end up being 2 more map variations. But then there are actually 31 implementations of map in the Java standard library but many of these are specialized for particular key/value types or cases where the map is contributing to the feature set of another type. You can also see this third-party collections library adds immutable maps, range maps, and bimaps.

e-kayrakli commented 4 years ago

I agree with Michael's points above (while still being on the fence about the general direction)

Once I have 10 list implementations (say... not that I ever hope to have that many), what is the pain point that would cause me (as a developer) to start to want to separate them into distinct types?

We are currently looking at only a handful of implementations and thinking that they are similar enough to be under the same type. But it is likely that as the number of different implementations increase, so will the differences. And as we have more implementations, the question of whether those implementations should be under an existing type will get more and more difficult to answer. In other words, having only a few examples of these scenarios doesn't feel like it is enough to set a precedent that maybe difficult to follow down the road.

damianmoz commented 3 years ago

Please do not call it vector. It certainly is not that.