Derecho-Project / derecho

The main code repository for the Derecho project.
BSD 3-Clause "New" or "Revised" License

Design, implement, and contribute Java API for Derecho #26

Open scottslewis opened 6 years ago

scottslewis commented 6 years ago

This issue is to track design work on a Java API for Derecho. Sub-issues will be created as needed, and technical discussions/proposals can take place on this issue.

scottslewis commented 6 years ago

I've been using the Emulab/Cloudlab to build Derecho successfully, and have been examining the test/experiment code and derecho APIs.

I have a couple of API design questions and thoughts below. Feel free to point me to other docs (e.g. rdma) if appropriate.

The server_rank: is it always 0? Or can it be some > 0 value...but it's always the minimum for the group?

The node_rank: Does it have some semantics in addition to group uniqueness? If so, what is that?

I see that there are some default port values, but I don't understand whether (e.g.) those are fixed or whether they can/could be specified by developers.

Is there any notion (in rdma and/or derecho) of group authentication/access control? Does that make sense?

Thoughts

I would be inclined to separate the 'create Group instance' operation from the 'connect to Group' operation in the Java API (at least optionally). A few reasons for this: a) it makes for simpler constructors; b) it gives the developer finer control; c) Java has such a convention for communications libraries (e.g. sockets, others); d) it makes it easier to introduce callbacks/listeners for various operations (e.g. connect/disconnect), so that the Group behavior can be extended via delegation rather than inheritance; e) it avoids having constructors throw exceptions (which can be awkward). I know it introduces some API complexity, but I would say it's worth it. Thoughts?
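A minimal sketch of what that split could look like on the Java side. Everything here is hypothetical: `SketchGroup`, `GroupListener`, and their methods are illustration names, not Derecho API, and `connect()` only flips a flag where a real binding would call into native code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface; not part of Derecho.
interface GroupListener {
    void onConnected(String groupId);
    void onDisconnected(String groupId);
}

// Hypothetical Java-side group handle. Construction does no I/O;
// connect() is where a real binding would perform the join.
final class SketchGroup {
    private final String groupId;
    private final List<GroupListener> listeners = new ArrayList<>();
    private boolean connected = false;

    SketchGroup(String groupId) {
        this.groupId = groupId;
    }

    // Listeners can be registered before the group is connected.
    void addListener(GroupListener listener) {
        listeners.add(listener);
    }

    boolean isConnected() {
        return connected;
    }

    // Placeholder for the join. Failures would surface here
    // rather than from the constructor.
    void connect() {
        if (connected) {
            throw new IllegalStateException("already connected");
        }
        connected = true;
        for (GroupListener l : listeners) {
            l.onConnected(groupId);
        }
    }

    // Disconnecting an unconnected group is a no-op.
    void disconnect() {
        if (!connected) {
            return;
        }
        connected = false;
        for (GroupListener l : listeners) {
            l.onDisconnected(groupId);
        }
    }
}
```

The point of the shape is that behavior is added by registering a `GroupListener` rather than by subclassing the group, which is reason (d) above.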

I think the java API strategy should be:

1) Start with a few C++ 'wrapper' classes, that expose the Java API that we define.

2) Have these wrapper classes use the existing Derecho APIs to implement their behavior.

3) Add API to these wrapper classes to incorporate/expose via Java more parts of the Derecho API.

The idea will be to incrementally add to the Java API while trying to limit the required changes/additions to the existing lower-level Derecho APIs.

I'm thinking of starting with using SWIG: http://www.swig.org/

SWIG will generate the necessary Java and JNI code, given the wrapper classes described above. And hopefully it will have the positive side effect of being able to also produce API wrappers for other languages (e.g. Python) without a huge amount of additional work.

Will the libfabric work allow Derecho to run on 'plain ol tcp/ip'...e.g. for dev, debug, testing on non-rdma-enabled hardware/networks?

Ok, this is enough to start. Thanks in advance.

sagarjha commented 6 years ago

We can discuss some of the design questions with everyone in our group, but I will answer some of the questions now. The rank is different from the id. Each node has a unique id, which can be any 32-bit unsigned integer. Any node, whatever its id, can be the leader. Rank is just the index of a node in the list of members, so ranks always run in sequence 0, 1, ..., num_members - 1. The rank 0 node is always the leader, and it can have any id. When the leader node exits the group, the next node in rank order gets a rank of 0 in the new view and becomes the leader.

I apologize if the server_rank variable name caused any confusion. We have erroneously used rank and id interchangeably in variable names, which makes the code hard to understand. The server_rank variable in derecho_bw_test etc. can be replaced by server_id, and it does not have to be 0. In all experiments, you need to start the initial leader node first. You just need a way to decide which node calls the special leader constructor when constructing the group.
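To make the id/rank distinction concrete, here is a small model of a membership list. `SketchView` is a hypothetical class for illustration, not Derecho's own view type. It assumes that a joining node is appended at the end and that survivors keep their relative order when a member leaves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a membership view; not Derecho's View class.
final class SketchView {
    // Node ids in rank order: the index in this list is the rank.
    // A long is used so that any 32-bit unsigned id fits.
    private final List<Long> members = new ArrayList<>();

    // Assumption: a joining node is appended, so it gets the next rank.
    void join(long nodeId) {
        if (members.contains(nodeId)) {
            throw new IllegalArgumentException("duplicate id " + nodeId);
        }
        members.add(nodeId);
    }

    // Assumption: removing a node shifts later members down by one rank.
    void leave(long nodeId) {
        members.remove(Long.valueOf(nodeId));
    }

    // Rank is dynamic: it is recomputed from the current member list.
    // Returns -1 if the node is not a member.
    int rankOf(long nodeId) {
        return members.indexOf(nodeId);
    }

    // The leader is whichever node currently holds rank 0.
    long leaderId() {
        return members.get(0);
    }
}
```

With members joined in the order 42, 7, 1000, node 42 is the leader even though it does not have the smallest id. After `leave(42)`, node 7 moves to rank 0 and becomes the leader.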

The ports are indeed specified directly in the code. Right now you have to change the default values in the code to change them. We plan to use config files later (Weijia might be already using them in some other branch).

There is a concept of protection domain in RDMA (http://www.rdmamojo.com/2012/08/24/ibv_alloc_pd/) but that is not for access control really. For all practical purposes, anyone can join the group. We are primarily targeting the datacenter use-case where all the nodes are in your control.

sagarjha commented 6 years ago

I don't think you can separate the group construction from group join. What you need to construct isn't very clear until you have joined the group. You are not part of a group unless you have joined, so the group doesn't exist until then. What this means for the code is that a node joining leads to a view change that computes the group membership and membership of the subgroups. Based on that, you construct objects for the subgroups the joining node is a part of. The layout of the SST row is also decided after group configuration is clear, so even the SST cannot be constructed before the join.

mpmilano commented 6 years ago

While Sagar is right that many mechanics of the group are unknown until join occurs, I think there is space to do as Scott suggests -- to detach creation of the Group object from initialization of the group itself. The group object before initialization would not be particularly featureful, but it could still allow one to register handlers or inspect static configuration state. Effectively, any operation performed on the group before "true" initialization would be buffered pending actual initialization.

That being said, Scott has hit upon a key distinction between the C++ (as interpreted by Derecho) and Java approaches to object use! Derecho fully embraces RAII --- and correspondingly avoids exposing objects to the user unless they have been fully initialized. Making this change within the C++ API would result in something of a mismatch of patterns with the rest of Derecho (and much of the standard C++ library landscape on which we depend). Perhaps we could add some sort of "ProtoGroup" class, which encapsulates Group and defers its construction, and expose this new class to Java as its notion of "Group".
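A rough sketch of the buffering idea, written in Java for illustration. `ProtoGroup` here is only the name floated above, `LiveGroup` is a hypothetical stand-in for the fully initialized group, and `registerHandler` is an invented operation; none of this is existing Derecho API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the fully initialized group.
final class LiveGroup {
    final List<String> handlers = new ArrayList<>();

    void registerHandler(String name) {
        handlers.add(name);
    }
}

// Hypothetical "ProtoGroup": records operations until join(), then replays them.
final class ProtoGroup {
    private final List<Consumer<LiveGroup>> pending = new ArrayList<>();
    private LiveGroup live; // null until join() has run

    void registerHandler(String name) {
        apply(g -> g.registerHandler(name));
    }

    // Run the operation now if the group exists, otherwise buffer it.
    private void apply(Consumer<LiveGroup> op) {
        if (live != null) {
            op.accept(live);
        } else {
            pending.add(op);
        }
    }

    boolean isJoined() {
        return live != null;
    }

    // Deferred construction: a real binding would construct the C++ Group here,
    // so the underlying object is still never exposed half-initialized.
    LiveGroup join() {
        if (live != null) {
            throw new IllegalStateException("already joined");
        }
        live = new LiveGroup();
        for (Consumer<LiveGroup> op : pending) {
            op.accept(live); // replay in registration order
        }
        pending.clear();
        return live;
    }
}
```

Operations registered before `join()` are replayed in order once the real object exists, and operations after `join()` go straight through.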

mpmilano commented 6 years ago

A quick note on extension vs. encapsulation for extensibility! Derecho's primary extensibility mechanisms are via templates -- it uses "concept-based", or "constraint-based" extensibility rather than "inheritance-based", if you will. Calls to methods of Derecho objects, including user-supplied Replicateds, are always statically-bound at compile time. Consequently, any extensions of classes which are not supplied directly to Derecho (via templates) won't be used.

This is relevant when considering the Java-equivalents of Replicated objects. If these are to be an interface-based API which relies on dynamic dispatch to find method implementations, we'll have to do something fancy to express that to the C++ side. It's also relevant when considering subclassing Group on the java side; attempts to override group methods are unlikely to be respected by the C++ side.

KenBirman commented 6 years ago

I think the plan was really that we would do a DLL with the “uncooked” Derecho APIs, since those basically don’t need to be templated and just take byte array arguments. Then Scott would latch onto that and do his own marshal/unmarshal. Sagar was going to add scatter-gather in any case, so at that point, if Scott has a scatter-gather-ready set of arguments, we could also support that from Java.

So in some sense, Scott’s world would be strongly typed, via Java types, but his way of using Derecho would hide all the type information entirely and just show us Java-allocated and pinned byte buffers, in registered regions of memory. Without those, we have no choice except to copy, because of registration issues and because of the risk that a pointer might swizzle under our feet if garbage collection/compaction suddenly runs. (That said, the new Mellanox firmware actually doesn’t require that memory be registered and can even handle page faults.)
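As an illustration of the Java half of that arrangement, the sketch below marshals a typed call into a direct `ByteBuffer`, which lives outside the garbage-collected heap. The wire layout (opcode, length, UTF-8 payload) and the `SketchCodec` name are assumptions made up for this example, and registering the memory with the RDMA layer is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical codec: typed on the Java side, opaque bytes on the native side.
final class SketchCodec {
    // Assumed layout: [int opcode][int length][UTF-8 payload bytes].
    static ByteBuffer marshal(int opcode, String payload) {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        // A direct buffer is allocated off-heap, so the collector
        // does not relocate its contents during compaction.
        ByteBuffer buf = ByteBuffer.allocateDirect(2 * Integer.BYTES + body.length);
        buf.putInt(opcode).putInt(body.length).put(body);
        buf.flip(); // switch from writing to reading
        return buf;
    }

    // Absolute read: does not move the buffer's position.
    static int opcodeOf(ByteBuffer buf) {
        return buf.getInt(buf.position());
    }

    // Reads through a duplicate so the caller's position is left untouched.
    static String unmarshalPayload(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        view.getInt(); // skip the opcode
        byte[] body = new byte[view.getInt()];
        view.get(body);
        return new String(body, StandardCharsets.UTF_8);
    }
}
```

A JNI layer can obtain the address of such a buffer with `GetDirectBufferAddress`, which is what would let native code read the bytes without an intermediate copy.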

(Sent by email in reply to Matthew Milano's comment of Friday, April 27, 2018, 2:41 PM, on Derecho-Project/derecho-unified issue #26.)


scottslewis commented 6 years ago

Thanks to both Sagar and Matthew for your answers and comments!

Some follow-up questions:

The rank is different from id. Each node has a unique id which is any 32 bit unsigned integer.

How is this unique id (not rank) created/assigned? In the examples/experiments that I've looked at the Group constructor always is passed a node_rank (really the id?), and the node_address (ip_addr).

WRT the group leader...in the examples...upon construction the rank is 0. I'm just trying to understand what Group state is static (i.e. id), and which is dynamic (e.g. rank) and how to initialize these values with minimum required input from the developer.

The ports are indeed specified directly in the code. Right now you have to change the default values in the code to change them. We plan to use config files later (Weijia might be already using them in some other branch).

Ok. So the Group creation/construction API should probably accommodate dev-specified ports somehow...e.g. config files, added parameters/options, etc. Note that a common approach these days is to pass in a uri (e.g. derecho://[:port]/id ). I understand this may not fit for derecho/rdma, but just presenting some design alternatives.
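For illustration only, a URI-style address could be parsed with the standard `java.net.URI` class. The `derecho` scheme, the placement of the host before the port, the meaning of the path as a node id, and the `GroupEndpoint` class are all assumptions of this sketch; none of them exist in Derecho today.

```java
import java.net.URI;

// Hypothetical parsed form of an address such as derecho://host:port/id.
final class GroupEndpoint {
    final String host;
    final int port;
    final long nodeId;

    private GroupEndpoint(String host, int port, long nodeId) {
        this.host = host;
        this.port = port;
        this.nodeId = nodeId;
    }

    // defaultPort is used when the URI carries no explicit port.
    static GroupEndpoint parse(String text, int defaultPort) {
        URI uri = URI.create(text); // throws IllegalArgumentException on bad syntax
        if (!"derecho".equals(uri.getScheme())) {
            throw new IllegalArgumentException("unexpected scheme: " + uri.getScheme());
        }
        if (uri.getHost() == null) {
            throw new IllegalArgumentException("missing host in " + text);
        }
        int port = uri.getPort() == -1 ? defaultPort : uri.getPort();
        String path = uri.getPath(); // for example "/7"
        if (path == null || path.length() < 2) {
            throw new IllegalArgumentException("missing node id in " + text);
        }
        long nodeId = Long.parseLong(path.substring(1));
        return new GroupEndpoint(uri.getHost(), port, nodeId);
    }
}
```

A single string such as `derecho://192.168.1.10:23580/7` (made-up values) would then carry host, port, and id together, with the port falling back to a default when omitted.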

For all practical purposes, anyone can join the group. We are primarily targeting the datacenter use-case where all the nodes are in your control.

I understand. It might be worth thinking about further, just to support other use cases (i.e. non-datacenter use such as 'fog' or 'smart edge').

Thanks Sagar for your answers to my questions! Good to meet you btw.

That being said, Scott has hit upon a key distinction between the C++ (as interpreted by Derecho) and Java approaches to object use!

Indeed! I suspect that's due to my own API design experience being primarily in Java. I've found, particularly for communications, that having everything set up by the constructor can make things very complex (i.e. a large and ever-growing set of constructor arguments). This is because there is state associated with things like connections or 'sessions', and a lot of need for listeners/callbacks that are called asynchronously as things happen (e.g. data are received, membership changes, etc.).

I certainly get that the template (compile-time) aspect to Derecho is by design...for some use cases and performance requirements I think that is necessary. However, I'm hoping that opening up a Java-style API alongside the C/C++ API would/will allow Derecho to be more easily used by plenty of Java programmers, and used by people like me to implement (e.g.) highly-performant remote services.

BTW Matthew...what does the acronym RAII mean?

WRT extension vs. delegation: as Ken suggested in his comment, for the time being I was thinking of all the typing (and serialization) being done in Java. However, I still think it makes sense to think about how one would (for example) extend a ProtoGroup, override methods in Java, and still have things work in the C++ API. One thing I hope this could lead to is some contribution of the API to the Java world, perhaps pointed in the direction of becoming a JRE add-on, especially since the JCP is changing so much (becoming more open).

KenBirman commented 6 years ago

More quick remarks:

KenBirman commented 6 years ago

RAII: Resource acquisition is initialization

sagarjha commented 6 years ago

Nice to meet you too, Scott! The node_rank that is passed to the constructor is really the id. You need to pass this to the nodes. You also need to pass in their IP address and the leader's IP address. All this stuff (automatically generating unique ids, as suggested by Ken) is definitely something we can look to do in the future. The first node that starts must start as the leader and should call the group constructor for the leader. I suggest you refer to typed_subgroup_test.cpp in derecho/experiments for a model implementation of this. Every node checks if my_ip == leader_ip and calls the appropriate constructor.

Constructor for the leader (from group.h):

    Group(const node_id_t my_id, tcp::socket leader_connection, const CallbackSet& callbacks, const SubgroupInfo& subgroup_info, std::vector<view_upcall_t> _view_upcalls, const int gms_port, Factory<ReplicatedTypes>... factories);

Constructor for the non-leader just has an additional parameter leader_ip:

    Group(const node_id_t my_id, const ip_addr my_ip, const ip_addr leader_ip, const CallbackSet& callbacks, const SubgroupInfo& subgroup_info, std::vector<view_upcall_t> _view_upcalls = {}, const int gms_port = derecho_gms_port, Factory<ReplicatedTypes>... factories);
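The my_ip == leader_ip check could be mirrored in a Java wrapper roughly as follows. `NodeConfig`, `Startup`, and `Role` are hypothetical names, and comparing addresses as plain strings is a simplifying assumption (it would not treat two spellings of the same address as equal).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-node settings: the id and address each node is given.
final class NodeConfig {
    final long id;
    final String ip;

    NodeConfig(long id, String ip) {
        this.id = id;
        this.ip = ip;
    }
}

// Hypothetical startup helper; not part of Derecho.
final class Startup {
    enum Role { LEADER, NON_LEADER }

    // Decides which of the two constructors a node should call.
    static Role roleFor(String myIp, String leaderIp) {
        return myIp.equals(leaderIp) ? Role.LEADER : Role.NON_LEADER;
    }

    // Orders nodes so that the leader is launched first.
    static List<NodeConfig> launchOrder(List<NodeConfig> nodes, String leaderIp) {
        List<NodeConfig> ordered = new ArrayList<>();
        for (NodeConfig n : nodes) {
            if (roleFor(n.ip, leaderIp) == Role.LEADER) {
                ordered.add(n);
            }
        }
        if (ordered.isEmpty()) {
            throw new IllegalArgumentException("no node has the leader address " + leaderIp);
        }
        for (NodeConfig n : nodes) {
            if (roleFor(n.ip, leaderIp) == Role.NON_LEADER) {
                ordered.add(n);
            }
        }
        return ordered;
    }
}
```

A launcher script could use `launchOrder` to start the leader before the others, and each node could use `roleFor` to pick its constructor.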

sagarjha commented 6 years ago

Finally, the rank is assigned in the order the nodes join the group. You can start all of them at once or one by one, just that the leader should be started first.