cbm-fles / flesnet

CBM FLES Timeslice Building

FLES_libfabric with Fault Tolerance #74

Closed fsalem closed 3 years ago

cuveland commented 4 years ago

It all looks very neat and tidy. Thanks for the great job!

I have no comments on the vast majority of changes within lib/fles_libfabric. This falls within your area of responsibility, and I did not follow every line.

What is new to me is the dependence on MPI. I cannot yet foresee all the implications of this global change. I think it would make sense to continue to support systems without MPI at least to the same extent as before. I have nothing against it in principle, but can we make it optional in the build process for now? It could be coupled to something like CMake's USE_LIBFABRIC.
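To make the idea of an optional dependency concrete, here is a minimal sketch of how the C++ side could look if MPI were switched on by a compile definition. The macro name `FLESNET_USE_MPI` and the helpers `mpi_enabled()` and `global_barrier()` are hypothetical and do not come from the branch; only `MPI_Initialized`, `MPI_Finalized`, and `MPI_Barrier` are actual MPI calls.

```cpp
// Sketch only: FLESNET_USE_MPI is a hypothetical compile definition that the
// build system would set when MPI support is requested.
#ifdef FLESNET_USE_MPI
#include <mpi.h>
#endif

// Reports whether this binary was compiled with MPI support.
constexpr bool mpi_enabled() {
#ifdef FLESNET_USE_MPI
  return true;
#else
  return false;
#endif
}

// Synchronizes all ranks if MPI is compiled in and currently usable.
// Returns true only if a barrier was actually executed.
inline bool global_barrier() {
#ifdef FLESNET_USE_MPI
  int initialized = 0;
  int finalized = 0;
  MPI_Initialized(&initialized);
  MPI_Finalized(&finalized);
  if (initialized != 0 && finalized == 0) {
    return MPI_Barrier(MPI_COMM_WORLD) == MPI_SUCCESS;
  }
#endif
  return false;
}
```

Built without the definition, the code has no MPI dependency at all and `global_barrier()` returns immediately, so systems without MPI would behave as before.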

A few more questions about MPI: What is MPI actually used for? Only for the barrier? The use of MPI changes the life cycle of the processes significantly. What happens when a process ends? Can the system still shut down correctly, including the handling of shared memories? How can the system survive the failure of a computing node when using MPI?

In lib/fles_ipc/System.cpp, I think you might have introduced an error. As the documentation states (see https://www.man7.org/linux/man-pages/man2/gethostname.2.html) for gethostname: "On success, zero is returned. On error, -1 is returned, and errno is set appropriately." This is also true for getdomainname.
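For illustration, a check that follows the documented convention could look roughly like the sketch below. The helper names `throw_on_posix_error` and `current_hostname` are made up for this example and are not the functions in System.cpp; the buffer size of 256 is likewise an assumption.

```cpp
#include <cerrno>
#include <string>
#include <system_error>
#include <unistd.h>

// Throws std::system_error if a POSIX-style call reported failure.
// Calls such as gethostname return 0 on success and -1 on error (errno set),
// so any nonzero return value is treated as an error here.
inline void throw_on_posix_error(int return_value, const char* what) {
  if (return_value != 0) {
    throw std::system_error(errno, std::generic_category(), what);
  }
}

// Hypothetical helper: returns the host name or throws on failure.
inline std::string current_hostname() {
  // Zero-initialized and passed one byte short, so the result is always
  // null-terminated even if the name was truncated.
  char buffer[256] = {};
  throw_on_posix_error(gethostname(buffer, sizeof(buffer) - 1), "gethostname");
  return std::string(buffer);
}
```

The same helper would apply to getdomainname, since it uses the same return convention.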

The additional flesnet parameters introduced are for the most part well structured.

The "scheduler-*" parameters obviously target the behaviour of the new scheduler. To prevent any confusion, I think it would make sense to add a remark like "libfabric only" to the parameter description.

In the case of "drop-process-ts" I am not sure how this is to be used. Is this also libfabric-only? If so, we should definitely label it as such. Or do you envision that this would also make sense to use in the other transports?

I'm afraid that the two parameters "log-directory" and "enable-logging" could lead to confusion. I would suggest leaving them out. We already have the global parameters "log-level" and "log-file", which refer to the global log system. That system can also be redirected to syslog, for example, or later possibly to another monitoring system. If at all possible, I would suggest using this existing system here as well. As an idea: instead of writing several files, one could simply differentiate the entries using an additional column, which is then filtered on in the evaluation.
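A rough sketch of the additional-column idea, purely as an illustration: the tab-separated record layout and the function names `format_record` and `filter_by_source` are assumptions for this example and do not reflect the format of the existing log system.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record layout: "<severity>\t<source>\t<message>".
// The second column plays the role of the separate log files.
inline std::string format_record(const std::string& severity,
                                 const std::string& source,
                                 const std::string& message) {
  return severity + '\t' + source + '\t' + message;
}

// Returns the records whose second column equals `source`.
inline std::vector<std::string>
filter_by_source(const std::vector<std::string>& records,
                 const std::string& source) {
  std::vector<std::string> matches;
  for (const auto& record : records) {
    std::istringstream fields(record);
    std::string severity;
    std::string column;
    if (std::getline(fields, severity, '\t') &&
        std::getline(fields, column, '\t') && column == source) {
      matches.push_back(record);
    }
  }
  return matches;
}
```

With a layout like this, the evaluation step selects one source from a single log stream instead of opening one file per component.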

About logging and console output: as we slowly move towards a scalable production system, we have to be careful here. I think in general we should avoid any console output that bypasses the log system.

In lib/fles_libfabric/Provider.cpp, fprintf(stdout, ...) is used extensively for a debug dump. The "printf" family of functions is not used anywhere else in the flesnet source code, and std::cout/std::cerr are also used only in rare circumstances. Could you use L(trace) or something similar instead here?
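To show the pattern without relying on the details of the flesnet log macros, here is a self-contained sketch. The `Logger` class, the `Severity` enum, and `dump_provider_info` are stand-ins invented for this example; in the real code the existing log system would take the place of `Logger`.

```cpp
#include <ostream>
#include <sstream>
#include <string>

enum class Severity { trace, debug, info, warning, error };

// Stand-in for a project log sink (hypothetical; not the flesnet logger).
class Logger {
public:
  Logger(std::ostream& sink, Severity threshold)
      : sink_(sink), threshold_(threshold) {}

  // True if messages of this severity pass the configured threshold.
  bool enabled(Severity severity) const { return severity >= threshold_; }

  // Writes the message as one line if its severity is enabled.
  void log(Severity severity, const std::string& message) {
    if (enabled(severity)) {
      sink_ << message << '\n';
    }
  }

private:
  std::ostream& sink_;
  Severity threshold_;
};

// Debug dump routed through the logger instead of fprintf(stdout, ...).
// The text is only formatted when trace output is enabled.
inline void dump_provider_info(Logger& logger, const std::string& name,
                               int version) {
  if (!logger.enabled(Severity::trace)) {
    return;
  }
  std::ostringstream text;
  text << "provider: " << name << ", version: " << version;
  logger.log(Severity::trace, text.str());
}
```

The point of the design is that the dump obeys the global log level and destination, so it disappears in production runs without touching the console.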

I think with these few changes, we should be able to merge the branches successfully in a very short time.

fsalem commented 4 years ago

Thanks Jan @cuveland for your comments. I updated the code to consider most of your comments. Please find below my comment to each of your points:

Please let me know your comments about the latest changes.

cuveland commented 3 years ago

Hi @fsalem,

Thanks for reminding me of this merge request. In preparation for our meeting on Monday I revisited the changes. I think your updates resolve the comments. Thank you!

fsalem commented 3 years ago

Hi @cuveland,

Thanks so much :-)

fsalem commented 3 years ago

Hi @cuveland,

'fles_libfabric_DFS_01AUG20' now contains the latest commits of the master branch. Please let me know if I should update anything else.