SysFera / vishnu

Modular and high-level middleware for tasks, files and information management in heterogeneous and distributed HPC environments
http://sysfera.github.com/vishnu.html

FMS: stat and ls return "invalid option" on some machines #330

Open rchakode opened 11 years ago

rchakode commented 11 years ago

Depending on the version of the operating system, the stat and ls commands used in FMS need different options to work properly. A quick fix has been provided: if that error occurs, the command is retried with the alternative options. But this is not an elegant solution. It would be better, for example, to add a field to the machine properties describing the type of the machine. Another solution is to check the type of the machine at the first connection to it.

bdepardo commented 11 years ago

We could use a cache in the daemon to store this information instead of adding it into the database. During the first command execution on a given machine we update the cache, and use it for subsequent calls.

keoo commented 11 years ago

I disagree with the cache solution, unless we keep the database and load the cache from it, but then we have to keep the cache up to date with the database, which adds some complexity. First, what do you do if the command fails the first time? Do you return the first error? The second error? You may return an error that does not correspond to the real problem. It works for a quick patch like the current one but does not seem robust enough for a viable solution. Then it would imply to create a map with all the machines and their system just for a command, it introduces lots of cases and potential errors in handling the data structure correctly (read and write access on a multi-threaded server). The database solution (read only) would limit the risk of introducing bugs.

Introducing this field in the database and asking this data in the add_machine service seems a bit problematic because the vishnu administrator may not know the right answer, and asking him such a technical question may seem too much.

What do you call the first connection on the machine? The user that makes vishnu connect? The user that uses any service executed on this machine? The user that uses the first FMS service on this machine? The user that uses the first stat on this machine?

I do not have a better idea for now. The one based on the first connection seems the most viable, but it needs to be described more precisely.

rchakode commented 11 years ago

+1 for this: "Introducing this field in the database and asking this data in the add_machine service seems a bit problematic because the vishnu administrator may not know the right answer, and asking him such a technical question may seem too much"

In the current solution, other errors are not ignored. The system tries the second syntax iff the error that occurred is an "illegal option" error.
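To make the retry rule concrete, here is a minimal sketch of it in C++. It is not the actual FMS code: `CommandResult`, `Runner`, `isOptionSyntaxError` and `runWithFallback` are hypothetical names, and matching on the substrings "invalid option" / "illegal option" in stderr is an assumption about how the error is recognized.

```cpp
#include <functional>
#include <string>

// Outcome of one command executed on a machine (hypothetical type).
struct CommandResult {
    int exitCode;
    std::string stderrText;
    std::string stdoutText;
};

// Hypothetical hook that runs a command line on the target machine.
using Runner = std::function<CommandResult(const std::string&)>;

// True only for the option-syntax failure. The message wording is an
// assumption: some tools say "invalid option", others "illegal option".
bool isOptionSyntaxError(const CommandResult& r) {
    return r.exitCode != 0 &&
           (r.stderrText.find("invalid option") != std::string::npos ||
            r.stderrText.find("illegal option") != std::string::npos);
}

// Run the primary syntax; retry with the alternative syntax only when the
// failure is an option-syntax error. Any other error is returned untouched.
CommandResult runWithFallback(const Runner& run,
                              const std::string& primary,
                              const std::string& alternative) {
    CommandResult first = run(primary);
    if (!isOptionSyntaxError(first)) {
        return first;
    }
    return run(alternative);
}
```

If the alternative syntax fails as well, this sketch returns the second result, which is exactly the "which error do you return?" question raised above.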

bdepardo commented 11 years ago

The problem with the database solution is, as you said, that you cannot ask the admin for such info; he may not know the answer. If it is detected at runtime (what I called "first connection", i.e., the first command that is being executed on the machine) then what is the point of setting it in the database? You only need to update the cache "once". Of course this requires a means to detect the OS type accurately.
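One possible way to detect the OS type at runtime, sketched under assumptions: run `uname -s` on the machine during the first command and classify its output. Nothing in this thread says vishnu does this; `OsFamily`, `trimRight` and `classifyUname` are hypothetical names, and the mapping from kernel name to option syntax is only a heuristic (a Linux machine is not guaranteed to ship GNU tools).

```cpp
#include <string>

// Hypothetical classification used to pick the option syntax.
enum class OsFamily { Unknown, Linux, Bsd };

// Remove trailing newline, carriage return and space characters.
std::string trimRight(std::string s) {
    while (!s.empty() &&
           (s.back() == '\n' || s.back() == '\r' || s.back() == ' ')) {
        s.pop_back();
    }
    return s;
}

// Classify the raw output of `uname -s`. Unrecognized names map to
// Unknown, in which case the caller could keep the retry behaviour.
OsFamily classifyUname(const std::string& unameOutput) {
    const std::string name = trimRight(unameOutput);
    if (name == "Linux") {
        return OsFamily::Linux;
    }
    if (name == "Darwin" || name == "FreeBSD" ||
        name == "OpenBSD" || name == "NetBSD") {
        return OsFamily::Bsd;
    }
    return OsFamily::Unknown;
}
```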

"Then it would imply to create a map with all the machines and their system just for a command, it introduces lots of cases" -> there aren't that many machines in the system. What are the different cases that you think of? We currently have two, and adding a few others might not be such a big deal (we already handle several cases for batch schedulers, so why not for FMS commands?). Currently this is only for one command, but others might have different options on different systems.

Having a thread safe cache shouldn't be that hard (it is basically a thread safe map).
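As a rough illustration of what such a thread safe map could look like, here is a minimal sketch using a single `std::mutex`. The class name `MachineOsCache`, its methods, and the choice of storing the OS type as a plain string keyed by machine id are all assumptions made for the example, not existing vishnu code.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical in-daemon cache: machine id -> detected OS type.
class MachineOsCache {
public:
    // Copies the cached OS type into osType and returns true when the
    // machine is known; returns false and leaves osType untouched otherwise.
    bool lookup(const std::string& machineId, std::string& osType) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(machineId);
        if (it == entries_.end()) {
            return false;
        }
        osType = it->second;
        return true;
    }

    // Records (or overwrites) the OS type detected for a machine.
    void store(const std::string& machineId, const std::string& osType) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_[machineId] = osType;
    }

private:
    mutable std::mutex mutex_;  // guards entries_ for readers and writers
    std::unordered_map<std::string, std::string> entries_;
};
```

Two threads that miss at the same time would both run the detection and both call `store`; since they should detect the same value, the duplicate write is harmless in this sketch.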

keoo commented 11 years ago

Dirty solution in place and functional; set as enhancement now.