dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

make psu ls pool -l useful #1966

Closed calestyo closed 8 years ago

calestyo commented 8 years ago

Hey.

Right now, psu ls pool -l gives something like:

lcg-lrz-dc62_11  (enabled=true;active=22;rdOnly=false;links=0;pgroups=0;hsm=[];mode=enabled)
 linkList   :
lcg-lrz-dc08_0  (enabled=true;active=19;rdOnly=false;links=0;pgroups=1;hsm=[];mode=enabled)
 linkList   :
lcg-lrz-dc08_1  (enabled=true;active=18;rdOnly=false;links=0;pgroups=1;hsm=[];mode=enabled)
 linkList   :

Which is already everything in psu ls pool -a, which gives something like this:

lcg-lrz-dc03_0  (enabled=true;active=19;rdOnly=false;links=0;pgroups=0;hsm=[];mode=enabled)
 linkList   :
 pGroupList : 
lcg-lrz-dc03_1  (enabled=true;active=19;rdOnly=false;links=0;pgroups=0;hsm=[];mode=enabled)
 linkList   :
 pGroupList : 

From my personal experience, the additional data (linkList) in particular is nothing one needs very often in daily business. However, it would be much better to have a more structured output of all pools. Since linkList is already in -a, I'd propose to change -l to output something like this:

lcg-lrz-dc62_11   (enabled=true; active=22; rdOnly=false; links=0; pgroups=0; hsm=[]; mode=enabled)
lcg-lrz-dc08_0    (enabled=true; active=9;  rdOnly=false; links=0; pgroups=1; hsm=[]; mode=enabled)
lcg-lrz-dc08_1    (enabled=true; active=18; rdOnly=false; links=0; pgroups=1; hsm=[]; mode=enabled)

The important part here is that everything should be aligned in columns. Also, add a space after the ";": this makes things much more readable, and one needs spaces anyway for the alignment.
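A minimal sketch of the alignment described above, written as a Python filter over the existing output rather than as a change to dCache itself. The function name `align_pool_lines` is made up for illustration, and the sketch assumes that no field value contains a ";".

```python
import re

# Assumed input shape, taken from the `psu ls pool -l` samples in this issue:
#   name  (key=value;key=value;...)
LINE_RE = re.compile(r"^(\S+)\s+\((.*)\)\s*$")


def align_pool_lines(lines):
    """Re-emit pool lines with every field padded to a common column width."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip continuation lines such as " linkList   :"
        name, body = match.groups()
        rows.append((name, [field.strip() for field in body.split(";")]))
    if not rows:
        return []

    name_width = max(len(name) for name, _ in rows)
    field_count = max(len(fields) for _, fields in rows)
    # Width of each field column, including its trailing ";" separator.
    widths = [
        max(len(fields[i]) + 1 for _, fields in rows if i < len(fields))
        for i in range(field_count)
    ]

    out = []
    for name, fields in rows:
        cells = []
        for i, field in enumerate(fields):
            is_last = i == len(fields) - 1
            cells.append(field if is_last else (field + ";").ljust(widths[i]))
        out.append(f"{name.ljust(name_width)}   ({' '.join(cells)})")
    return out
```

Fed the `-l` sample lines, this yields the column layout proposed above; the `linkList` lines are simply dropped.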

Last but not least, it would be awesome if this could use colours... E.g. if a pool isn't enabled, then "enabled" should be printed in red... or if the ping time of a pool is higher than, say, the median of all ping times + 25 percent, make the ping time orange... if +50%, red. If a pool has neither links nor pgroups (i.e. both = 0), make both numbers orange (= warning).

And for any pool to which one of the "non-standard" conditions from above applies, print the name of the pool itself in some colour... either just some grey, or if one of the conditions was a warning, make it orange... if one was a more serious thingy (i.e. pool down), make it red as well.

Cheers, Chris.

calestyo commented 8 years ago

(obviously one could apply some of the ideas herein for -a as well ... and obviously, there should be no colours if the shell is connected to a non-terminal)
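A rough sketch of the colouring rules, assuming the threshold is measured against the median of all ping times and using ANSI yellow as a stand-in for "orange" (basic ANSI has no orange). The function names are illustrative only; nothing like this exists in the admin shell.

```python
import statistics
import sys

RED, YELLOW, RESET = "\033[31m", "\033[33m", "\033[0m"


def ping_severity(ping, all_pings):
    """Classify one ping time relative to the median of all ping times."""
    median = statistics.median(all_pings)
    if ping > median * 1.5:
        return "crit"  # more than 50% above the median
    if ping > median * 1.25:
        return "warn"  # more than 25% above the median
    return "ok"


def colourise(text, severity, stream=sys.stdout):
    """Wrap text in an ANSI colour, but only when the stream is a terminal."""
    if severity == "ok" or not stream.isatty():
        return text
    colour = RED if severity == "crit" else YELLOW
    return f"{colour}{text}{RESET}"
```

The `isatty()` check covers the non-terminal case: piped or redirected output stays free of escape sequences.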

gbehrmann commented 8 years ago

Hi,

I agree that there are many commands that should be cleaned up. If we are going to change this, wouldn't it be better to go in the direction of what the new space manager does?

The reason the output used to be more geared towards programmatic parsing rather than human readability was that the output was used by other parts of dCache, pcells or scripts. We have to check that at least pcells doesn't use it (although I fear it does).

Colors in cell commands is currently not an option - we don't have the infrastructure for that (yet).

/gerd

calestyo commented 8 years ago

Well then, let's keep this here as an enhancement idea for the record...

What do you mean by "direction that SM goes"? In any case I wouldn't mind if the command names were cleaned up (one should really be able to skip the "psu")... and perhaps things like allowing more parameters: psu addto pgroup pool1 pool2 pool3... Or even better... set pgroup +pool1 +pool2 -pool3

Many possible ideas ;-)

gbehrmann commented 8 years ago

@maswan actually asked for multiple arguments for all psu commands too today :-)

What I meant regarding the SM was aligned column output with a title line in capital letters and no field labels on every row.
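To make that layout concrete, here is a small sketch that renders records as aligned columns under a title line in capitals, with no field labels on the rows. The column names in the usage example are hypothetical and not the space manager's real ones.

```python
def render_table(records, columns):
    """Render dicts as left-aligned columns under an upper-case title line."""
    header = [column.upper() for column in columns]
    body = [[str(record.get(column, "")) for column in columns] for record in records]
    table = [header] + body
    widths = [max(len(row[i]) for row in table) for i in range(len(columns))]
    return [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    ]
```

For example, `render_table([{"pool": "lcg-lrz-dc08_0", "enabled": "true", "active": 19}], ["pool", "enabled", "active"])` gives one title line and one data row, with two spaces between columns.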

calestyo commented 8 years ago

Ah I see... well, I have no general objection, but the things should stay parsable (unless the longterm plan is to offer a simple-to-use interface to all that data for the most commonly used languages, shell, C, python, perl). People will want to do scripting (e.g. I read the space manager's ls data, and automatically adjust the size of tokens, because that always seems to fluctuate,... and also that things work automatically when I add/remove pools (each of our pools belongs to exactly one token))... other examples would be Nagios checks (which hopefully come before the physicists fulfil their in-50-years fusion-energy promise ;)

So if it's done like in the space manager, then it's important that the columns are fixed (even if there's an empty field, there should be something like "") and that if the order or the whitespace around fields changes, this is loudly announced in the release notes =)
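The empty-field concern can be illustrated with a short sketch. Splitting a row on whitespace shifts every later value when one field is empty, whereas slicing each row at the offsets of the title line does not. This assumes left-aligned values that never overflow their column; the function name and the sample columns are made up.

```python
import re


def parse_fixed_columns(lines):
    """Parse a title line plus rows by slicing at the title column offsets."""
    header, *rows = lines
    titles = list(re.finditer(r"\S+", header))
    names = [match.group().lower() for match in titles]
    starts = [match.start() for match in titles]
    ends = starts[1:] + [None]  # the last column runs to the end of the line
    return [
        {name: row[start:end].strip() for name, start, end in zip(names, starts, ends)}
        for row in rows
    ]
```

With a title line `POOL    LINKS  HSM`, a row whose LINKS field is blank still parses to an empty string for `links` instead of picking up the HSM value.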

calestyo commented 8 years ago

And as for the changing of command syntax...

I really haven't spent much time thinking about that, but if you do start efforts in that area, then I think it would be good to tell people on the user-forum before you actually implement anything.

Coming up with a good schema that is generic and powerful is probably difficult... especially when the long term goal is that one can use those commands also from outside the admin shell. (You remember our talk in Amsterdam about the idea to export admin shell completion to the normal shell? So that people can mix up their commands like

$ dcache_admin \sp ls pool | grep pool_[123]* | sed "s/^/\sp set pool /; s/$/ rdonly/" | dcache_admin

in order to set all pools matching pool_1, pool_2 and pool_3 rdonly; note that this would be executed in bash.)
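For comparison, the grep/sed part of that pipeline can be sketched in Python. This only builds the command strings used in the example above; how they reach the admin shell (the hypothetical `dcache_admin` wrapper) is left out, and the function name is made up.

```python
import re


def rdonly_commands(pool_names, pattern=r"pool_[123]"):
    """Build one 'set pool ... rdonly' line per pool name matching pattern."""
    matcher = re.compile(pattern)
    return [
        f"\\sp set pool {name} rdonly"
        for name in pool_names
        if matcher.search(name)  # like grep: match anywhere in the line
    ]
```

Like the grep in the pipeline, an unanchored pattern also matches names such as pool_12, so a real script would probably want to anchor it.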

So to get something really simple/generic/powerful here, I think it's the best if many people have an eye over proposals :)

gbehrmann commented 8 years ago

Ah I see... well, I have no general objection, but the things should stay parsable (unless the longterm plan is to offer a simple-to-use interface to all that data for the most commonly used languages, shell, C, python, perl).

That's an inherent problem with wanting to make things readable. I see parsing the output of these commands beyond one-off scripts as an anti-pattern. It's always a workaround for some other problem.

e.g. I read the space manager's ls data, and automatically adjust the size of tokens, because that always seems to fluctuate,... and also that things work automatically when I add/remove pools (each of our pools belongs to exactly one token)

  • what exactly is fluctuating? I mean, you have an agreement with your customer on the size of a reservation [you reserved X bytes for them which isn't going to be used by other people - that's what a reservation is], how does it fluctuate?
  • funny how people end up using space reservations in exactly the way we supported these things before we had space reservations in dCache and before everybody told us that a fixed allocation of pools bound to a particular directory was not good enough (because that's what we had before space reservations).

For the record, we don't have pools for particular reservations - we cannot be bothered making new pools whenever a customer comes up with a new reservation they want. We got pools assigned for a particular VO and if they want us to prereserve some space in a reservation for them, we happily do that, but it is taken from the entire set of pools assigned to that VO. That of course also means it takes us a minute to create or remove a reservation for them as it is just a single command in space manager.

other examples would be Nagios checks (which hopefully come before the physicists fulfil their in-50-years fusion-energy promise ;)

Don't know what you mean by "come" - are you expecting us to write site Nagios checks? I mean, NDGF has plenty of Nagios checks against our dCache - all specific to our unique installation.

So if it's done like in the space manager, then it's important that the columns are fixed (even if there's an empty field, there should be something like "") and that if the order or the whitespace around fields changes, this is loudly announced in the release notes =)

I do not consider that particularly readable. I don't know of any other shell program that would put "" into a column if there was no value. There is a standard comment in most release notes saying that the output of some commands may have changed. If people run scripts against an interface intended for humans, they have to expect that it may fail between feature releases. It is no different than if people scrape HTML - it is going to fail periodically. Now, this doesn't mean that doing stuff like what you mentioned (piping output through other utilities) shouldn't be done - on the contrary, that's just fine, and it too works just fine with regular shell utilities.

I would expect you of all people to understand. The arguments you bring forward now are exactly the arguments that have been used all these years NOT to change any of those many things you have asked for over the years. :-)

especially when the long term goal is that one can use those commands also from outside the admin shell.

Yes I do remember, no, I don't like it, and I have no knowledge of this being a long term goal in any way. What you describe can already be done easily (the dcache_admin command you describe is easily definable as a shell alias) and I expect many have something like that. I don't see how anything needs to change in the admin shell for that.

So to get something really simple/generic/powerful here, I think it's the best if many people have an eye over proposals :)

I am uncertain exactly about what proposal you are talking about. We are not going to run every minor change in admin commands through user-forum. If people are interested, they can follow development in reviewboard and on github. I have run some stuff through user-forum in the past and will continue to do so, but I have no idea what you mean by "schema" in this context.

So if it's done like in the space manager, then it's important

What is important to me is that it is consistent and easy to use interactively (i.e. readable). I don't want every service having its own weird output format.

In conclusion, I will probably close this issue as won't fix - since you yourself are relying on parsing the output of these commands, there is no point in changing them now. Similarly, changing options, command prefixes, etc. just to make them nicer is going to break scripts, only for the sake of making things easier to read and use for humans (which I would like, but you and others are arguing against doing that). If all goes according to plan, the entire pool manager will be dropped within the next year or two and it's all gone anyway.

calestyo commented 8 years ago

On Mon, 2015-11-23 at 13:33 -0800, Gerd Behrmann wrote:

Ah I see... well, I have no general objection, but the things should stay parsable (unless the longterm plan is to offer a simple-to-use interface to all that data for the most commonly used languages, shell, C, python, perl).

That's an inherent problem with wanting to make things readable. I see parsing the output of these commands beyond one-off scripts as an anti-pattern. It's always a workaround for some other problem.

I don't think it's impossible to make readable stuff parsable... As with the space manager, it works quite well. Just make sure to communicate to people when things change.

  • what exactly is fluctuating? I mean, you have an agreement with your customer on the size of a reservation [you reserved X bytes for them which isn't going to be used by other people - that's what a reservation is], how does it fluctuate?

Sorry, it's actually the free space in the link group that fluctuates... I never really found out where that comes from... part of it may be that the filesystem eats up space for metadata, but I rather had the impression that other reasons caused that as well, as sometimes I saw negative and positive values in the available space of the LGs.

  • funny how people end up using space reservations in exactly the way we supported these things before we had space reservations in dCache and before everybody told us that a fixed allocation of pools bound to a particular directory was not good enough (because that's what we had before space reservations).

Well... don't blame me for stuff that ATLAS wanted... ;-) I know several sites which don't have this fixed mapping as we have it... and every time they've had pool losses or so, they had much more trouble cleaning up.

For the record, we don't have pools for particular reservations - we cannot be bothered making new pools whenever a customer comes up with a new reservation they want. We got pools assigned for a particular VO and if they want us to prereserve some space in a reservation for them, we happily do that, but it is taken from the entire set of pools assigned to that VO. That of course also means it takes us a minute to create or remove a reservation for them as it is just a single command in space manager.

With "we" you mean you at NDGF? Or what exactly are you talking about? What we do here is simply that we have one link group per space token, and then each pool is in only one LG.

other examples would be Nagios checks (which hopefully come before the physicists fulfil their in-50-years fusion-energy promise ;)

Don't know what you mean by "come" - are you expecting us to write site Nagios checks? I mean, NDGF has plenty of Nagios checks against our dCache - all specific to our unique installation.

Well, it's already some time ago: when we had a dCache family meeting in HH, we talked about Nagios checks... especially powerful generic ones which work for "any" setup. I was thinking about something that should produce results like: http://my-plugin.de/wiki/projects/check_multi/screenshot I.e. these tree views are IMHO quite nice. So you could have things like check_dcache_pool, which then shows a tree of all the pools on that node, with subtrees that display certain monitored properties of each pool, while the overall OK/WARNING/ERROR just tells whether any pool failed. Similar for the doors or core services like poolmanager.

Crucial IMHO was that such checks would need to dynamically scale to newly added pools/doors. E.g. I wouldn't want to configure separate Nagios checks for pool1, 2 and 3 on host A... I would just want to say "check pools", and it automatically sets up such a combined check for all existing pools. In the meantime this can even be done for performance data, so one can in principle write a check which returns multiple perfdata records and which is then understood by e.g. PNP4Nagios to get history plots. All quite nice... but a lot of work...
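A very rough sketch of such a combined check, following the usual Nagios plugin conventions (exit codes 0/1/2/3, perfdata after a "|"). The input mapping is hypothetical: a real check would first have to collect per-pool status and values from dCache, which is not shown here.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_pools(pools):
    """Combine per-pool results into one status line plus perfdata.

    `pools` maps a pool name to (status, value), where status is one of
    OK, WARNING or CRITICAL and value is any numeric metric.
    """
    if not pools:
        return UNKNOWN, "POOLS UNKNOWN - no pools found"
    worst = max(status for status, _ in pools.values())
    failed = sorted(name for name, (status, _) in pools.items() if status != OK)
    summary = f"{len(pools) - len(failed)}/{len(pools)} pools OK"
    if failed:
        summary += " (not OK: " + ", ".join(failed) + ")"
    perfdata = " ".join(
        f"'{name}'={value}" for name, (_, value) in sorted(pools.items())
    )
    return worst, f"POOLS {LABELS[worst]} - {summary} | {perfdata}"
```

Because the mapping is built at run time, newly added pools show up in both the summary and the perfdata without any change to the check configuration.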

So if it's done like in the space manager, then it's important that the columns are fixed (even if there's an empty field, there should be something like "") and that if the order or the whitespace around fields changes, this is loudly announced in the release notes =)

I do not consider that particularly readable. I don't know of any other shell program that would put "" into a column if there was no value. There is a standard comment in most release notes saying that the output of some commands may have changed. If people run scripts against an interface intended for humans, they have to expect that it may fail between feature releases. It is no different than if people scrape HTML - it is going to fail periodically. Now, this doesn't mean that doing stuff like what you mentioned (piping output through other utilities) shouldn't be done - on the contrary, that's just fine, and it too works just fine with regular shell utilities.

I would expect you of all people to understand. The arguments you bring forward now are exactly the arguments that have been used all these years NOT to change any of those many things you have asked for over the years. :-)

Well, you should remember that I'm always the lone one who says: change as many things as you want if it's for cleaner design... I even say that there's no need to write compatibility wrappers and stuff, which will sooner or later get dropped anyway, so admins will sooner or later have to do the work. The only thing I say: document it properly. And my remark above that this needs to be done wasn't a complaint; actually it has worked quite well lately, in that any such changes were properly documented in the release notes.

especially when the long term goal is that one can use those commands also from outside the admin shell.

Yes I do remember, no, I don't like it, and I have no knowledge of this being a long term goal in any way. What you describe can already be done easily (the dcache_admin command you describe is easily definable as a shell alias) and I expect many have something like that. I don't see how anything needs to change in the admin shell for that.

Ah? I thought you were quite fond of that idea. No, nothing really needs to change in the admin interface... well, except perhaps what I asked for in this ticket, because -l outputs e.g. all kinds of stuff which one regularly wouldn't need, so I have to do additional grep'ing on the shell to get these lines away. Plus one would need to find out whether the admin shell completion could somehow be used by bash completion.

Well, I think it's better to re-design the output of commands if this seems better, and let people adapt their scripts, than to have an ever growing number of things depending on formats that have quite some room for improvement. Plus, after all, commands in the admin shell should be primarily intended for interactive use (which of course doesn't mean one should make parsing unnecessarily difficult), so that should be the main motivation, and here I think my suggestions weren't too bad.

Cheers.

calestyo commented 8 years ago

(apparently, replying via email didn't really work as it should... sry)

gbehrmann commented 8 years ago

With "we" you mean you at NDGF? Or what exactly are you talking about? What we do here is simply that we have one link group per space token, and then each pool is in only one LG.

Yes, I meant NDGF. It wasn't meant as if there is anything wrong with having fixed pools per reservation. Just saying that we don't.

The amount of free space in a link group as reported in pool manager is going to fluctuate as files get uploaded and deleted. Also the values in space manager are going to fluctuate as there is a delay between files getting added in space manager and the data actually being written (and free space being consumed) on the pools. These are however short term fluctuations and certainly should not prompt you to adjust any reservations - things are eventually consistent.

As for Nagios checks, it's outside this ticket. I am not aware of any activity within dCache.org partners to write such checks (I have asked for it in the past though).

Ah? I thought you were quite fond of that idea.

Oh, you meant just the completion. I thought you were talking about entirely dropping the ssh admin shell and replacing it with commands invoked from the regular system shell. I suggest you create a specific enhancement request for that and assign it to Karsten - he maintains the bash completion script.

gbehrmann commented 8 years ago

FYI: dCache calls https://docs.oracle.com/javase/8/docs/api/java/net/InetAddress.html#getLocalHost-- to determine the address of the local host:

Returns the address of the local host. This is achieved by retrieving the name of the host from the system, then resolving that name into an InetAddress.

This is why, if you point the name of the host to the loopback interface, dCache will consider the address of the host to be the loopback.
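The effect is easy to reproduce outside Java. The sketch below is a Python analogue of the two steps quoted above (get the host name, then resolve it); it is not dCache's code, and the function name is made up. The resolver is a parameter so that the check can be tried without touching /etc/hosts.

```python
import ipaddress
import socket


def local_address_is_loopback(resolve=socket.gethostbyname, hostname=None):
    """Resolve the host's own name and report whether it maps to loopback."""
    name = hostname if hostname is not None else socket.gethostname()
    address = ipaddress.ip_address(resolve(name))
    return address.is_loopback  # true for all of 127.0.0.0/8 and for ::1
```

On a host whose name is bound to 127.0.1.1 in /etc/hosts, calling `local_address_is_loopback()` with the defaults would be expected to return True.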

When I googled the issue about pointing the host name to the loopback, the top hit was an email you sent to the Debian folks. In the replies they suggested not to bind the host name to any IP in /etc/hosts and instead rely on DNS doing its job. Doesn't that work with dCache?

gbehrmann commented 8 years ago

I would also like to quote https://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_hostname_resolution:

For a system with a permanent IP address, that permanent IP address should be used here instead of 127.0.1.1.

calestyo commented 8 years ago

I knew that there are these small fluctuations due to uploads/deletions... though in the past (>2 years) I've seen LGs for which that out-of-sync grew bigger and never seemed to recover... bigger in the sense of a few gigs... perhaps some other inconsistencies... But nowadays that adjustment tool is mainly to make my life easier... I simply add/remove pool, and the tokens take that up automatically =)

No I don't think we should drop the admin shell itself... though I personally wouldn't need it to be SSH... if people want remote shell access they can always tunnel anything via openssh, and that's typically more secure and well maintained than some Java SSH implementation =)

As for the completion stuff, done in #1969

As for the address determination... I know that some systems do it like that, and apparently Java even advertises it ^^ but it's just a hack, nothing guaranteed to work (as already mentioned in the other ticket). The local hostname is not guaranteed to resolve, and especially not to the global address (even though many systems do so for historic reasons)... actually if it does (resolve to the global address) it causes quite a number of other issues, which is why Debian already changed that years ago. Plus, with multihomed systems or dual stacked systems, the situation is even worse. Or just think about NATed systems

I'm not sure in which situation dCache needs to know its own address (because normally it should simply be able to use the destination address of the incoming request?), so I cannot really say what the appropriate solution is. But virtually any other daemon I know either requires people to specify which addresses they want to bind to (and uses these as the global addresses) or uses something like STUN, TURN, ICE.

Regarding your quote... that seems a bit outdated, or at least it's practically not done anymore. In Debian there's unfortunately one cabal which pushes towards network manager (which I think doesn't set the hostname, and thus the one pointing to 127.0.1.1 would be used) and another cabal pushing towards systemd-networkd (which AFAICS relies on libnss-myhostname, which AFAICS uses 127.0.0.1 unless anything else has been specifically configured).

Cheers, Chris.

gbehrmann commented 8 years ago

So you are saying the main Debian user manual is out of date?

There may not be a guarantee, but it works everywhere, so good enough for now.

kofemann commented 8 years ago

Most services in dCache support a listen option to bind to, and you can also force the local name with:

dcache.java.options.extra=-Dorg.dcache.net.localaddresses=host.name.to.expose

gbehrmann commented 8 years ago

@kofemann, yeah, I think the complaining is about the default behaviour if you didn't configure anything.

maswan commented 8 years ago

There is no better guess possible than the hostname, and that will work fine in the majority of situations. We make sure to configure a proper FQDN hostname on our (ndgf.org and hpc2n.umu.se) servers so that that'll work, but on a mobile laptop 127.0.0.1 is more likely.

If the default doesn't work, you'll have to specify one in dcache.java.options.extra, so I don't really understand what the problem is. Maybe setting hostname is a bit obscure in the documentation, possibly?

gbehrmann commented 8 years ago

This thread is intermixed with the thread in #1958. Those discussions should happen on user-forum where other admins may comment on how to properly configure a server. As it is, this issue is no longer descriptive of a particular problem and will be closed as invalid.