dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

resilience: improve documentation of options #3310

Closed calestyo closed 7 years ago

calestyo commented 7 years ago

Hi.

I was having a look at the .properties documentation of 2.16, and some things regarding the resilience service remained pretty unclear to me, even after looking again at the slides from Umeå.

First, in pnfsmanager.properties, there is this snippet for pnfsmanager.db.connections.max:

#  NOTE:  when running resilience embedded here, this number should
#         be increased. The recommended minimum setting would be
#
#               pnfsmanager.resilience.submit-threads
#             + pnfsmanager.resilience.pnfs-op-threads
#             + pnfsmanager.resilience.db.connections.max
#             + whatever maximum allowed for normal namespace settings
#               (default = 30)
#
#       Submit and pnfs threads require 1 database connection, and scan
#       threads need 2.
#
#       Be sure to adjust postgresql.conf max connections to allow
#       for the larger value here, plus the added pool scan
#       connections specified by pnfsmanager.resilience.db.connections.max.
#

What is actually meant here? What is "resilience embedded"? Does it mean the resilience service? And in which sense embedded? Running resilience in some special embedded mode (couldn't find anything about that) or simply having that service within the cluster?

Also, the mentioned pnfsmanager.resilience.* properties do not seem to exist anywhere. And "whatever maximum allowed for normal namespace settings" should probably be clarified as well: e.g. is it just pnfsmanager.db.connections.max, or also the connections that NFS doors open to Chimera? Also, how does it relate to resilience.db.connections.max?

Then there are these:

# ---- File-system-related properties.  These mirror the normal
#      namespace service setup.
#
resilience.plugins.storage-info-extractor=org.dcache.chimera.namespace.ChimeraOsmStorageInfoExtractor
(one-of?ONLINE|NEARLINE)resilience.default-access-latency=NEARLINE
(one-of?CUSTODIAL|REPLICA|OUTPUT)resilience.default-retention-policy=CUSTODIAL
resilience.enable.inherit-file-ownership = false
resilience.enable.full-path-permission-check=true
resilience.enable.acl = false

And to me it's really unclear what they do. First, if they mirror the normal settings, why not take them from there, i.e.: resilience.enable.acl = ${pnfsmanager.enable.acl} and so on?
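
Spelled out for the properties named in this report, such references might look like the following sketch. It only illustrates the proposal and is not the shipped file; it assumes that each pnfsmanager.* counterpart exists under the same suffix, and it leaves out the (one-of?...) annotations.

```properties
# Hypothetical sketch, not the shipped resilience.properties:
# each resilience.* default refers to its pnfsmanager.* counterpart
# instead of repeating a literal value.
resilience.enable.acl = ${pnfsmanager.enable.acl}
resilience.enable.inherit-file-ownership = ${pnfsmanager.enable.inherit-file-ownership}
resilience.default-access-latency = ${pnfsmanager.default-access-latency}
resilience.default-retention-policy = ${pnfsmanager.default-retention-policy}
```

With defaults written this way, a site that changes the pnfsmanager.* values would not have to repeat the change for resilience.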

Is it strictly necessary to keep them in sync with their pnfsmanager.* counterparts? What happens if one doesn't?

E.g. if resilience.enable.acl = false, does that mean that files copied by resilience won't get ACLs? Similarly, does resilience.enable.inherit-file-ownership = false mean that they don't inherit file ownership (even if for normal operations this would be done due to the pnfsmanager.enable.inherit-file-ownership setting)?

And what exactly do resilience.default-access-latency and resilience.default-retention-policy do here? Do they determine which retention policy and access latency the files copied by the resilience service get? Or which files it considers for duplication?

Our site for example doesn't have tape, so I set:

dcache.conf:pnfsmanager.default-retention-policy=REPLICA
dcache.conf:pnfsmanager.default-access-latency=ONLINE

so would I need to set the resilience.* options to that as well?

Cheers, Chris.

btw: Some of the options above lack the (one-of?...) annotation, and those that have it should probably also accept their respective pnfsmanager.* counterparts, i.e. not just (one-of?ONLINE|NEARLINE)resilience.default-access-latency=NEARLINE but: (one-of?ONLINE|NEARLINE|${pnfsmanager.default-access-latency})resilience.default-access-latency=NEARLINE (of course this would no longer be needed if #3309 was fixed ;-) )

alrossi commented 7 years ago

(a) that whole embedded bit is left over from an earlier prototype. In the released version, it is not the case that we allow resilience to run as part of the pnfsmanager. So that whole section just needs to be removed.

(b) the 'mirrored' properties: you are right, these should actually be service-only and should not be subject to reconfiguration here. They should be eliminated from the properties file. Resilience uses a namespace provider but its interactions with it are read-only, so the defaults are fine.

BTW, pnfsmanager.default-retention-policy actually references dcache.default-retention-policy

If you are running resilience, once I make the change, you should probably redefine the global default rather than the pnfsmanager one. It will not really matter that much in the end, because setting that property is necessary but has no bearing on resilience's behavior, which is always to retain the access latency and retention policy of the file being copied.
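
For a tapeless site like the one described above, redefining the global default might look like this sketch of dcache.conf. Only dcache.default-retention-policy is named in this thread; dcache.default-access-latency is an assumed analogue and should be checked against the installed dcache.properties before use.

```properties
# dcache.conf (sketch for a site without tape)
# Global default, which pnfsmanager.default-retention-policy references.
dcache.default-retention-policy=REPLICA
# Assumed property name, by analogy with the retention policy; verify before use.
dcache.default-access-latency=ONLINE
```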

The alteration to the one-of annotation will not be necessary once I remove these properties from the resilience file.

alrossi commented 7 years ago

https://rb.dcache.org/r/10324/ https://rb.dcache.org/r/10325/

QUESTION: should the resilience storage extractor also be immutable and simply reference the dcache property? Is there a use case when the global default would be different from the pnfsmanager setting?