CMSCompOps / WmAgentScripts

CMS Workflow Team Scripts

Reduce EOS utilization in Unified #352

Open drkovalskyi opened 5 years ago

drkovalskyi commented 5 years ago

EOS is used in many places in Unified, both as input and output. Unfortunately, even with the FUSE mount, EOS cannot be used as a proper file system. CERN IT aims to reach this goal and use it as a replacement for AFS, but we are not there yet. So we have to treat EOS as mass storage with unavoidable delays and interruptions. The current setup causes many operational issues. To avoid them, we need to make sure that the core of Unified doesn’t depend on access to EOS and uses local space, AFS and CephFS instead (CephFS is shared between all Unified servers and the current quota is 10TB). We can address the problem in multiple ways:

  1. Remove EOS from Unified wherever it’s simple to do. Example: cWrap.sh - make it use only local logs and rely on dedicated cron jobs to rsync logs to EOS (not a critical service).
  2. Reduce EOS utilization in each component one by one. As soon as a component can function without EOS, make EOS non-mandatory in the constructor of componentInfo used in that component.
  3. Move status files to a shared file system where needed and remove utils.eosRead and utils.eosFile.
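To make item 1 concrete, here is a minimal Python sketch of the split between "write logs locally" and "sync to EOS from a separate, non-critical job". It is only an illustration under assumptions: `write_local_log`, `build_sync_command` and the directory paths are hypothetical names, not existing Unified code, and the rsync options shown (`-a`, `--timeout`) are one possible choice.

```python
from pathlib import Path


def write_local_log(log_dir, name, text):
    """Write a log file to local disk only; this step never touches EOS."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / name
    path.write_text(text)
    return path


def build_sync_command(local_dir, eos_dir):
    """Build the rsync argv that a dedicated cron job could run later.

    --timeout makes rsync give up on stalled I/O instead of hanging
    forever on a wedged mount. The destination is a hypothetical EOS path.
    """
    return ["rsync", "-a", "--timeout=60", f"{Path(local_dir)}/", f"{eos_dir}/"]
```

The component itself would only call `write_local_log`; the sync job would pass the result of `build_sync_command` to `subprocess.run`, so an EOS outage delays log publication but does not block the component.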

@bbockelm, @vlimant, @amaltaro any feedback is welcome.

amaltaro commented 5 years ago

Dima, I think it would be beneficial to first identify the following, such that you/we can choose the best approach here. Questions like:

BTW, what are those "status files" that you mentioned in point 3?

drkovalskyi commented 5 years ago

I'm collecting the cases. It will take time. EOS over http or xrootd is certainly an option to consider.

bbockelm commented 5 years ago

@drkovalskyi -

A few thoughts come to mind:

  1. We probably need to distinguish "making bulk logs available" from cases where daemons want to share data (i.e., really need a shared filesystem).
  2. Have a configurable strategy for each. For example, we might want to copy to EOS-fuse via rsync today and decide to switch over to xrdcp or curl (or "the next great thing") in the future. We definitely want to be able to configure the shared filesystem prefix in case CephFS crashes and burns and we have to revert to AFS.
  3. Do we have the appropriate file locking to make sure multiple cronjobs don't pile up? On the grid side, we've been trying to abandon cronjobs en masse because systemd services (even user-level ones) allow better scheduling of "cronjobs". I think we all hate logging in to a server that has load 1,000 due to a runaway cronjob copying from a wedged filesystem...
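As a rough illustration of points 2 and 3, the Python sketch below shows a configurable shared-filesystem prefix and a non-blocking lock that lets a second copy of a job exit instead of piling up. Everything named here is an assumption for the example: the `UNIFIED_SHARED_PREFIX` environment variable, the default path, and the helper names `shared_path` and `single_instance` do not come from Unified. The lock uses `fcntl.flock`, so it applies to Unix hosts; whether it behaves reliably across servers on a given shared filesystem would need to be checked.

```python
import fcntl
import os
from contextlib import contextmanager


def shared_path(*parts, env=os.environ):
    """Join parts under a configurable shared-filesystem prefix.

    UNIFIED_SHARED_PREFIX is a hypothetical variable; switching from
    CephFS to AFS would then be a configuration change, not a code change.
    """
    prefix = env.get("UNIFIED_SHARED_PREFIX", "/tmp/unified-shared")
    return os.path.join(prefix, *parts)


@contextmanager
def single_instance(lock_path):
    """Yield True if this process got the exclusive lock, else False."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB: fail immediately instead of queueing behind a stuck job.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        os.close(fd)  # closing the descriptor releases the lock if held
```

A cron-style job would wrap its work in `with single_instance(lock_file) as ok:` and return right away when `ok` is false, so a wedged filesystem leaves at most one stuck process rather than hundreds.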