DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

Consolidated datasets for justIN automated log.tgz files #137

Closed by Andrew-McNab-UK 3 weeks ago

Andrew-McNab-UK commented 4 months ago

Request

justIN currently creates a MetaCat and Rucio dataset justin-logs:workflow_NNNN for each workflow, with a Rucio rule to put the contents on DUNE_US_FNAL_DISK_STAGE and expire after 7 days. At the end of each job, a tgz file is made containing all the files ending in .log found in the user jobscript's working directory, including jobscript.log.
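
The end-of-job log bundling described above can be sketched as follows. This is an illustration using the Python standard library, not justIN's actual code; the function name and directory layout are assumptions.

```python
# Sketch of the end-of-job step: collect every *.log file found in the
# jobscript's working directory into a single gzipped tarball (logs.tgz).
import glob
import os
import tarfile

def bundle_logs(workdir: str, out_path: str) -> list[str]:
    """Pack all *.log files under workdir into a gzipped tarball,
    storing paths relative to the working directory."""
    logs = sorted(glob.glob(os.path.join(workdir, "**", "*.log"),
                            recursive=True))
    with tarfile.open(out_path, "w:gz") as tgz:
        for path in logs:
            tgz.add(path, arcname=os.path.relpath(path, workdir))
    return logs
```

Only files matching *.log are picked up, so jobscript.log is included automatically while other outputs in the working directory are left out of the bundle.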

Originally, the default rule attempted to leave each file on the nearby RSE to which it was initially uploaded. As the vast majority of these files are never looked at, keeping them where they land saves a lot of unnecessary transfers and occupation of space at Fermilab.

This feature will add a MetaCat/Rucio dataset for each RSE, with a rule that keeps the files on that RSE. justIN will add each tgz file to the appropriate dataset depending on which RSE is being used. To limit the resulting increase in the number of datasets, a single dataset will be created per time period per RSE and shared by all workflows with jobs that upload to that RSE. It has also been suggested that the lifetime be increased to about a month, since users will be reluctant to use Rucio directly to access job log files.
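
The per-RSE, per-time-period dataset scheme above might be expressed as a naming function like the following. The scope and name pattern are assumptions for illustration, not justIN's actual scheme, and a calendar month is assumed as the time period.

```python
# Hypothetical naming for the shared per-RSE log datasets: one dataset
# per RSE per calendar month, used by all workflows uploading there.
import datetime

def logs_dataset(rse: str, when: datetime.date) -> str:
    """Return an illustrative MetaCat/Rucio DID for log tarballs
    uploaded to a given RSE in a given month."""
    period = when.strftime("%Y-%m")   # one dataset per calendar month
    return f"justin-logs:logs_{rse}_{period}"
```

For example, `logs_dataset("DUNE_US_FNAL_DISK_STAGE", datetime.date(2024, 6, 15))` returns `justin-logs:logs_DUNE_US_FNAL_DISK_STAGE_2024-06`, so all jobs uploading to that RSE in June share one dataset.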

Implementation

StevenCTimm commented 3 months ago

Technically this seems fine and should lead to less log traffic in FTS3. The only question is: what is the fallback mechanism if the initial "nearby" RSE is full or unavailable? Does it fall over to the others in the list?

Also, I don't think there has been a full sign-off from the consortium saying that log files in general should be consigned to Rucio. This is a significant interface change from being able to pull them up in a web browser, which users can currently do. If the consortium signs off, then this is a good technical way to implement the above plan.

Andrew-McNab-UK commented 3 months ago

Currently the wrapper job tries up to 3 RSEs in case of failures, including an RSE being full. The initial list of RSEs to try is also meant to exclude full ones, based on the information from Rucio.
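
The retry behaviour described above can be sketched as a simple fallback loop. The function and its `upload` callable are stand-ins for the real wrapper-job logic, which this does not claim to reproduce.

```python
# Sketch of the wrapper job's fallback: try up to three candidate RSEs
# in order, moving on to the next if an upload fails for any reason
# (e.g. the RSE is full or unavailable).
def upload_with_fallback(path, rses, upload, max_attempts=3):
    """Return the RSE that accepted the file, or None if all
    attempted RSEs failed."""
    for rse in rses[:max_attempts]:
        try:
            upload(path, rse)
            return rse
        except Exception:
            continue   # fall over to the next RSE in the list
    return None
```

Combined with an initial candidate list that already excludes RSEs known to be full, this bounds the cost of a bad first choice to at most three attempts.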

For retrieving the log file tgz files, in the future we could provide a browser for them as part of the dashboard in addition to the command line option I mentioned.

Only a small fraction of them are actually viewed. So the dashboard could fetch the tgz file with the justIN dunepro identity using the Rucio client API, cache it on the web server machine, give the user a directory listing from within the tgz file, and stream an individual file out of the tgz to their browser if they click on it.
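
The list-then-stream part of that idea can be sketched with the standard library, once the tgz file has been fetched and cached locally. This is an illustration, not justIN code; the fetching and caching steps are omitted.

```python
# Sketch of serving a cached logs.tgz from the dashboard: list the
# members for the directory view, and pull a single member out of the
# archive on demand without unpacking the whole thing to disk.
import tarfile

def list_logs(tgz_path: str) -> list[str]:
    """Return the member names inside a cached logs.tgz."""
    with tarfile.open(tgz_path, "r:gz") as tgz:
        return tgz.getnames()

def read_log(tgz_path: str, member: str) -> bytes:
    """Extract one named file from the tarball into memory, ready to
    be streamed to the user's browser."""
    with tarfile.open(tgz_path, "r:gz") as tgz:
        f = tgz.extractfile(member)
        return f.read() if f else b""
```

Because only the requested member is extracted, the web server never needs to unpack the whole archive for a user who looks at a single log file.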

It should also be straightforward to increase the number of dashboard webservers if we need to scale this up in the future. This would be a significant improvement over the current jobsub web view of stdout/stderr, since the tgz file includes any *.log files the job creates.

Andrew-McNab-UK commented 3 months ago

The main branch has an implementation of the per-RSE datasets for logs, along with two ways of retrieving them: a standalone command justin-fetch-logs, which gets the logs.tgz for a job using a Rucio account and an X.509 proxy (and optionally unpacks it); and an updated justin command with a new fetch-logs subcommand, which does the same without needing a Rucio account. The downside of the latter is that the transfer is done by the central justIN services, which perform the Rucio download and then pass the logs.tgz file on to the user.

This version is deployed on the integration instance.

Andrew-McNab-UK commented 3 weeks ago

This is included in 01.01, currently in production.