DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

Numbered output datasets in justIN workflows #169

Open Andrew-McNab-UK opened 5 months ago

Andrew-McNab-UK commented 5 months ago

This feature proposal came out of the DM meeting earlier this week and previous discussions at the collaboration meeting. The idea is that to limit the number of files governed by a particular Rucio rule, justIN could limit the size of the output Rucio/MetaCat datasets it creates by creating a series of numbered datasets. The need for a similar mechanism for output to Fermilab dCache scratch has emerged this week, due to limitations on the handling of directories with thousands of files.

Currently in justIN 01.01, outputs are defined by an option similar to --output-pattern something-*.root:NAME or --output-pattern something-*.root:https://fndcadoor.fnal.gov:2880/dune/scratch/users/amcnab, where the first colon separates the pattern used to find the files on disk on the worker node from the destination Rucio/MetaCat dataset or WebDAV URL. If the Rucio dataset does not exist, it is created by justIN. If no destination is given, justIN constructs a dataset name of the form wXsYpZ, where X is the workflow ID, Y is the stage ID (usually 1), and Z is the pattern ID counting from 1 (to allow for multiple --output-pattern options).
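The option parsing described above can be sketched as follows. This is an illustrative sketch, not justIN's actual code; the function name and the fallback behaviour for a missing destination are assumptions based on the description.

```python
# Hypothetical sketch of parsing a justIN-style --output-pattern value.
# Splitting on the FIRST colon works even when the destination is a WebDAV
# URL containing further colons (e.g. the :2880 port number).

def parse_output_pattern(option, workflow_id, stage_id, pattern_id):
    """Split 'PATTERN[:DESTINATION]' at the first colon; fall back to the
    default wXsYpZ dataset name when no destination is given."""
    pattern, sep, destination = option.partition(":")
    if not sep:
        destination = f"w{workflow_id}s{stage_id}p{pattern_id}"
    return pattern, destination

assert parse_output_pattern("something-*.root:NAME", 1234, 1, 1) == \
    ("something-*.root", "NAME")
assert parse_output_pattern("something-*.root", 1234, 1, 1) == \
    ("something-*.root", "w1234s1p1")
```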

The proposal is that in the next release, justIN will take responsibility for creating all output datasets. If no destination is given, a series of Rucio/MetaCat datasets of the form wXsYpZn0001, wXsYpZn0002, ... will be created and used in turn as needed. These datasets will be created during the lifetime of the workflow, with justIN observing when another one will soon be needed. The size of these numbered datasets will be a global parameter set in justIN's configuration.
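The numbering scheme above amounts to integer division of the running output count by the configured dataset size. A minimal sketch, assuming 4-digit zero padding and a hypothetical files_per_dataset configuration parameter (neither is confirmed by the proposal text):

```python
# Illustrative sketch, not justIN's actual implementation, of naming the
# series of numbered output datasets wXsYpZn0001, wXsYpZn0002, ...

def numbered_dataset(workflow_id, stage_id, pattern_id, file_index,
                     files_per_dataset=1000):
    """Return the wXsYpZnNNNN dataset name for the file_index-th output
    (counting from 0) of a given workflow/stage/pattern."""
    n = file_index // files_per_dataset + 1   # dataset numbers count from 1
    return f"w{workflow_id}s{stage_id}p{pattern_id}n{n:04d}"

assert numbered_dataset(1234, 1, 1, 0) == "w1234s1p1n0001"
assert numbered_dataset(1234, 1, 1, 999) == "w1234s1p1n0001"
assert numbered_dataset(1234, 1, 1, 1000) == "w1234s1p1n0002"
```

The same lookup is what an allocator-side service would need: given the number of files already output for a pattern, it can compute which numbered dataset the next batch belongs in.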

However, if a Rucio dataset destination is given, justIN will use it as a prefix for a string with the same structure. For example, --output-pattern something-*.root:NAME will lead to datasets with names like NAME-wXsYpZn0001, NAME-wXsYpZn0002, ...

If the output destination is given as a WebDAV URL, directories of the form https://fndcadoor.fnal.gov:2880/dune/scratch/users/amcnab/X/Y/Z/N will be created, where X is the workflow ID, Y is the stage ID (usually 1), Z is the pattern ID counting from 1, and N is a subdirectory number counting from 1. This ensures that if different patterns give the same WebDAV URL prefix, the files are still written into different numbered subdirectories, keeping directory sizes within the global limits set in justIN's configuration.
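The WebDAV case can be sketched the same way. This is an assumption-laden illustration (the parameter names and the per-directory limit are not specified by the proposal), showing only how the /X/Y/Z/N path components would be derived:

```python
# Hypothetical sketch of building the numbered WebDAV output directory
# .../X/Y/Z/N, where N advances whenever the current subdirectory reaches
# the configured size (files_per_dir is an assumed parameter name).

def webdav_output_dir(url_prefix, workflow_id, stage_id, pattern_id,
                      files_already_output, files_per_dir=1000):
    n = files_already_output // files_per_dir + 1
    return f"{url_prefix}/{workflow_id}/{stage_id}/{pattern_id}/{n}"

base = "https://fndcadoor.fnal.gov:2880/dune/scratch/users/amcnab"
assert webdav_output_dir(base, 1234, 1, 1, 0) == base + "/1234/1/1/1"
assert webdav_output_dir(base, 1234, 1, 1, 1000) == base + "/1234/1/1/2"
```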

The full name of the numbered output dataset or numbered output WebDAV URL is needed by the justIN wrapper jobs during the outputting phase after the user's jobscript has run and created the output files on the worker node local disk. The wrapper job sends a message to the justIN Allocator service with the names of the output files found for each output pattern. The allocator replies with the output Rucio dataset or WebDAV URL prefix to use. In this proposal, those names will take the numbering of subsets into account and tell the wrapper job to output into the correct subset. This will be calculated by the Allocator by looking at the number of outputs already successfully output for that pattern.

StevenCTimm commented 5 months ago

Comments as follows:

1) I renew the request from justIN 01.01 to have some way to at least programmatically discover these dataset names, short of grepping all Rucio rules. It is already an issue to find the existing ones, and if they are further subdivided it will be more so. I am not sure if this would or could be done with a justIN API call, but the idea would be, for a given justIN workflow, to know all the (top level) dataset names that it created and whether they are yet complete.

2) All of the above work, although necessary, may not be sufficient. In particular, tape paths currently don't have any way to protect against more than 10K files in a run. This is not justIN's problem, but it does have to be addressed on the Rucio side.

Andrew-McNab-UK commented 5 months ago

For 1, the datasets are now always going to have names that match 'wXsYp' where X is the workflow ID and Y is the stage ID. So something like rucio list-dids --filter type=dataset 'SCOPE:*wXsYp*' should find them all. (You could remove any false positives by piping the output through grep too, in case people are manually reusing the ...wXsYpZ... names.)
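The false-positive filtering suggested above (piping through grep) can also be done with a regular expression. A minimal sketch, where the scope name, workflow ID, and stage ID are examples only:

```python
import re

# Keep only DIDs whose name ends with the wXsYpZ[nNNNN] structure for
# workflow 1234, stage 1, rejecting manual reuses of similar names.
name_re = re.compile(r"w1234s1p\d+(n\d+)?$")

dids = [
    "usertests:w1234s1p1n0001",       # a numbered output dataset
    "usertests:NAME-w1234s1p1n0002",  # prefixed form, still matches
    "usertests:w1234s1p-notes",       # manual reuse of the name, rejected
]
matches = [d for d in dids if name_re.search(d.split(":", 1)[1])]
assert matches == dids[:2]
```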

Alternatively, the datasets all have MetaCat metadata too giving the workflow_id they are from. So that's another way to find them.

I'm quite reluctant to provide a way of getting this state from justIN rather than making sure it's discoverable from what we permanently store in Rucio and MetaCat.

StevenCTimm commented 5 months ago

As discussed in today's meeting, it might be wise to have justIN tag the MetaCat dataset with some kind of flag when the processing phase is done.

StevenCTimm commented 1 week ago

Looks like this feature is implemented now? At least the number-split Rucio datasets are working; not sure about the scratch directories.