DIRACGrid / DIRAC


[8.0] Change TS addFile logic for Deleted files #7758

Closed (arrabito closed 2 weeks ago)

arrabito commented 1 month ago

In the current 8.0 release, when a file that was attached to a transformation is removed from the DFC, it is set to 'Deleted' status in the TransformationFile table. If a new file with the same LFN as the removed one is later added again to the DFC and to the TSCatalog, the current implementation of the addFile method of the TS does not update the status of this file. As a consequence, the newly produced file stays in 'Deleted' status and is never processed by the transformation. The same is true for the setMetadata method of the TS.

With this PR we add a check in the addFile and setMetadata methods for the 'Deleted' status of files already attached to the transformation: if the file must be attached to the transformation, its status is changed from 'Deleted' to 'Unused'. For coherence we also update the status in the DataFile table from 'Deleted' to 'New'.
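
For illustration, the intended behaviour is roughly the following. This is a self-contained toy sketch in Python, not the actual TransformationDB code: the dictionaries only stand in for the TransformationFile and DataFile tables, and the function and LFN names are made up.

transformation_files = {(1, "/vo/prod/outfile_a"): "Deleted"}  # (transformationID, LFN) -> Status, as in the TransformationFile table
data_files = {"/vo/prod/outfile_a": "Deleted"}                 # LFN -> Status, as in the DataFile table


def add_file(trans_id, lfn):
    """Attach an LFN to a transformation, reviving it if it was previously Deleted."""
    current = transformation_files.get((trans_id, lfn))
    if current == "Deleted":
        # The LFN was removed and is now back in the DFC: make it eligible again
        transformation_files[(trans_id, lfn)] = "Unused"
        data_files[lfn] = "New"
    elif current is None:
        # Not yet attached: attach it as a fresh, unprocessed file
        transformation_files[(trans_id, lfn)] = "Unused"
        data_files.setdefault(lfn, "New")


add_file(1, "/vo/prod/outfile_a")
print(transformation_files)  # {(1, '/vo/prod/outfile_a'): 'Unused'}
print(data_files)            # {'/vo/prod/outfile_a': 'New'}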

BEGINRELEASENOTES

*TransformationSystem CHANGE: Check if file previously DELETED is back in DFC

ENDRELEASENOTES

fstagni commented 1 month ago

I have never thought a file could be "re-added" to the catalog.

This check is somewhat "expensive" and I am wondering if there's a better way.

arrabito commented 1 month ago

I have never thought a file could be "re-added" to the catalog.

In some cases we have jobs that failed but which succeeded in uploading part of the output files. When this happens we simply set all the input files of the job to Unused and a new job is created to process those files again. So when the new job tries to upload its output it will find some existing LFN. We have thus just instructed the job to remove the existing file and upload the new one. We should probably improve this logic, which is not optimal, but the scenario of re-adding a file could happen anyway since it's not forbidden.

This check is somewhat "expensive" and I am wondering if there's a better way.

I haven't thought of a better implementation, but what about just removing the file from the transformations as well when it's removed from the catalog, instead of changing its status to 'Deleted'?

fstagni commented 1 month ago

In some cases we have jobs that failed but which succeeded in uploading part of the output files. When this happens we simply set all the input files of the job to Unused and a new job is created to process those files again.

This, I would argue, is not the best approach. The better way would be to set requests; this is done for you by https://github.com/DIRACGrid/DIRAC/blob/f18be6f9d27f5055512f2c372718734270a060ae/src/DIRAC/DataManagementSystem/Client/FailoverTransfer.py#L166 which was created exactly for this purpose.

So when the new job tries to upload its output it will find some existing LFN. We have thus just instructed the job to remove the existing file and upload the new one. We should probably improve this logic, which is not optimal, but the scenario of re-adding a file could happen anyway since it's not forbidden.

The scenario is not forbidden but still feels "odd".

This check is somewhat "expensive" and I am wondering if there's a better way.

I haven't thought of a better implementation, but what about just removing the file from the transformations as well when it's removed from the catalog, instead of changing its status to 'Deleted'?

Might be an option, but I think using the FailoverTransfer method is less error-prone.

arrabito commented 1 month ago

In some cases we have jobs that failed but which succeeded in uploading part of the output files. When this happens we simply set all the input files of the job to Unused and a new job is created to process those files again.

This, I would argue, is not the best approach. The better way would be to set requests; this is done for you by

https://github.com/DIRACGrid/DIRAC/blob/f18be6f9d27f5055512f2c372718734270a060ae/src/DIRAC/DataManagementSystem/Client/FailoverTransfer.py#L166 which was created exactly for this purpose.

I'm not sure that we are talking about exactly the same scenario. Let me give an example. We have job_1, which processes the files infile_a and infile_b and which should produce outfile_a and outfile_b. For some reason (e.g. a bug in the application) the job is able to process infile_a but not infile_b, so it uploads only outfile_a and gets 'Failed'. Since for our jobs we use the FailoverRequest module, e.g.:

        job.setExecutable("/bin/ls -l", modulesList=["Script", "FailoverRequest"])

this takes care of setting all input files of failed jobs to Unused, in this case infile_a and infile_b.

The new job will then process both infile_a and infile_b again, which results in the scenario I've described.

What do you suggest in this case?

fstagni commented 1 month ago

If the job failed (because it did not manage to process all input files), why is that same job uploading the (partial) outputs? It shouldn't even try the first upload.

arrabito commented 1 month ago

If the job failed (because it did not manage to process all input files), why is that same job uploading the (partial) outputs? It shouldn't even try the first upload.

It's because we have implemented the upload of the outputs as an additional step inside the job, e.g.:

job.setExecutable('app_1')
job.setExecutable('upload_outs')

However you are right: if 'app_1' fails we could, for instance, instruct app_1 to remove the partially produced files so that 'upload_outs' has no files to upload. I also take the opportunity to ask you a fundamental question about the DIRAC job execution logic.

Why, if the 'app_1' step fails, is the 'upload_outs' step executed anyway?

fstagni commented 1 month ago

Why, if the 'app_1' step fails, is the 'upload_outs' step executed anyway?

Within the current workflow system there is no way to prevent a step from being run. This means that every workflow module's code should start with

https://github.com/DIRACGrid/DIRAC/blob/f18be6f9d27f5055512f2c372718734270a060ae/src/DIRAC/Workflow/Modules/ModuleBase.py#L381
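
For illustration, a custom module's execute() could start like this. This is a minimal sketch, not actual DIRAC code: the class name, docstring and return values are assumptions; only ModuleBase, _checkWFAndStepStatus and S_OK are taken from DIRAC.

from DIRAC import S_OK
from DIRAC.Workflow.Modules.ModuleBase import ModuleBase


class UploadOutputs(ModuleBase):
    """Hypothetical custom workflow module that skips its work after a failure."""

    def execute(self):
        # Every module of every step is run regardless of earlier failures, so the
        # module itself has to check the workflow/step status and bail out early.
        if not self._checkWFAndStepStatus():
            return S_OK("Skipping step: the workflow or a previous step failed")
        # ... do the actual work of the step (e.g. upload the outputs) here ...
        return S_OK()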

arrabito commented 3 weeks ago

Why, if the 'app_1' step fails, is the 'upload_outs' step executed anyway?

Within the current workflow system there is no way to prevent a step from being run. This means that every workflow module's code should start with

https://github.com/DIRACGrid/DIRAC/blob/f18be6f9d27f5055512f2c372718734270a060ae/src/DIRAC/Workflow/Modules/ModuleBase.py#L381

I see. We don't have custom modules; we just use Script and FailoverRequest in this way:

job.setExecutable("app1")
job.setExecutable("upload_outs")
job.setExecutable("ls", modulesList=["Script", "FailoverRequest"])

So from your answer I understand that we should rather create custom modules for each step and use _checkWFAndStepStatus in each of them.

I'm going to try that.
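
For illustration, the job-side wiring could then look something like this (a sketch only: 'UploadOutputs' is the hypothetical custom module sketched in the previous comment, and it would have to be installed where the workflow machinery can find it):

# Hypothetical: the upload step uses a custom module whose execute() starts with
# self._checkWFAndStepStatus(), instead of the plain Script module, so it does
# nothing when 'app_1' has already failed.
job.setExecutable("app_1")
job.setExecutable("upload_outs", modulesList=["UploadOutputs"])
job.setExecutable("ls", modulesList=["Script", "FailoverRequest"])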

fstagni commented 2 weeks ago

You can fix the existing modules if there are errors in them; creating custom ones is possible, but a slight pain.

arrabito commented 2 weeks ago

Hi @fstagni, I've tried to play with custom modules and indeed I think it's the way to go for us. We will change our job logic, and hopefully this scenario shouldn't happen anymore, or only very rarely. So I'll let you decide whether you want to merge this PR or not. Anyway, just one last question: why is the use of:

def _checkWFAndStepStatus(self, noPrint=False):

not the default? I mean, also in the Script module?

Thank you.

fstagni commented 2 weeks ago

If you do not need this PR anymore, I would prefer to not merge it.

The use of _checkWFAndStepStatus is not the default for historical reasons, but also because sometimes you just need to run all the modules.

arrabito commented 2 weeks ago

ok fine for me. Thank you.