aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Engine: Fix bug introduced when refactoring `upload_calculation` #6348

Closed sphuber closed 4 months ago

sphuber commented 5 months ago

In 6898ff4d8c263cf08707c61411a005f6a7f731dd the processing of the local_copy_list in the upload_calculation method was changed. Originally, the files specified by the local_copy_list were first copied into the SandboxFolder, whose contents were then copied to the working directory using the transport. The commit decoupled the copying of the local and sandbox files, so the local files were no longer copied through the sandbox. Rather, they were copied to a temporary directory on disk, which was then copied over using the transport. The problem is that if the latter copied over a directory that had already been created by the copying of the sandbox, an exception would be raised.

For example, if the sandbox contained the directory folder and the local_copy_list contained the item (_, 'file', './folder/file'), this would work just fine in the original implementation, as the file would be written inside folder in the remote working directory. The current implementation, however, would write the file content to folder/file in a local temporary directory, and then iterate over the top-level entries and copy them over. Essentially it would be doing:

Transport.put('/tmpdir/folder', 'folder')

but since folder would already exist in the remote working directory, the local folder would be nested, and so the final path on the remote would be /workingdir/folder/folder/file.

The correct approach is to copy each item of the local_copy_list from the local temporary directory individually using the Transport.put interface and not iterate over the top-level entries of the temporary directory at the end.
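
The per-item approach described above can be sketched as follows. This is a hypothetical illustration, not the actual aiida-core code: `put` stands in for `Transport.put` (emulated here with shutil), and the in-memory `contents` dict stands in for streaming files out of AiiDA's repository.

```python
import pathlib
import shutil
import tempfile


def put(localpath: str, remotepath: pathlib.Path) -> None:
    """Stand-in for Transport.put for a single file."""
    remotepath.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(localpath, remotepath)


def copy_local_files(local_copy_list, contents: dict, workdir: pathlib.Path) -> None:
    """Copy each (uuid, source, target) item through the transport one by one,
    merging into directories that may already exist in ``workdir``."""
    for _, _, target in local_copy_list:
        with tempfile.TemporaryDirectory() as tmpdir:
            filepath = pathlib.Path(tmpdir) / target
            filepath.parent.mkdir(parents=True, exist_ok=True)
            # Stands in for streaming the file content from the repository.
            filepath.write_text(contents[target])
            # One put per instruction: an existing remote directory is merged
            # into, rather than nested.
            put(str(filepath), workdir / target)
```

Because each file is put at its final target path individually, a remote directory already created by the sandbox copy is simply reused instead of producing the folder/folder nesting described above.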

sphuber commented 5 months ago

Note that this is a critical bug that is currently on main, as it breaks things as simple as running a PwCalculation from aiida-quantumespresso, so this has to be fixed before release.

mbercx commented 5 months ago

Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the pseudo folder would be copied inside an already existing one, causing QE to crash:

$ tree
.
├── CRASH
├── _aiidasubmit.sh
├── _scheduler-stderr.txt
├── _scheduler-stdout.txt
├── aiida.in
├── aiida.out
├── out
└── pseudo
    └── pseudo
        └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF

Can confirm that the changes in this PR fix that issue.

sphuber commented 5 months ago

Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the pseudo folder would be copied inside an already existing one, causing QE to crash:

So he is running off the main branch, correct? Because that bug should not be present in an actual release. Can we merge this PR then? Or do you have doubts about the changes?

mbercx commented 5 months ago

So he is running off the main branch, correct?

Yes, it seems so, and reverting to v2.5.1 fixed the issue.

Can we merge this PR then? Or do you have doubts about the changes?

I still want to have a proper look at the code, so I can also make sure I understand the changes in https://github.com/aiidateam/aiida-core/commit/6898ff4d8c263cf08707c61411a005f6a7f731dd. Should have time for this tomorrow or Friday.

mbercx commented 5 months ago

Looking into this some more, won't the following line cause similar woes?

https://github.com/sphuber/aiida-core/blob/20fc81065731a05eb5ee1da95976bda504d7f67d/src/aiida/engine/daemon/execmanager.py#L293

I was playing around with the following code:

import os
import pathlib
import shutil
from tempfile import TemporaryDirectory
from aiida import orm, load_profile

load_profile()

localhost = orm.load_computer('localhost')

remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'

folder_data = orm.FolderData(tree=pseudo_path)

shutil.rmtree(remote_workdir, ignore_errors=True)

def copy_local(transport, folder_data):
    with TemporaryDirectory() as tmpdir:

        dirpath = pathlib.Path(tmpdir)

        data_node = folder_data

        filepath_target = (dirpath / 'pseudo').resolve().absolute()
        filepath_target.parent.mkdir(parents=True, exist_ok=True)

        data_node.base.repository.copy_tree(filepath_target, 'pseudo')

        transport.put(f'{dirpath}/*', transport.getcwd())

with localhost.get_transport() as transport:

    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)

    copy_local(transport, folder_data)
    transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo')

The code above will give the following directory tree:

.
├── data
│   └── pseudo
│       ├── Ba.upf
│       └── Si.upf
├── sandbox.ipynb
└── workdir
    └── pseudo
        ├── Ba.upf
        ├── Si.upf
        └── pseudo
            ├── Ba.upf
            └── Si.upf

But switching the order of copy_local and transport.copy will not result in the nested pseudo folder.

sphuber commented 5 months ago

But switching the order of copy_local and transport.copy will not result in the nested pseudo folder.

Sure, but that is because you are calling the following

transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo')

And that is saying: copy the source pseudo directory inside the target pseudo directory. It is not saying: update the contents of the target pseudo dir with the contents of the source pseudo. What you are trying to do is:

transport.copy(os.path.join(pseudo_path, 'pseudo/*'), 'pseudo')

So you are globbing the contents of pseudo and copying that into the target pseudo. If you change this line, then your script works as expected. This is also exactly the change I added in this PR for the local copy.
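
The difference between the two calls can be illustrated without aiida-core, emulating the transport semantics with shutil (the helper `cp_into` is illustrative only): copying a directory into an existing directory of the same name nests it, while globbing its contents merges them.

```python
import glob
import pathlib
import shutil


def cp_into(source: str, destdir: pathlib.Path) -> None:
    """Copy everything matching ``source`` into ``destdir``, like ``cp -r``
    with an existing destination directory."""
    for path in glob.glob(source):
        src = pathlib.Path(path)
        if src.is_dir():
            # An existing sub-directory of the same name would be merged into.
            shutil.copytree(src, destdir / src.name, dirs_exist_ok=True)
        else:
            shutil.copy(src, destdir / src.name)
```

With `cp_into('.../pseudo', dst/pseudo)` the directory itself is copied, producing `dst/pseudo/pseudo/...`; with `cp_into('.../pseudo/*', dst/pseudo)` only the contents are copied, producing `dst/pseudo/...`.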

So I don't think there is a regression in the behavior of transport.copy as it is used for the remote_copy_list is there?

mbercx commented 5 months ago

So I don't think there is a regression in the behavior of transport.copy as it is used for the remote_copy_list is there?

I don't think there is a regression, I was just wondering if we should make a similar change for remote_copy_list, since now the copying behavior for the two is different when the folder is already present due to a previous copying operation.

I rewrote the example to rely on the functions in the execmanager to see if I can break things (as I like to do):

from logging import LoggerAdapter
import shutil

from aiida import orm, load_profile
from aiida.common import AIIDA_LOGGER
from aiida.engine.daemon.execmanager import _copy_remote_files, _copy_local_files

load_profile()

random_calc_job = orm.load_node(36)
logger = LoggerAdapter(logger=AIIDA_LOGGER.getChild('execmanager'))

localhost = orm.load_computer('localhost')

remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'

shutil.rmtree(remote_workdir, ignore_errors=True)

folder_data = orm.FolderData(tree=pseudo_path)
folder_data.store()

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

with localhost.get_transport() as transport:

    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)

    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])
    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())

Critical of course are the local_copy_list_item and remote_copy_list_item definitions:

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

Here, all is well, since I use the glob * to copy the contents of /Users/mbercx/project/core/jupyter/data/pseudo, as you noted. Removing the glob results in the nested pseudo dirs, but perhaps this is expected.

If we remove the glob, and invert the _copy_local_files and _copy_remote_files operations (as we currently allow):

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo', 'pseudo')

with localhost.get_transport() as transport:

    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)

    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())
    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])

the behavior is different, i.e. there is no nested pseudo folder. I guess now my question is if this difference in behavior is problematic.

unkcpz commented 5 months ago

Since _copy_remote_files and _copy_local_files are not methods for public use, maybe it is not an issue that their behaviors are not parallel with each other?

sphuber commented 5 months ago

Since _copy_remote_files and _copy_local_files are not methods for public use, maybe it is not an issue that their behaviors are not parallel with each other?

Sure, but they are actually used by the user, albeit indirectly. The local_copy_list and remote_copy_list instructions are passed directly to these functions. If they now require different instructions to achieve the same behavior, I can see how that would be confusing and inconsistent, for example if for local_copy_list you can/have to use relative_source/* while for remote_copy_list you don't use the glob. So I agree with @mbercx that we should look at making this consistent.

I need to try and find some time to try the example scripts on the latest release, to see what the original behavior was. I think the current changes on main only change the behavior of local_copy_list which is why I had to change it. But remote_copy_list should not have changed, so I think the highlighted problem of nesting should be present in v2.5.1 as well.

sphuber commented 5 months ago

TLDR: The solution of this PR is wrong. Not so much for the discrepancy in behavior of the local and remote copy lists, but really because the implementation of _copy_local_files does not directly respect the copying instructions.

This is a tricky one. The first question is what the behavior of Transport.put and Transport.copy (intuitively) should be. If you copy dir_a to dir_a, what should happen if a) dir_a already exists in the target or b) dir_a does not exist? If we follow the behavior of cp, there should be a difference between the two cases. If the target dir_a already exists, the source dir_a will be copied inside the existing directory, i.e., it will be nested. This is the current behavior of Transport.put and Transport.copy, so it can be argued that that is the expected behavior.
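
The two cases of this cp-like rule can be sketched with shutil (the helper `cp_r` is illustrative, not the Transport code): if the target exists, the source directory is nested inside it; otherwise the target is created as a copy of the source.

```python
import pathlib
import shutil


def cp_r(source: pathlib.Path, dest: pathlib.Path) -> None:
    """Emulate the directory semantics of `cp -r source dest`."""
    if dest.is_dir():
        # case a: target exists, so the source is nested inside it
        shutil.copytree(source, dest / source.name)
    else:
        # case b: target does not exist, so it becomes a copy of the source
        shutil.copytree(source, dest)
```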

The problem is really due to a detail of the original implementation of upload_calculation. There are three sources of input files that are provided by the calcjob plugin that are copied to the remote working directory:

  1. The sandbox folder
  2. The local_copy_list
  3. The remote_copy_list

The implementation did not literally copy the contents of each source sequentially to the working directory. Rather, it would first copy the files specified by the local_copy_list into the sandbox folder, and then copy the sandbox folder wholesale to the working directory. Critically, it did not use the Transport interface for the first operation, but simply used normal Python file operations, since the sandbox sits on the local file system. The remote_copy_list was done through the transport interface directly to the working directory.

In the new implementation this was changed: the three copying steps each copy directly to the remote working directory, and the local_copy_list contents are no longer merged into the sandbox folder first. But this leads to problems if the sandbox creates a folder on the remote that also appears in the local_copy_list, at which point the latter will be nested. This happens for PwCalculation, for example, because the plugin manually creates the pseudo folder in the sandbox and then also copies the UpfData files to ./pseudo/filename through the local_copy_list.

In principle, getting rid of the "hack" of merging the local_copy_list into the sandbox is a good thing in my opinion. However, the exact manner in which I did it is what breaks the original functionality, I believe. _copy_local_files has to copy the files from the instructions to a temporary location on local disk, because the Transport interface does not accept file handles and AiiDA's repository interface can only provide streaming handles. The problem is that I first copy all items of the local_copy_list to a temporary directory on disk and then copy all its contents to the remote working directory; this is not identical to copying the local_copy_list instructions over one by one. So the solution, I think, is to copy each item of the local_copy_list as soon as it is written to a temporary file locally, instead of copying them all locally and then copying the temp dir in one go.

sphuber commented 5 months ago

Thanks for the critical review @mbercx . My original solution was horribly flawed and was trying to cover up a bad refactoring of the upload_calculation in an earlier commit. I am now pretty confident that I have found the correct solution which doesn't introduce differences between local and remote copy operations and does not have to touch the behavior of the transports whatsoever. I have tested the change with aiida-quantumespresso and it now works as expected.

Scratch that... it is still not quite that simple 😭

sphuber commented 4 months ago

@mbercx could you give this another review please? The behavior of the remote_copy_list and local_copy_list in your example is now the same once again and I believe the original behavior of the latest release is also restored.

mbercx commented 4 months ago

@sphuber just a note: I'm leaving on holiday tomorrow until the 20th, so will most likely not have time to review until after that... I agree the release should come out soon though. Maybe @giovannipizzi (due to his experience) or @khsrali (due to the fact that he's working on transports) can get involved in the review?

I think the discrepancy between local_copy_list and remote_copy_list in allowing the absence of parent folders is fine, although we could also allow this for remote_copy_list by creating the parent folders as we do for local_copy_list.

The fact that local_copy_list copies the contents of a directory instead of the directory itself seems somewhat strange. I suppose this was necessary to maintain backwards compatibility with the previous approach of merging the sandbox folder and local_copy_list?

sphuber commented 4 months ago

The fact that local_copy_list copies the contents of a directory instead of the directory itself seems somewhat strange

If and only if the target directory already exists right? Otherwise it just copies as is.

The problem here is indeed really that the old implementation did not go directly through the Transport but had an intermediate merging step that used a different API, so we now have to maintain that behavior in order not to break existing plugins, as demonstrated by PwCalculation breaking.

mbercx commented 4 months ago

If and only if the target directory already exists right? Otherwise it just copies as is.

Not sure if that's true, see my example near the end of https://github.com/aiidateam/aiida-core/pull/6348#discussion_r1594585968

GeigerJ2 commented 4 months ago

Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the pseudo folder would be copied inside an already existing one, causing QE to crash:

Just a note that I also stumbled on this behavior now, and installing from the PR here fixes it. So I'll try to dedicate some time to reviewing the code and reading the discussion, maybe that could be helpful for merging.

rikigigi commented 4 months ago

I tested this PR for a while with aiida-quantumespresso, and so far I have not found any issues in the code behavior. I confirm that the main branch is completely broken. Can you add a test that covers this case before merging? Thank you

sphuber commented 4 months ago

Can you add a test that covers this case before merging?

This PR included a regression test already, but it seems I removed it in one of the iterations. I will see to adding it again.

mbercx commented 4 months ago

@sphuber as discussed, I've written a bunch of tests to check the edge cases I've been studying, first based on the latest release tag (v2.5.1):

https://github.com/mbercx/aiida-core/blob/test/upload-calc/tests/engine/daemon/test_execmanager.py

These should all pass for the previous implementation, see

https://github.com/mbercx/aiida-core/actions/runs/9285274053/job/25549435202

I've then added the same tests as a commit on top of this PR:

https://github.com/mbercx/aiida-core/blob/test/upload-calc-order-fix/tests/engine/daemon/test_execmanager.py

Here, 3 tests fail:

https://github.com/mbercx/aiida-core/actions/runs/9285404390/job/25549825351

  1. The sandbox folder is created with the pseudo folder and the local_copy_list tries to copy the contents of a SingleFileData to the target folder, see here. This would fail with an IsADirectoryError in v2.5.1; now it creates an unexpected file hierarchy:

    >               assert serialize_file_hierarchy(filepath_workdir) == expected_hierarchy 
    E               AssertionError: assert {'pseudo': {'... 'Ba pseudo'}} == {'pseudo': {'... 'Ba pseudo'}}
    E                 Differing items:
    E                 {'pseudo': {'pseudo': 'Ba pseudo'}} != {'pseudo': {'Ba.upf': 'Ba pseudo'}}
    E                 Use -v to get more diff

    I.e. a nested pseudo folder.

  2. The sandbox folder is created with the pseudo folder and the local_copy_list tries to copy the contents of the pseudo folder in a FolderData to the target pseudo folder, see here. Here also a nested hierarchy ({'pseudo': {'pseudo': {'Ba.upf': 'Ba pseudo'}}}) is created instead of the previous one ({'pseudo': {'Ba.upf': 'Ba pseudo'}}).

  3. Same as (2), but this time with a nested folder.

So the main point now is that the current copying behaviour of the PR does not preserve that of v2.5.1. I'll see if I can fix that first.
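
For reference, a helper like the test suite's serialize_file_hierarchy, as sketched here, could be reconstructed from the assertion output above (this is a hypothetical version, not the actual implementation): a directory tree becomes a nested dict mapping names to file contents or sub-dicts.

```python
import pathlib


def serialize_file_hierarchy(dirpath: pathlib.Path) -> dict:
    """Return a nested dict representing ``dirpath``: sub-directories map to
    dicts, files map to their text content."""
    hierarchy = {}
    for path in sorted(dirpath.iterdir()):
        if path.is_dir():
            hierarchy[path.name] = serialize_file_hierarchy(path)
        else:
            hierarchy[path.name] = path.read_text()
    return hierarchy
```

A nested pseudo folder then shows up as {'pseudo': {'pseudo': ...}} instead of {'pseudo': {'Ba.upf': ...}}, which is exactly the diff the failing assertions report.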

mbercx commented 4 months ago

Ok, I've extended the number of tests, added tests for the SSH transport, and managed to get it down to one failure (for SSH):

https://github.com/mbercx/aiida-core/actions/runs/9345893736/job/25719640192

See all changes here: https://github.com/mbercx/aiida-core/commit/fa0e9c5fc175e8d7cff34da6705fb80f1673d6b8

It seems the issue is that the local and SSH transports raise different errors when trying to remotely copy a file to a directory that doesn't yet exist. The local transport raises a FileNotFoundError from the shutil.copy command:

https://github.com/aiidateam/aiida-core/blob/acec0c190cbb45ba267c6eb8ee7ceba18cf3302b/src/aiida/transports/plugins/local.py#L565

The SSH transport raises an OSError instead:

https://github.com/aiidateam/aiida-core/blob/acec0c190cbb45ba267c6eb8ee7ceba18cf3302b/src/aiida/transports/plugins/ssh.py#L1191-L1195

For some reason, the engine deals with these two errors differently:

https://github.com/aiidateam/aiida-core/blob/acec0c190cbb45ba267c6eb8ee7ceba18cf3302b/src/aiida/engine/daemon/execmanager.py#L292-L304

I'm now trying to figure out why these two are treated differently.

EDIT: the reason is explained here: https://github.com/aiidateam/aiida-core/commit/101a8d61ba1c9f50a0231cd249c5a7f7ff1d77a4

So, similar to the case of the source file not existing, if the target path is a file in a directory that doesn't exist, the failure will not be transient, and hence we should have the SSH transport raise a FileNotFoundError to be consistent with the local transport.

sphuber commented 4 months ago

I'm now trying to figure out why these two are treated differently.

The OSError is a broad catch-all exception. The FileNotFoundError is a subclass of OSError and more specific. When copying remote files from the remote_copy_list we don't want to raise when the source does not exist, but just warn. If the copy fails for another reason, the problem may be more severe, and so it makes sense to let the calculation error out.

What we should do is check for the SshTransport what kind of OSError it is throwing. You can figure this out through the errno attribute of the exception. You can use the constants in the errno module to figure out the meaning. For example:

import errno
import os

try:
    os.remove(filename)
except OSError as e:
    if e.errno == errno.ENOENT:
        print('File does not exist')

See this PEP as well for details.

If the reason the SshTransport is raising is that the file is missing, we should have the transport also raise the more specific FileNotFoundError. And then the test will pass.
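
A sketch of this suggestion (illustrative names, not the SshTransport code; `copy_func` stands in for the underlying SFTP call, and as noted further on, the errno may in practice be None, in which case a different check is needed):

```python
import errno


def copy_with_narrowed_errors(copy_func, *args):
    """Call copy_func, converting a plain OSError whose errno is ENOENT into
    the more specific FileNotFoundError that the engine checks for."""
    try:
        return copy_func(*args)
    except FileNotFoundError:
        raise  # already the specific subclass
    except OSError as exception:
        if exception.errno == errno.ENOENT:
            raise FileNotFoundError(str(exception)) from exception
        raise
```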

mbercx commented 4 months ago

Thanks @sphuber! Unfortunately, the .errno attribute of the exception that is raised seems to be None...

Instead I've added a check to see if the parent destination directory exists in the SshTransport.copy() method, see:

https://github.com/mbercx/aiida-core/commit/011f5f7b4dc911528b974d0427d2d2fb076add0d#diff-1284fda2c2014c8c418475c99a82308d82ab96cb33e6c49d978c0031f3b1ef3d

It's more patchy for sure, but at least my tests now all pass locally. 😅 I'm praying they also pass on GitHub:

https://github.com/mbercx/aiida-core/actions/runs/9347879530/job/25725810490

mbercx commented 4 months ago

Also double-checked my branch adding these tests on v2.5.1, and can confirm they all pass, save for the same error discussed above.

mbercx commented 4 months ago

My suggestion for proceeding is to split up the PR into 3 commits:

  1. SshTransport: Return FileNotFoundError if destination parent does not exist
  2. Transports: overwrite existing folders if overwrite == True
  3. Engine: Recover behavior for upload_calculation after refactor

sphuber commented 4 months ago

Superseded by #6447