aiidateam / aiida-wannier90

AiiDA plugin for the Wannier90 code
https://aiida-wannier90.readthedocs.io
Other
9 stars 14 forks source link

Remove opening transport during `prepare_for_submission` #37

Closed giovannipizzi closed 4 years ago

giovannipizzi commented 4 years ago

Currently, during the prepare_for_submission we open a transport to inspect the files available on the remote, and then perform logic to decide what to do:

https://github.com/aiidateam/aiida-wannier90/blob/68c03557465b5e0e44d95c5e92549175460abd06/aiida_wannier90/calculations.py#L142-L149

However, this is not ideal, because

While we discuss if this can be changed in AiiDA itself (see aiidateam/aiida-core#3477), I think we can apply the following different design: instead of inspecting the files and deciding what to do, we should instead ask explicitly the user where the files are (e.g. via specific flags in the settings input node, and maybe require these only if we pass both a remotedata input and a folder data input)? Anyway, for single simulations, the user can check; and for workflows, the workflow probably knows what the previous step retrieved.

greschd commented 4 years ago

@giovannipizzi after looking into this, I realize I had misunderstood this issue: For some reason, I assumed we could just specify the full list of files that could possibly be copied. Instead, of course AiiDA will check if the files are actually present, and fail if they are not.

Explicitly specifying the files is problematic because of the UNK* files, whose filenames are somewhat dynamic.

I think we should decide what the long-term interface should be, and then implement a (maybe suboptimal) version that will be forward-compatible with that. If it's feasible to keep the listdir of a RemoteData in the local DB, I would actually prefer the current interface.

For reference, here's the list of possible input files

seedname.mmn
seedname.amn
seedname.dmn
seedname.eig
seedname.win # Generated by the plugin
seedname.chk # Can also be OUTPUT
UNKp.s
seedname_htB.dat # Can also be OUTPUT
seedname_htL.dat
seedname_htR.dat
seedname_htC.dat
seedname_htLC.dat
seedname_htCR.dat
seedname.unkg
seedname.uHu
seedname.spn
giovannipizzi commented 4 years ago

Hi @greschd , thanks!

Indeed, for the local copy list, all files need to be specified. But for these, we have them in the repository so we can expand the '*' if needed.

For the remote copy list, the call is here:

https://github.com/aiidateam/aiida-core/blob/67d4504c0c02bdb09497600f7ca15ff56e0575a5/aiida/engine/daemon/execmanager.py#L220

and transport.copy supports globbing (currently in the plugin we go and expand UNK* to the explicit list of files, but this is not needed, and we could just say UNK* in the source, and a folder name in the destination).

The only problem I see (maybe we discussed this already) is the order of operations - now the files in the local_copy_list are transferred first, and might be overwritten by the remote_copy_list (and probably we would like by default to have the opposite behaviour?). In the future we can have in AiiDA a flag to swap the order.

But if we go with #76, this is not a problem right? Because we can only specify one parent_folder, either FolderData or RemoteData. We just need to put in some logic where we expand the * with the files_finder only if we have a FolderData - would this work?

greschd commented 4 years ago

Aha! Yeah this might work.. so the logic would be

is that correct?

Of course we won't have the validation or required input files in the case of a remote input, but I think that is unavoidable.

We probably want to keep additional_remote_symlink_list, additional_remote_copy_list and additional_local_copy_list, right?

giovannipizzi commented 4 years ago

I think what you suggest is ok.

A comment on one thing that I just realised, however - if you call transport.copy with a specific filename, if will fail if it does not exist, while it does not raise if it's a globbing pattern that does not match anything... Maybe this is a bit of an unexpected behaviour (and this might make things a bit more complex in our case).

See e.g.

In [21]: t.copy('sd?', 'output')                                                

In [22]: t.copy('sdf', 'output')                                                
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-22-3a076541162d> in <module>
----> 1 t.copy('sdf', 'output')

~/git/aiida-core/aiida/transports/plugins/local.py in copy(self, remotesource, remotedestination, dereference, recursive)
    552         if not self.has_magic(remotesource):
    553             if not os.path.exists(os.path.join(self.curdir, remotesource)):
--> 554                 raise OSError('Source not found')
    555         if self.normalize(remotesource) == self.normalize(remotedestination):
    556             raise ValueError('Cannot copy from itself to itself')

OSError: Source not found

Maybe an approach would be to symlink files using the pattern *.amn ignoring what the seed name is? Or other ideas?

greschd commented 4 years ago

Yeah, I would just glob on the suffix, then.

As a long-term solution, maybe we can introduce the concept of "required" vs. "optional" files in AiiDA itself? I think this only makes sense if other codes use a similar pattern, though.

It's also just a bit too easy to accidentally open a transport. Here we do it explicitly, but I think this could be hidden in a call to RemoteData.listdir.

giovannipizzi commented 4 years ago

Yeah, I would just glob on the suffix, then.

Great!

As a long-term solution, maybe we can introduce the concept of "required" vs. "optional" files in AiiDA itself? I think this only makes sense if other codes use a similar pattern, though.

It's a good point. Do you want to open an issue on aiida-core (I don't think we have one), both for the order of local vs. remote, and for this?

It's also just a bit too easy to accidentally open a transport. Here we do it explicitly, but I think this could be hidden in a call to RemoteData.listdir.

It's a good point. Suggestions welcome on how to change this (maybe we have to change RemoteData.listdir to have a specific option open_transport=True, and set this by default to False, and use some internal cache as you were suggesting? I think this is also good for an issue on aiida-core (that might go to 2.0 though, since it's backward-incompatible)