aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
427 stars 186 forks source link

Missing migration for source_file attribute #4401

Open ltalirz opened 3 years ago

ltalirz commented 3 years ago

This was discovered together with @asle85 when inspecting migrated databases on Materials Cloud.

Describe the bug

Early aiida-core versions (pre 0.11.1) used to store the whole source file of an inline function in a source_file attribute of the corresponding node.

The source_file attribute was removed in #1082, and storing the source file in the repository was added in #1196, both were merged between aiida-core versions v0.11.0 and v0.11.1

However, it seems a migration was never added, i.e. after migrating databases with pre-0.11.1 calculations to v1.X, the nodes still have the source_file attribute and the source file is not added to the file repository:

In [1]: n = load_node('da3793cf-5b5b-4b6a-b4aa-2cffc5202b04')

In [2]: type(n)
Out[2]: aiida.orm.nodes.process.calculation.calcfunction.CalcFunctionNode

In [3]: list(n.attributes)
Out[3]:
['sealed',
 'namespace',
 'source_code',
 'source_file',
 'function_name',
 'first_line_source_code']

In [4]: n.list_object_names()
Out[4]: []

The EXPLORE frontend on Materials Cloud expects to find the source file in the file repository and therefore errors when encountering such nodes, e.g. https://dev-aiida-dev.materialscloud.org/2dstructures/api/v4/nodes/da3793cf-5b5b-4b6a-b4aa-2cffc5202b04/repo/contents?filename=source_file

Steps to reproduce

Here is an example export file to try: source_file.zip

Expected behavior

Both new and old nodes should store the source file in the same place.

In my opinion, the source code is a very important part of the provenance, and so I believe we need to introduce a migration for this. Luckily, all the information seems still to be present, i.e. we don't need to ask people to re-migrate old databases.

Mentioning @giovannipizzi @sphuber for input.

giovannipizzi commented 3 years ago

Here is how I would proceed: I would make sure that the frontend does not make assumptions on the existence of the source code. If it's there, it should be visualised, but if it's not there, nothing should be displayed. This should be true for any attribute in general, we should only do assumptions on basic things like uuid, links, ...

Regarding the migration: I'm not sure, for a few reasons:

[BTW this issue makes it even more important to have always (in Materials Cloud), for any node, the 'raw' view that we have been discussing for a long time, where you can see (and download) any raw attributes (say as JSON, and/or visualise them) and any raw file in the repository - in this way if old nodes have data that is in a different place, it is still possible for people to retrieve it in some way - compare with the 'raw download' of GitHub. @asle85 I would suggest that you think at how to prioritise this, as soon as the urgent migrations are done]

ltalirz commented 3 years ago

Thanks for the input Gio!

I certainly agree that on the frontend side we should try to be as tolerant as possible to expected information not being there. This is actually already the case here: https://dev-www.materialscloud.org/explore/2dstructures/details/da3793cf-5b5b-4b6a-b4aa-2cffc5202b04?nodeType=NODE image

Observant users will just see a failed GET request in the JS console image

It means, however, that from an AiiDA 1 perspective, the bit of provenance stored in the source code is effectively lost (it is still stored in the node but in a place where you are not supposed to look).

Concerning data migrations, I don't really follow the argument of the breaking scripts in this particular case. We're talking about nodes here that were generated pre-0.11.1, i.e. any script would need to be migrated up to 1.4.0, involving way more major changes to the API and database schema than what we are talking about here. Furthermore, the migration we are talking about here would actually take care of making sure that these CalfFunctionNodes correspond to the AiiDA 1.4.0 attributes for such nodes, which they currently do not.

In the recent AiiDA article, we write

[...] existing data and their provenance can be automatically migrated, and are therefore guaranteed to remain compatible with the present version of AiiDA

To me, this means that my previous calculations will be represented as closely as possible in the new format. This matters not only for GUIs, but also for reuse of these calculations in new work.

sphuber commented 3 years ago

I think I tend to agree more with @ltalirz here. In the past we have added many data migrations to make older nodes behave similar to their newer counterparts. Also for the legacy JobCalculations that you mention we have added quite some migrations, for example to give them an exit status if they were failed, even though the concept didn't exist at the time.

Also here, as a user, I would expect to have an old InlineCalculation to behave like a CalcFunctionNode after it is migrated. It is migrated to that new node type for a reason after all. So I would expect that all CalcFunctionNodes behave more or less the same. Especially given that the data is still there and so we can migrate the data (we simply forgot) I don't really see a reason not to do it. Sure it is extra work, and migrations that include the file repository are always tricky, but I don't think that is an excuse.

I also don't really follow that we should minimize data migrations. Especially given that we can still now do it, I think we should. Also you mention data plugins, where I agree that it can be a bit more sensitive, but here we are talking about a node type, which are defined by us and cannot be extended through plugins. So I don't see the downside of performing the migration. I really doubt there is anyone out there that wrote script that depend on legacy migrated inline calculation functions to have their source code stored in the database as opposed to the repository.

giovannipizzi commented 3 years ago

Hi, migrations are always a complex topic... anyway, to clarify my point on queries not working anymore: it's not only queries of 0.x that now would break (and I agre that's more or less OK). It's queries of that data imported in 1.4.1 that now work and will break in 1.4.x or 1.5.x or wherever the migration will be imported. This to me seems quite nasty. Anyway, if you think it's important to migrate, I'm OK so please feel free to write the migration!

sphuber commented 3 years ago

Hi, migrations are always a complex topic...

Amen :)

It's queries of that data imported in 1.4.1 that now work and will break in 1.4.x or 1.5.x or wherever the migration will be imported. This to me seems quite nasty.

What queries would this be though? Are you specifically referring to the file source content? Because who would be querying for the file content? It is possible, but is this a likely use case?

That being said, I notice that there are more inconsistencies in the old attributes of calculation functions and those that are currently used. Here is a translation table (based on the attribute names that Leo reported in his OP):

Old New
namespace function_namespace
source_code n.a.
source_file source_file
function_name function_name
first_line_source_code function_starting_line_number

So there are two more attribute names that have changed that have not been migrated.

Ultimately, there are two options:

  1. We leave it as is, leaving ORM broken for legacy InlineCalculations but existing queries that rely on old attributes will continue to work
  2. Add migration which will make all CalcFunctionNodes behave the same through the ORM, but existing queries that rely on legacy attribute names will break

We can present this to the team and have a vote

ramirezfranciscof commented 3 years ago

Ok, after doing some reading and tests I'll try to rephrase some of the information here to make sure I understood as well as updating with the new gathered information. We currently have 2 kind of CalcFunctionNode stored in the DB: the ones created after the changes in the schema and the ones migrated from before. These two have different sets of attributes that we want to make uniform.

Here are the keys I get for the attributes of a migrated InlineCalculation ("Legacy") and for a newly created calcfunction ("Current"). This is similar to the table included by @sphuber above but I also included keys present in new calcfunction that are not in migrated nodes, as well as columns for the description of the attribute and what to do with it in the migration.

Legacy Key Current Key Description To Do
sealed sealed [bool] Wether the node is seal or not  Nothing to do
function_name function_name [str] Name of the function (from calcfunc.__name__) Nothing to do
namespace function_namespace [str] Name of the module (from calcfunc.__globals__['__name__']) Just rename key
source_code N.A. [str] Section of the code for the function Get the information for a new function_number_of_lines attribute (see #4554) and remove this attribute
source_file N.A. [str] Whole file that contains the function Move to repository and remove attribute
first_line_source_code function_starting_line_number [int] Starting line of the function Just rename key
N.A. version [dict] Version of Aiida that created the node? Example value: {'core': '1.5.0'} Ignore (no attribute)
N.A. exit_status [int] Exit status of the node Set to the finished number (0)
N.A. process_state [str] Last state of the process Set to the finished status enum
N.A. process_label [str] For calcfunctions seems to be the same as the name (confirming) Copy the string from function_name?

All of these of course would only apply for node_type == process.calculation.calcfunction.CalcFunctionNode.

~Also, perhaps instead of "unknown" strings I should use None/NULL? I don't know what convention we were using so far~.

Feedback on this? (ex: if the proposed to-dos are what you had in mind) Ping @sphuber @giovannipizzi @ltalirz

ramirezfranciscof commented 3 years ago

The only thing I still couldn't find is how exactly the process_label is set for CalcFunctionNode. Inside of the module aiida.engine.processes.process there is this:

    def _setup_db_record(self) -> None:
        """
        Create the database record for this process and the links with respect to its inputs

        This function will set various attributes on the node that serve as a proxy for attributes of the Process.
        This is essential as otherwise this information could only be introspected through the Process itself, which
        is only available to the interpreter that has it in memory. To make this data introspectable from any
        interpreter, for example for the command line interface, certain Process attributes are proxied through the
        calculation node.

        In addition, the parent calculation will be setup with a CALL link if applicable and all inputs will be
        linked up as well.
        """
        assert self.inputs is not None
        assert not self.node.is_sealed, 'process node cannot be sealed when setting up the database record'

        # Store important process attributes in the node proxy
        self.node.set_process_state(None)
        self.node.set_process_label(self.__class__.__name__)
        self.node.set_process_type(self.__class__.build_process_type())

        parent_calc = self.get_parent_calc()

        if parent_calc and self.metadata.store_provenance:

            if isinstance(parent_calc, orm.CalculationNode):
                raise exceptions.InvalidOperation('calling processes from a calculation type process is forbidden.')

            if isinstance(self.node, orm.CalculationNode):
                self.node.add_incoming(parent_calc, LinkType.CALL_CALC, self.metadata.call_link_label)

            elif isinstance(self.node, orm.WorkflowNode):
                self.node.add_incoming(parent_calc, LinkType.CALL_WORK, self.metadata.call_link_label)

        self._setup_metadata()
        self._setup_inputs()

But I'm not sure where exactly is self.__class__.__name__ defined. I would have expected that this has to be set up somewhere in aiida.engine.processes.functions:FunctionProcess, but I can't find it (or, at least, I don't understand why the self.__class__.__name__ ends up returning the name of the function instead of Process or FunctionProcess).

I will start working on this, assuming that for migrating CalcFunctions I can just copy the function_name into process_label. @sphuber : could you please later take a quick look to see if everything I'm saying sounds ok, and maybe clarify how these two attributes differ and how the latter is set?

giovannipizzi commented 3 years ago

I guess the definition of __name__ is here (@sphuber feel free to correct me)

sphuber commented 3 years ago

I guess the definition of __name__ is here (@sphuber feel free to correct me)

Exactly, that is where we construct an actual Python class on the fly, basing its base attributes on that of the function it is based on. That is why process functions behave as if they were process classes.

sphuber commented 3 years ago

@ramirezfranciscof regarding your table it is all good. Just a few comments: