Could we add an "energy_unit" attribute to the estimators?

alchemistry / alchemlyb

the simple alchemistry library

https://alchemlyb.readthedocs.io

BSD 3-Clause "New" or "Revised" License

197 stars 51 forks source link

Could we add an "energy_unit" attribute to the estimators? #244

Closed DrDomenicoMarson closed 2 years ago

DrDomenicoMarson commented 2 years ago

I was thinking, it could be interesting to have the energy_unit attribute that we now have attached to every parsed DataFrame also attached to the estimators? In this way for example, when one uses the plot_ti_dhdl function, the function itself can get the original units of the data from the input data, without relying on the input of the user.

DrDomenicoMarson commented 2 years ago

Following up on units and visualization, maybe there is an inconsistency in plot_convergence.

In the notes is stated that if the data supplied is a Dataframe as the one produced by 'forward_backward_convergence ', the units parameter is just used to change the labeling in the plot, and changing it won't change the underlying variable.

    Note
    ----
    The code is taken and modified from
    `Alchemical Analysis <https://github.com/MobleyLab/alchemical-analysis>`_.

    If `data` is not an :class:pandas.Dataframe` produced by
    :func:`~alchemlyb.convergence.forward_backward_convergence`,
    the unit will be adjusted according to the units
    variable. Otherwise, the units variable is for labelling only.
    Changing it doesn't change the unit of the underlying variable.

I think this is not the case now, as on lines 61-62 we do:

    if len(data) == 1 and isinstance(data[0], pd.DataFrame):
        dataframe = get_unit_converter(units)(data[0])
        forward = dataframe['Forward'].to_numpy()
        forward_error = dataframe['Forward_Error'].to_numpy()
        backward = dataframe['Backward'].to_numpy()
        backward_error = dataframe['Backward_Error'].to_numpy()
    else:
        try:
            forward, forward_error, backward, backward_error = data
        except ValueError: # pragma: no cover
            raise ValueError('Ensure all four of forward, forward_error, '
                             'backward, backward_error are supplied.')

which I think will convert all the data to the units with which the user issued the function ('kt' as default) if a DataFrame is given, otherwise it will just keep the data as it is, and never try to convert it to the requested units. Using the function with my dataset, which is in kcal/mol, indeed gave me results in kt, actually converting all my data to 'kt', not just labeling incorrectly the plot.

Maybe the documentation is wrongly written, I think what we are doing, and what is more straightforward to implement and easy to understand, is that IF the input data is from our workflow ('forward_backward_convergence ') we KNOW we can use our units converters, so we can rely on the units parameter to be used as an indication that the user want to change the data to a different set of units. While, if the input data is not from our workflow, little we know about the units of the data, so no conversion can be easily made, and so the unit parameter is just for labeling, and the user is responsible to have coherence between data and unit. We should also default units to None maybe, because if data comes from our workflow, having it default to "kt" just convert everything to "kt" if the user doesn't pass the correct units, and this I don't think is what the users want, as when using this function in our workflow, units are usually already settled before, when data is parsed. In the case of different data in input of which we can't retrieve the units, we can check if units is None and ask the user to supply the units in that case.

Another possible related issue here is that we just check if isinstance(data[0], pd.DataFrame. As the code is right now this is not a big problem, because we actually don't need/use (but if we should, standing to what we say in the documentation) the .attrs that forward_backward_convergence add to the DataFrame. So, if we want to be coherent with the documentation, we should check here if data[0] is a DataFrame & if it has the attrs we need, and then just use the attrs to set the units correctly.

I hope not to have misunderstood anything, please let me know how you want to proceed!

xiki-tempula commented 2 years ago

I was thinking, it could be interesting to have the energy_unit attribute that we now have attached to every parsed DataFrame also attached to the estimators?

We want a single source of truth where there is only one place to look for certain information. It might cause confusion to the user as to where to look for the temperature when the information exists both in the data frame and estimators.

In this way for example, when one uses the plot_ti_dhdl function, the function itself can get the original units of the data from the input data, without relying on the input of the user.

The function gets the original units of the data from the input data. When the input unit is different from the unit of the dataframe, the unit conversion will be performed to ensure the output figure has the correct unit and correct value.

In the notes is stated that if the data supplied is a Dataframe as the one produced by 'forward_backward_convergence ', the units parameter is just used to change the labeling in the plot, and changing it won't change the underlying variable.

I must be drunk when I wrote that doc, it should be

Ifdatais an :class:pandas.Dataframe produced by :func:~alchemlyb.convergence.forward_backward_convergence, the unit will be adjusted according to the units variable.`

We should also default units to None maybe, because if data comes from our workflow, having it default to "kt" just convert everything to "kt" if the user doesn't pass the correct units, and this I don't think is what the users want, as when using this function in our workflow, units are usually already settled before, when data is parsed.

So the unit is a new thing while the visualization is an old thing. The primary goal is to ensure that the old scripts that people had, works as intended. By default, all of the units would be "kt", so we want to establish the expectation that the unit is expected to be kt unless people changed it. So if people want to have other units, they have to actively change it in the plot_convergence.

DrDomenicoMarson commented 2 years ago

We want a single source of truth where there is only one place to look for certain information. It might cause confusion to the user as to where to look for the temperature when the information exists both in the data frame and estimators.

That's my fault, I noticed, now that I read the source code for the estimators, that every DatFrame in the estimators objects inherit attrs from the "parent" DataFrame. Maybe the docs are to be updated? In the docs delta_f_ and the other attributes of the estimators are stated to be "dimensionless", while they are indeed in the dimension of the DataFrame to which the estimator was fitted, and have already their own energy_unit set. Maybe we could implement a method like conver_to into the estimators, as right now one could do

ti = TI().fit(df)  # df is in kcal/mol
# now  ti.delta_f_, ti.d_delta_f_, and ti.dhdl are all in kcal/mol
ti.delta_f_ = to_kT(ti.delta_f_)
# now the user has an estimator with delta_f_ values in kT and the d_delta_f_ in kcal/mol

and this would lead to an "inconsistent" state for the estimator, while calling this

class TI():
  ...
  ...
  def convert_to(self, units):
     convert = get_unit_converter(units)
     self.delta_f_ = convert(self.delta_f_)
     self.d_delta_f_ = convert(self.d_delta_f_)
     self.dhdl = convert(self.dhdl)

the estimator will always be consistent. What do you think about it?

PS: in the Parameters of plot_ti_dhdl we are stating that units is just for labeling (hence my initial confusion), while now is also used to properly change the energy unit (as it is reported in the versionchanged below). I think the Parameters section should reflect the actual behavior, and we should not rely on the user to read the versionchanged.

DrDomenicoMarson commented 2 years ago

Ifdatais an :class:pandas.Dataframe produced by :func:~alchemlyb.convergence.forward_backward_convergence, the unit will be adjusted according to the units variable.`

Great, so this is just a quick fix in the documentation then :-)

DrDomenicoMarson commented 2 years ago

By default, all of the units would be "kt", so we want to establish the expectation that the unit is expected to be kt unless people changed it. So if people want to have other units, they have to actively change it in the plot_convergence.

I'd argue that moving to version 1 we could change this default behavior to rely more on the new attrs that everything in the workflow will now have. Usually, energies in papers are reported as kj/mol or kcal/mol, and, for example, anyone working with AMBER is much more familiar with dealing with energy in kcal/mol (the same goes for kj/mol for GROMACS users I think). Moreover, we are inconsistent because our estimators don't have a units parameter, so they work with whatever they are given.

Right now, for example, I'm parsing the AMBER files (which are in kcal/mol), I obtain DataFrames in kT which I readily convert to kcal/mol, then I have always to remember to pass the units' keyword to all subsequent functions (except for the estimators, which are able to deal with my DataFrame as it is). And if for some reason I want to change the beginning of my script and convert the parsed data to kj/mol (maybe because the journal wants everything in kj/mol [it happened :-(] then I have to remember to change all theunitsparameter in all subsequent functions (right now I'm passingunits=input_df.attrs["energy_units"]` just to avoid that, which I think is a needless complication).

So, I think either we force users to use units in every step of the workflow (also in the estimators) or we make all alchemlyb functions able to get the units from the DataFrames and don't rely on a default unit. But this is just my opinion, everything is fine also as it is now, once the documentations are properly fixed :-)

xiki-tempula commented 2 years ago

In the docs deltaf and the other attributes of the estimators are stated to be "dimensionless", while they are indeed in the dimension of the DataFrame to which the estimator was fitted, and have already their own energy_unit set.

Yes, this is an issue and should be fixed.

Maybe we could implement a method like conver_to into the estimators

Sorry, do you mind giving a use case where this could cause some issues?

I think the Parameters section should reflect the actual behavior, and we should not rely on the user to read the versionchanged.

I do agree that the doc needs to be written in a more clear fashion but we do expect the user to read versionchanged which is why we take time to write it in the first place.

So, I think either we force users to use units in every step of the workflow (also in the estimators) or we make all alchemlyb functions able to get the units from the DataFrames and don't rely on a default unit. But this is just my opinion, everything is fine also as it is now, once the documentations are properly fixed :-)

Sorry, I don't quite understand what the problem is. I think the main problem is that I wrote most of the unit conversion and visualization code, so they are written in a way which I think is clear, so I would rely on the normal users to tell me why it doesn't make sense.

My thought is that the preprocessing, estimator, post-processing code are all unit-conscious, which means that they can take a dataframe with any unit and perform the required operation. The user only needs to do the unit conversion on the final value.

Say if the user wants a float, they use to_kcal on the resulting dataframe to get the final value. If the user wants a plot, they use the unit keyword to specify the unit of the final plot. So the unit conversion is only done in the last step and all the steps before should not require any unit conversion.

DrDomenicoMarson commented 2 years ago

Sorry, do you mind giving a use case where this could cause some issues?

I can't think of a use case in particular, it seems just strange that an user can change the internal state of part of the estimator somewhere in the code, then accessing the estimator somewhere else in the code he could forget he had altered it and print deltaf and d_deltaf which are in different units :-) And the implementation of a convert method it's straightforward, but this is not a big deal actually!

DrDomenicoMarson commented 2 years ago

I do agree that the doc needs to be written in a more clear fashion but we do expect the user to read versionchanged which is why we take time to write it in the first place.

Speaking as an user, I usually don't read the versionchanged if it is the first time I use a package, because I assume that in the versionchanged are reported changes in the behaviour between versions, but as I'm starting to use the software right now, I assume what I have to know is reported in the documentation/notes. I think it's important that versionghaged is added, but more important is that the notes/documentation are coherent with the current behavior of the code. Right now I find the documentation is pretty clear, we have just to adjust this couple of things.

DrDomenicoMarson commented 2 years ago

Sorry, I don't quite understand what the problem is. I think the main problem is that I wrote most of the unit conversion and visualization code, so they are written in a way which I think is clear, so I would rely on the normal users to tell me why it doesn't make sense.

I don't think it doesn't make sense, I'm sorry if it seemed so. Right now it's all working pretty well!

The thing is, while I was using the library, my first approach was: I read all from Amber, I got from the parser values in kT, I readily change my DataFrame to kcal/mol and expect that after that I could forget about units. So while I implemented in my workflow the different parts of the analysis, it was strange to me that every time I had always to add the units I wanted, as from my point of view I already have stated that my DataFrame is in kcal/mol and I want to work with this units.

The reason I prefer to change my DataFrame right at the beginning to kcal/mol is that I want to be able to inspect that my scripts are doing what I want them to do, so implementing the workflow at the beginning I print quite often some values from the DataFrames of the file I read, or after having concatenated some files, and to check that the values were correct I needed to have them in kcal/mol to compare with the actual data in the Amber files.

In the beginning, I thought that alchemlyb needed to assume everything was in kT because the different functions couldn't know the units of the data they were processing, but digging deeper I noticed that every df is "tagged" with the units value, so it seemed a little bit redundant to have every time to specify the units.

xiki-tempula commented 2 years ago

then accessing the estimator somewhere else in the code he could forget he had altered it and print deltaf and d_deltaf which are in different units

Make sense, I will make it that one could only get deltaf and d_deltaf but cannot be set them. We don't want the user to modify the estimator object, as this could give raise to the inconsistency issue that you have suggested.

The reason I prefer to change my DataFrame right at the beginning to kcal/mol is that I want to be able to inspect that my scripts are doing what I want them to do, so implementing the workflow at the beginning I print quite often some values from the DataFrames of the file I read, or after having concatenated some files, and to check that the values were correct I needed to have them in kcal/mol to compare with the actual data in the Amber files.

This does make sense. I do the unit conversion at the end and you do it at the beginning.

it seemed a little bit redundant to have every time to specify the units

Correct me if I'm wrong. So I think currently only the visualization part of the code will ask the user to specify the unit. I will try to think of a way that all the visualization function would have a uniformed logic.

DrDomenicoMarson commented 2 years ago

Make sense, I will make it that one could only get deltaf and d_deltaf but cannot be set them. We don't want the user to modify the estimator object, as this could give raise to the inconsistency issue that you have suggested.

This is a great idea, yeah!

orbeckst commented 2 years ago

Speaking as an user, I usually don't read the versionchanged if it is the first time I use a package, because I assume that in the versionchanged are reported changes in the behaviour between versions, but as I'm starting to use the software right now, I assume what I have to know is reported in the documentation/notes. I think it's important that versionghaged is added, but more important is that the notes/documentation are coherent with the current behavior of the code. Right now I find the documentation is pretty clear, we have just to adjust this couple of things.

I agree with you @DrDomenicoMarson . A user should be able to read the docs and know what's happening.

This is the ideal to strive for. In reality, not everything is done perfectly and something slips through. The absolutely most helpful thing to do in the case of doc issues is to immediately raise a PR with the proposed fix. Don't wait for any of us to fix it: if you find a problem and have an idea how to write it better then you're the most qualified person to fix it because you have the correct perspective for reading the docs.

When I do doc fix PRs I typically just use the GitHub webinterface to edit a file directly. That will create a PR without me having to worry to much about branches on my local repo. The key is that this is a mini-PR that can get reviewed quickly, which is infinitely better than adding doc fixes in big PRs or waiting for an enormous "fix all docs" PR.

orbeckst commented 2 years ago

I am going to mark the issue itself as "won't fix" because we're keeping the energy associated with the data, as @xiki-tempula explained.

Please raise a separate issue for any documentation problems or just fix the docs with a PR.