SSCHAcode / python-sscha

The python implementation of the Stochastic Self-Consistent Harmonic Approximation (SSCHA).
GNU General Public License v3.0
54 stars 21 forks source link

Comply with `PEP 8` and `PEP 621` standards #138

Open bastonero opened 1 year ago

bastonero commented 1 year ago

PEP 8 is now the standard for Python packages since a decade, and it might limit and make more difficult the usage of the code. Here, I also suggest to comply to the PEP 621 standards, which removes the deprecated setup.py in favour to the pyproject.toml.

mesonepigreco commented 11 months ago

Hi Lorenzo. Is it possible to use in the pyproject.toml the Fortran support for compilation of the deprecated numpy distutils? I think a modernization of the installation is for sure needed, but we are now limited to the fact that the python and numpy decided to drop support for Fortran extensions, particularly those requiring Lapack and Blas dependencies. I saw most projects solve the issue by distributing custom binary wheels, but this is not that easy for us, as, it is my knowledge, there is no free auto build of binaries for any platform supporting complex Fortran extension.

Also, as a scientist, I find very depressing having to care to this stupid stuff which take away a lot of time without changing anything in the code workflow. Time that could be spent improving the project, developing new algorithms and adding more feature to do new science.

bastonero commented 11 months ago

Hi Lorenzo, regarding the use of pyproject.toml I found this project using it via setuptools and wheels. Don't know honestly how much time consuming would be, but I guess sooner or later these changes will have to be made to support newer python versions.

While I do understand your viewpoint, consider that standardizing the code to at least PEP8 is rather urgent. It is in fact a bit annoying to navigate the current code while standards on all other common used codes are different. ase, pymatgen, flare and so on support these standards. And also it makes it easier to maintain the code, which is getting rather big.

Regarding the pyproject.toml, I think you will need it anyways, so it is a one-timer effort here, which will probably make even easier to install the code and to release a conda installation (especially for the future where flare-sscha-aiida will need to be installed in the same environment).

Happy to help in the transition.

mesonepigreco commented 11 months ago

Yes, I agree. also numpy.distutils is quite buggy and causes errors in the compilations on new M1 machines. Since it is deprecated, there are no hopes that it will be fixed. I see the urgency. The major problems is that the new setuptools does not support Fortran and LAPACK builds, that our code uses quite heavily. So we should use an external build mechanism for the fortran extension, like scipy does, for example, with MESON: https://github.com/scipy/scipy

And use f2py to generate the wrappers between fortran and python. It is probably not so difficult, but requires quite a bit of time. Scipy should be the reference project to look at as they are compiling tons of Fortran+LAPACK code.

Regarding the code update to PEP8, I really care about backcompatibility. It is extremely annoying for a user to have prepared all his script, and update sscha, and discover that nothing works anymore because they are calling DiagonalizeSupercell instead of diagonalize_supercell.

Probably to get backcompatibility, we can implement in all classes the __getattribute__ function with something like:

def __getattribute__(self, name):
       # Substitute uppercase with lowercase and insert _ to split the words in name
       # and save into new_name
       attr = object.__getattribute__(self, new_name)
       if name != new_name:
             warnings.warn(f"{name} is deprecated, use {new_name} instead.")
       return attr

In this way we can safely rename all the methods, and keep backcompatibility. I do not know how much this could impact on performances, but maybe we can use some macros to deactivate the check in version 2.0 if an environmental variable is not set, so that we still keep backward compatibility.

bastonero commented 11 months ago

Sorry for the late reply.

My humble opinion on backwards compatibility: versioning of the code is made on purpose to say "from this version there will be breaking changes", hence if you want the newest version this and this changed. This is usually the convention for the first leading number. The amount of users using the code is still limited, so better now than later. Morevoer, it is true that is annoying, but sooner or later such users will have to do it either way, and changing 50 lines of script is rather fast. Moreover, most of the names will be kept quite the same: Phonons -> phonons; GetThisAndThat -> get_this_and_that; Variable -> variable.

Plus, no-one is breaking anything until one updates his own compiled code, which at a certain point will have to happen.

Also, keeping backwards compatibility for the entire code which has to be refactored can lead to different unwanted outcomes, e.g. the package will be heavier for no actual benefits, developing wise will be hell, possibly doing mistakes and annoying eventually users anyways.

Also in my opinion sticking to this standards soon will attract more people to the usage of the code(s).

mesonepigreco commented 11 months ago

Hi, I'm afraid I have to disagree. If I spend time writing code in a language with more than 30 years of life, like Python, I expect it to compile forever without needing daily maintenance.

Just last week, I wasted two days of work trying to figure out how to encapsulate a code in an old environment developed only two years ago and broken after two years by the updates of the libraries. I spent all the time allocated to that project (I needed to add a new feature to a server we were releasing) to have the code working again. Scientists do not have time to deal with these things, and this is the reason why still most people in our environment prefer Fortran to Python: a code written in the '70s still works and compiles on modern Fortran compilers (you do not require people to freeze their environment to the '70s). Why does a code written two years ago not work on the new Python interpreter and need continuous updates to adhere to new standards without implementing any bug fixes or functionality?

I agree that adhering to the new standard is good, and we will try to adopt them in the SSCHA. For this, pull requests are welcome! But as the leading developer, I do not have any time to work on that (all the time I spend on the code is for bug fixes or new functionalities, which are for our users, collaborators, and myself, are the highest priority).

I also do not want users to rewrite all their scripts just because we release SSCHA 2.0. From personal experience, what happens is just that people will not update the code. I have suffered a lot from backward compatibility issues, and I do not want to put users of this code under the same pain. For this, adding a function in each module like the one I cited does not make the code more heavy. But the old API is callable even if the code is ported to new standards (at no point inside the source we would have the GetThisAndThat, but a script calling GetThisAndThat would implicitly call get_this_and_that under the hood).

The point is not 50 lines of a script, but imagine a user preparing the scripts to automatize something with the SSCHA and using it for 2-3 years. Then he/she changes the PC, installs the SSCHA, and wants to run some simulations quickly (crunch time). Nothing works because the new code installed is a new version breaking compatibility. Maybe he/she is just a user with little-to-none experience in programming. How much time will he/she lose to understand that the version needed is different? They may be using a new version of numpy exploiting some new functionality, which is incompatible with the old SSCHA employed in the original scripts, thus making an effort to understand what needs to be changed and port their code, all of this maybe when they have other things to do because a deadline is approaching. This happened so many times to me.

A code should be forever. Or at least for 10-20 years.

bastonero commented 11 months ago

Hi Lorenzo, I understand your points. Although, just notice that PEP8 was proposed 22 years ago.

Re Fortran viewpoint: Fortran is a very simple code language, not meant for interoperability, and many things of Fortran eventually slowed down development at a certain point. I guess that's also why most of the people are choosing Julia or other languages nowadays.

My opinion is rather the opposite: a code is always evolving, getting better and improving, and code versioning is allowing to do so. And if the old user just want an older version of your code, it can simply be installed with the older versions of python. One cannot just look at the worst users' case scenarios, but one should also consider all the other benefits that come along.

This said, I am not here on board to make priority decisions, and possibly a bit biased on the usability/development side. I just think that the simple PEP8 would make many things much easier (e.g. interface with other codes).

eli-schwartz commented 11 months ago

Hi Lorenzo, I understand your points. Although, just notice that PEP8 was proposed 22 years ago.

PEP8 being proposed 22 years ago doesn't really do anything to help people today who have written correct code, and discover tomorrow that by then the code is incorrect.

Although python as a language changes much more frequently than fortran, the flip side is that it offers deprecation periods where the old version of everything works, but is warned against at runtime. Users have a couple years to switch over. I think this is pretty important -- telling people "oh, nothing works now and everything exploded, your only hope is to immediately research what has been renamed and figure out how to update that and then roll out updated versions of your code to use the new approach" is terrible and clumsy.

I am highly skeptical that backwards compat functions which people do not use are going to slow down their code.

Backwards compat functions which people do use slow down their code a lot. Before, their code crashed instantly, now their code actually does something, and "doing something" takes time. I think that on the whole, users will accept that tradeoff!

mesonepigreco commented 11 months ago

I see your point. We will implement the PEP8 standard in the code and move to meson to compile the Fortran source. As I mentioned, there are two caveats:

So pull requests are welcome!

Also, we must consider the average user of the code, which is likely to have little to no experience with Python or programming in general. To avoid code slowdown we could implement conditional checking on the naming of functions within the __debug__ flag (if it is still supported by PEP standards XD) or assert statements, and print a DeprecationWarning.

bastonero commented 11 months ago

If you open a new branch here in the main repository I can mirror it in my fork and work out the PEP8. I think this should be somehow fast to do, and hopefully the tests will run. Moving to meson seems indeed much more involved. Moreover, it is less urgent as it will be only a problem for python 3.12.

Re backwards comp., here my ideas, but there could be better ways I guess

  1. The modules will have to be re-named, hence I think it's very hard NOT to duplicate files. e.g sscha.Minimizer is related to Minimizer.py. To the best of my knowledge, if we want to rename it to sscha.minimizer while keeping sscha.Minimizer we should duplicate Minimizer.py into minimizer.py. Then minimizer.py will stick to PEP8, while Minimizer.py will contain only code calling minimizer.py.
  2. To print the warnings, we can simply use a decorator (simple example). To make them not to be printed multiple times during the workflows (as some functions may be called multiple times) we can just set the level of warning low (other details). Qiskit seems to handle this nicely (here). We could also kind of copy paste some of their deprecation functionalities. Same story for modules and classes.

Since this effort is to comply with nowadays standards, I would suggest to support a pre-commit for the refactored and future code. In particular:

  1. Use ruff as linter, which enables to improve the code.
  2. Use isort and black (and maybe flynt) as formatters.
  3. Use pydocstyle to check docstrings.
  4. Consider mypy for static typing checks, that could be very interesting in the future (but less urgent).

These will make development much easier, e.g. pydocstyle would already check that docstrings are correctly formatted (and are present as well), isort and black will automatically format the code to be "prettier", and the linter ruff will help making the code cleaner, safer and possibly faster.

Let me know if this sounds good to you (:

eli-schwartz commented 11 months ago

1. The modules will have to be re-named, hence I think it's very hard NOT to duplicate files. e.g sscha.Minimizer is related to Minimizer.py. To the best of my knowledge, if we want to rename it to sscha.minimizer while keeping sscha.Minimizer we should duplicate Minimizer.py into minimizer.py. Then minimizer.py will stick to PEP8, while Minimizer.py will contain only code calling minimizer.py.

You could do it with a meta import hook: https://docs.python.org/3/reference/import.html#the-meta-path

2. To print the warnings, we can simply use a decorator (simple example). To make them not to be printed multiple times during the workflows (as some functions may be called multiple times) we can just set the level of warning low (other details).

You don't strictly need a decorator, those primarily exist to reduce repetitive boilerplate. Just warnings.warn() at the top of the function, really.

For warning only once per function/class, you probably want to install a warning filter, see the "once" action at https://docs.python.org/3/library/warnings.html#the-warnings-filter.

Since this effort is to comply with nowadays standards, I would suggest to support a pre-commit for the refactored and future code. In particular:

I think that using code style linters is pretty irrelevant for users, for the most part, as long as the public API is easy to understand. It's also mostly not pertinent when it comes to PEP 8. (In particular, black violates PEP 8.) Although that might still be desirable, it probably makes sense to create a separate github issue for it.

and the linter ruff will help making the code cleaner, safer and possibly faster.

Cleaner, probably. Faster, well, most things ruff can do won't really affect speed, but some ruff linting rules will actually make the code a bit slower and some a bit faster. Safer? I think it's rather stretching the point to suggest that ruff fixes your code from security vulnerabilities. (Are there currently concerns about that here?)

bastonero commented 11 months ago

You could do it with a meta import hook: https://docs.python.org/3/reference/import.html#the-meta-path

It seems more work than just duplicate the files. Won't it be faster for the future having the old files with the old names, so one can just delete them once their time has come?

You don't strictly need a decorator, those primarily exist to reduce repetitive boilerplate. Just warnings.warn() at the top of the function, really.

Indeed, it seemed to me easier to have a decorator since the entire code has to be deprecated :D

For warning only once per function/class, you probably want to install a warning filter, see the "once" action at https://docs.python.org/3/library/warnings.html#the-warnings-filter.

Yeah, that's what I meant.

I think that using code style linters is pretty irrelevant for users, for the most part, as long as the public API is easy to understand. It's also mostly not pertinent when it comes to PEP 8. (In particular, black violates PEP 8.) Although that might still be desirable, it probably makes sense to create a separate github issue for it.

I mean, this entire refactoring is meant more for the developers more than for the users. It is extremely hard to maintain a code where each file has more than 3000 lines and few to 0 conventions.

Not that I particularly like black, but at least it formats automatically the code in a more or less uniform way.

Cleaner, probably. Faster, well, most things ruff can do won't really affect speed, but some ruff linting rules will actually make the code a bit slower and some a bit faster. Safer? I think it's rather stretching the point to suggest that ruff fixes your code from security vulnerabilities. (Are there currently concerns about that here?)

Pardon, I meant safer in the sense of being more reliable and trustable.

All these changes are again more for developers than for users. If the code is more readable and cleaner, many things are just easier (e.g. spotting bugs, maintain, etc.). At least, this is my personal little experience.

eli-schwartz commented 11 months ago

It seems more work than just duplicate the files. Won't it be faster for the future having the old files with the old names, so one can just delete them once their time has come?

A meta hook could avoid confusing readers who see a bunch of extra files and have to figure out whether they are important. :) It is also easier to delete, since you just delete a couple lines from __init__.py to stop installing the import hook.

Not that I particularly like black, but at least it formats automatically the code in a more or less uniform way.

Hmm, curious if you've looked at https://github.com/google/yapf#readme. There is also autopep8 of course...

bastonero commented 11 months ago

A meta hook could avoid confusing readers who see a bunch of extra files and have to figure out whether they are important. :) It is also easier to delete, since you just delete a couple lines from __init__.py to stop installing the import hook.

Ok, now I see actually, since we don't have to rename the classes here. Thanks, this sounds nice (:

Hmm, curious if you've looked at https://github.com/google/yapf#readme. There is also autopep8 of course...

Yeah, I indeed use yapf for my projects, but some collaborators of mine continue suggesting black. Don't know really the benefits of one more than the other. To me it would be already great to have at least one formatter. Then probably it comes to a matter of taste at a certain point?

@mesonepigreco if you like the idea, then we can try to keep the imports as @eli-schwartz mentioned, and then we can think of a decorator which prompts some standard deprecation message.

E.g.

@depracated_method(since=1.4, remove=2.0, new_method='get_this')
def GetThis(...):
    ...

So that it prints somethings like

The `GetThis` method is deprecated since 1.4 and will be removed in 2.0. Use `get_this` instead.
mesonepigreco commented 11 months ago

You could do it with a meta import hook: https://docs.python.org/3/reference/import.html#the-meta-path

It seems more work than just duplicate the files. Won't it be faster for the future having the old files with the old names, so one can just delete them once their time has come?

Wait, duplicating files or functions is really bad and a way worse violation of good coding habits than PEP8. This is going to be super confusing for both users and developers and destroy the git versioning of open pull requests. We can also choose to adopt only partially PEP8, and leave the names of the modules as they are.

For warning only once per function/class, you probably want to install a warning filter, see the "once" action at https://docs.python.org/3/library/warnings.html#the-warnings-filter.

Yeah, that's what I meant.

Why not add this function to all classes? If you want to be even more "elegant", just define a custom Object class with this function defined in it and then make all others inherit from it.

import re 

class SSCHAObject:
  def __getattribute__(self, name):
         # Substitute uppercase with lowercase and insert _ to split the words in name (snake_case)
         formatted_name = '_'.join(re.findall('[A-Z][a-z]*', name)).lower()
         # NOTE THAT IT DOES NOT WORK AS INTENDED AND SHOULD BE FIXED

         if name != formatted_name:
               warnings.warn(f"{name} is deprecated, use {formatted_name} instead.")

         # and call the function with formatted_name
         attr = object.__getattribute__(self, formatted_name)
         return attr

And just fix all the functions once to rule them all, it will avoid duplicating a lot of code. It should work. This seems way easier and cleaner than duplicating all the functions.

It can be coupled with my custom code to suggest similar keywords if a wrong attribute was set that Python implemented in 3.11

Maybe it would be better to define this on cellconstructor, and make sscha use this behavior.

I think that using code style linters is pretty irrelevant for users, for the most part, as long as the public API is easy to understand. It's also mostly not pertinent when it comes to PEP 8. (In particular, black violates PEP 8.) Although that might still be desirable, it probably makes sense to create a separate github issue for it.

I mean, this entire refactoring is meant more for the developers more than for the users. It is extremely hard to maintain a code where each file has more than 3000 lines and few to 0 conventions.

I think duplicating functions will make it much worse. The code has its own inner conventions. PEP8 is not a gold solution that solves all problems, in particular, if adopting PEP8 while keeping compatibility changes from 3000 to 6000 lines with tons of duplicated code, then we should not adopt it until we find a better solution.

Not that I particularly like black, but at least it formats automatically the code in a more or less uniform way.

Cleaner, probably. Faster, well, most things ruff can do won't really affect speed, but some ruff linting rules will actually make the code a bit slower and some a bit faster. Safer? I think it's rather stretching the point to suggest that ruff fixes your code from security vulnerabilities. (Are there currently concerns about that here?)

I would say the only "safety" concerns are on the Cluster module because it goes highly in hyperthreading to manage simultaneous connections and it was in the past affected by memory race issues in shared memory between threads.

Pardon, I meant safer in the sense of being more reliable and trustable.

All these changes are again more for developers than for users. If the code is more readable and cleaner, many things are just easier (e.g. spotting bugs, maintain, etc.). At least, this is my personal little experience.

How spotting a bug is easier if the name of a function change from DiagonalizeSupercell to diagonalize_supercell or by changing from 3 to 2 the number of white lines after the end of a function is a mystery to me. However, I agree that, if the naming convention is mixed, you need to guess which function you have to use while programming and using the API, which is bad.

mesonepigreco commented 11 months ago

The only potential problem I see is the overhead for calling it for each function call, as well as the potential not availability of the doc string for the deprecated functions. This could affect performance. Luckily, the code is adequately benchmarked, so we can easily measure how much of a burden this solution is.

bastonero commented 11 months ago

Wait, duplicating files or functions is really bad and a way worse violation of good coding habits than PEP8. This is going to be super confusing for both users and developers and destroy the git versioning of open pull requests. We can also choose to adopt only partially PEP8, and leave the names of the modules as they are.

True true, bad idea indeed!

Why not add this function to all classes? If you want to be even more "elegant", just define a custom Object class with this function defined in it and then make all others inherit from it.

I think this might apply only to certain cases. What about more compact method names? Won't be then weird? And what about the attributes? Should they also be updated?

I think duplicating functions will make it much worse. The code has its own inner conventions. PEP8 is not a gold solution that solves all problems, in particular, if adopting PEP8 while keeping compatibility changes from 3000 to 6000 lines with tons of duplicated code, then we should not adopt it until we find a better solution.

I guess adding three lines per method is ok overall. And then one can also choose new names for variables. It's not just a matter of capital letters.

How spotting a bug is easier if the name of a function change from DiagonalizeSupercell to diagonalize_supercell or by changing from 3 to 2 the number of white lines after the end of a function is a mystery to me. However, I agree that, if the naming convention is mixed, you need to guess which function you have to use while programming and using the API, which is bad.

PEP8 will do nothing indeed, I was referring to the introduction of linters and formatters. Overall one can get quite tired of scrolling many many lines, or lines with very different lengths; in a way then it prevents a bit to make mistakes. For sure it won't make easier per se, while only tests are a robust way in this sense.

mesonepigreco commented 11 months ago

Why not add this function to all classes? If you want to be even more "elegant", just define a custom Object class with this function defined in it and then make all others inherit from it.

I think this might apply only to certain cases. What about more compact method names? Won't be then weird? And what about the attributes? Should they also be updated?

Attribute names already follow PEP8 (I believe), the real big problem only regards function names (which, as you mention, are inconsistent throughout the code and can create confusion).

I think duplicating functions will make it much worse. The code has its own inner conventions. PEP8 is not a gold solution that solves all problems, in particular, if adopting PEP8 while keeping compatibility changes from 3000 to 6000 lines with tons of duplicated code, then we should not adopt it until we find a better solution.

I guess adding three lines per method is ok overall. And then one can also choose new names for variables. It's not just a matter of capital letters.

True, this also avoids any overhead. Is there a decorator to skip generating the documentation for those functions in the API guide? Or can we link to the documentation of the good functions to avoid duplicating the docstrings?

How spotting a bug is easier if the name of a function change from DiagonalizeSupercell to diagonalize_supercell or by changing from 3 to 2 the number of white lines after the end of a function is a mystery to me. However, I agree that, if the naming convention is mixed, you need to guess which function you have to use while programming and using the API, which is bad.

PEP8 will do nothing indeed, I was referring to the introduction of linters and formatters. Overall one can get quite tired of scrolling many many lines, or lines with very different lengths; in a way then it prevents a bit to make mistakes. For sure it won't make easier per se, while only tests are a robust way in this sense.

Probably it is good to add auto-linting for future commits. However, have a look at the new import In the Ensemble module for the Julia code which is quite complex. I made it so it should be as flawless as possible for the user (the code tries to install automatically the Julia dependencies if not present in the system). This is also true for not mandatory imports (some libraries are conditionally imported enabling some extra functionality of the code), which sometimes have a complex syntax not standard for PEP8.

bastonero commented 11 months ago

True, this also avoids any overhead. Is there a decorator to skip generating the documentation for those functions in the API guide? Or can we link to the documentation of the good functions to avoid duplicating the docstrings?

Good question, I believe so? I can have a look into this, but for sure there will be some smart way to do so. Possibly, also every method with that decorator generating a "Deprecated" section in the documentation, or something similar.

We can also take the occasion to:

  1. Write uniform docstrings
  2. Add/improve the typing (I think in the future it will be relevant)

If you open here a dedicated branch I can open a draft PR that pushes commits to that branch, so it will be easier afterwards to merge into the main one. I guess you will need to force push into the main one if we also rename the module files/folders. And then you can try testing it as well. We might to see if everything is working with just one module to begin with, and see what it does. What do you think?

We can start with python-sscha and then move to cellconstructor. I had a look into cellconstructor and it seems a bit more work is needed there. Moreover, python-sscha is not a cellconstructor dependency, so I think better to start from here.

mesonepigreco commented 11 months ago

I would do it progressively. First, I would not rename the file names for the modules, as this is going to be painful for the merge.

On docstring and typing, I agree. This is a priority as it is part of the documentation and user experience. We need also to choose if to adhere to the RST format or Markdown. I believe markdown is more widely supported, but most, if not all, of the docstrings are written in RST.

The new functions I wrote have the type linting. However, what is the first Python version that employed listing? Do we need an explicit future import in older version of Python? We need support from Python 3.6 to above. In principle, I think the code still supports Python 2, however, I have not checked the compatibility, and the test suite is not running anymore for Python 2, so it is possible that we already accidentally broke the backward compatibility to that (and at this point probably it is pretty safe to assume python 3 is the standard).

At a certain point, the test suite was running for the versions of Python 2.8, 3.6 to 3.10. However, I reached the computational limits on my GitHub account, so I had to limit the number of versions for the automated test suite run.

Anyway, I have followed a convention since CellConstructor and python-sscha are significantly linked; minor versions of the two libraries should always match to ensure compatibility. So I will release a 1.4 of CellConstructor contemporary to the 1.4 of python-sscha even if minimal changes occurred to cellconstructor. The reason is that sometimes for needs of speedup, we must add an extra parameter to one of the cellconstructor calls, so python-sscha 1.4 needs an updated version of cellconstructor. To make life easier, instead of keeping track of which version of sscha is compatible with which version of cellconstructor, I adopted the conversion that the minor number should match (this is eventually easy to spot and warn or alert the user to update) Therefore, the update in a minor version must co-occur for both.

eli-schwartz commented 11 months ago

The new functions I wrote have the type linting. However, what is the first Python version that employed listing? Do we need an explicit future import in older version of Python? We need support from Python 3.6 to above.

The typing module is available since python 3.5, however if you want to use from __future__ import annotations to make those annotations into almost-zero runtime overhead stringized annotations then you will need python 3.7.

bastonero commented 11 months ago

For testing multiple versions of python one can use (locally) tox. But indeed, I think python 2 is quite deprecated anywhere now.

Is there any particular concern on python 3.6 as a lower limit? Probably it would then be good to exploit the 3.7 feature?