biopython / biopython

Official git repository for Biopython (originally converted from CVS)
http://biopython.org/
Other
4.31k stars 1.74k forks source link

Adopt code formatting tool black for code style? #2008

Closed peterjc closed 4 years ago

peterjc commented 5 years ago

(Spin out from discussion with @MarkusPiotrowski on #2005)

The Python community has standard coding style conventions described PEP8 and PEP257 which the Biopython project tries to follow (although many of our historic modules, class, and function names etc break this, here we are constrained by backwards compatibility).

https://www.python.org/dev/peps/pep-0008/ https://www.python.org/dev/peps/pep-0257/

We have gradually been fixing and then enforcing various PEP8 and PEP257 style violations using the flake8 command line tool with various plugins. Some of these style changes are a little tedious to perform by hand, and autopep8 only handles some of them.

An alternative is a relatively new Python code formatting tool, black "The uncompromising Python code formatter":

https://github.com/ambv/black

The name is a reference to Henry Ford,

https://en.wikipedia.org/wiki/Any_Colour_You_Like

I have been using this on some other projects, and wrote a flake8 plugin for it:

https://github.com/peterjc/flake8-black

This is easier as a coder - just integrate black into your editor setup, and/or your git pre-commit hook, plus the continuous integration configuration.

This is also easier as a project maintainer or contributor - rather than explaining how to fix a PEP8 style warning, just run black. Indeed, it should be possible to have a bot monitoring pull requests and doing this automatically.

While we as a group are unlikely to unanimously agree with all the style choices embodied by this tool (indeed, I don't like all of them myself), it is enormously liberating to stop worrying about where to put white space and line breaks and just delegate this to the tool.

Note that there are rare occasions where we have deliberate line breaks, and in those few corner cases there are special comment lines to turn off black for a single file, or a code block. For example, the dictionaries in Bio/Data/CodonTable.py which have four codons per line.

We could agree to apply black on a module level, but we would get most benefit out of the tool if we simply apply it to the entire codebase.

I'm opening this issue for a discussion - do people think using black would be a good idea, don't really care - or dislike the style it uses strongly enough to object?

MarkusPiotrowski commented 5 years ago

Although I strongly dislike how black is doing line continuations (not using indentation on parenthesis, and putting each element on a single line, including the closing parenthesis), I have to admit that using black would really make comitting much easier. How often do we see PRs that have to be corrected several times because of style violations? Obviously, people don't like pre-commit hooks too much... black would take over much of this work and would take much work from @peterjc 's shoulders.

I vote four applying black on the entire codebase (step-by-step, e.g., for each new commit).

peterjc commented 5 years ago

@MarkusPiotrowski Could you clarify what you had in mind for a step-by-step adoption?

I was thinking doing it all at once would be simplest:

  1. See which files or bits of code need marking to be excluded (e.g. Bio/Data/CodonTable.py) and update those files with the special comment lines (if we miss some, we can do this later too)
  2. Run black on the entire code base and commit this (either as a single big commit, or perhaps one commit per folder), with the authorship set to something like black@example.org to avoid misleading user contribution statistics.
  3. Add black --check or my flake8 plugin in TravisCI and AppVeyor
  4. Later explore if there is a good bot we can use to run black on pull requests

There will be a few other corner cases to sort out, like the automatically generated code for Bio.Restriction, but even there we can easily run the self-update and then run black, so it isn't a big issue.

peterjc commented 5 years ago

Actually, having looked at the black issue tracker, there is a risk we could run into some annoying corner cases - so perhaps starting with applying black on a module by module level is more prudent.

Also, we would we wise to pin the version of black we're using, and only periodically update that (with any associated code changes).

@mdehoon What are your thoughts? Note that there is a possible compromise in that some files could opt-out by adding the # fmt: off line near the start.

mdehoon commented 5 years ago

@peterjc Sorry, no thoughts here

MarkusPiotrowski commented 5 years ago

My idea was that we start using it only for new commits and collect some experience with it over time. I mean, the most important thing about black is not that our complete codebase is 'blackened' but that each commit will automatically be PEP8-confirm? Even if our codebase is completely 'black-ish', this wouldn't prevent users from submitting style-violating code.

I thought it is possible to run black as a sort of pre-commit hook on Github so that it automatically converts pull requests in the process of submitting. Turned out that this is not possibly yet: https://github.com/pre-commit/pre-commit.github.io/issues/211 https://github.com/Mariatta/black_out

So maybe we should enforce a little bit more using pre-commit hooks by our contributors (will open a separate issue for this).

At the moment, I don't see too many advantages if our codebase is black-ened, but contributors can still submit style-violating code. In worst case, they may revert changes that have been introduced by black (because they look so ugly ;-) ).

peterjc commented 5 years ago

My idea was that we start using it only for new commits and collect some experience with it over time. I mean, the most important thing about black is not that our complete codebase is 'blackened' but that each commit will automatically be PEP8-confirm? Even if our codebase is completely 'black-ish', this wouldn't prevent users from submitting style-violating code.

Black works at file level - you can't (as far as I know) even restrict the style changes to a single function/method, for example the bits being updated in a pull request.

Consider the extreme example of someone fixing a typo in a docstring, then makes a small pull request - do we run black on the entire file at that point?

As an alternative, how about we start by looking at the impact of running black on the Tests/ folder? This is good in several ways - right now we have spent less effort on the code style here (the Bio/ and BioSQL/ folders are far better in terms of flake8 compliance), and also the changes are not going to be as disruptive to any open pull requests or enhancements people are working on right now.

I thought it is possible to run black as a sort of pre-commit hook on Github so that it automatically converts pull requests in the process of submitting. Turned out that this is not possibly yet: pre-commit/pre-commit.github.io#211 https://github.com/Mariatta/black_out

Yes, I'd seen those prototypes - right now I don't think there is a ready to use way to automatically run black on pull requests. Perhaps that will be the point at which this becomes most attractive?

So maybe we should enforce a little bit more using pre-commit hooks by our contributors (will open a separate issue for this).

Worth encouraging. Note our current folder level .flake8 settings are a handicap here - a bonus of running black on the entire code should mean we can have a single top level .flake8 setting for the entire project, and then can just recommend installing the default flake pre-commit hook. See #1188.

At the moment, I don't see too many advantages if our codebase is black-ened, but contributors can still submit style-violating code. In worst case, they may revert changes that have been introduced by black (because they look so ugly ;-) ).

Yes - that's why as soon as the entire code base is black compliant, we should enforce this via TravisCI and AppVeyor for continuous integration and all new pull requests.

(And we could do this at folder level too, e.g. if we start with making Tests/ follow the black style, we can also start enforcing that)

MarkusPiotrowski commented 5 years ago

My idea was that we start using it only for new commits and collect some experience with it over time. I mean, the most important thing about black is not that our complete codebase is 'blackened' but that each commit will automatically be PEP8-confirm? Even if our codebase is completely 'black-ish', this wouldn't prevent users from submitting style-violating code.

Black works at file level - you can't (as far as I know) even restrict the style changes to a single function/method, for example the bits being updated in a pull request.

Consider the extreme example of someone fixing a typo in a docstring, then makes a small pull request - do we run black on the entire file at that point?

Yes, that was the idea, however, this was when I thought that black would automatically re-write the file during the PR, which is not the case.

peterjc commented 5 years ago

I opened pull request #2079 (corrected) applying black to the GenomeDiagram code, picked since the author @widdowquinn is a black convert, and there are no active changes pending here (IIRC). While I'd be happy to merge this, this is more intended as good example of the kind of things black would do to the Biopython code base.

One specific pain-point on this module was black discarded lots of comment alignment done with white space.

peterjc commented 5 years ago

The concrete example of the changes black would make on #2029 have spurred some good discussion. It seems not everyone is keen on this.

I did consider a possible compromise which would let us apply the changes gradually:

Individual module authors/maintainers can apply black formatting to their files if they wish (reverting any minor issues to ensure flake8 passes with the project wide settings - currently only expected to apply to a false positive in E203).

If over time a large proportion of the code base started to follow the black style, we could look at selectively enforcing this on those modules, and ultimately perhaps apply it everywhere.

However, in the short/medium term this would lead to very different visual style across the project, which would surely be confusing - especially for new contributors.

I'd much rather apply black everywhere - and am content to come back to this discussion in say six months time?

widdowquinn commented 5 years ago

I wonder if the (any colour as long as it's) black approach works best for smaller codebases, nearer the start of their life, and with less legacy code? For larger projects like this, maybe there needs to a lot of buy-in from developers, and a fair amount of person-time dedicated to managing the changes, unless black is used from the get-go?

As a user (but not an evangelist) I like black. I think it makes my code look better, and where it introduces weird-looking or less readable layouts, prompts me to write better code. This is all fine for my smaller projects, and while I'm developing things that are new.

But when there's a codebase as large as Biopython's, with the potential to make sections of code less friendly, unilateral introduction could be a bad idea, and discouraging of community participation. I wonder if this would need a concerted, and possibly painful, sprint-type adoption with many eyes on the code, if it were to be applied all at once? Maybe a creeping adoption, module-by-module, is the most practical way to introduce it?

peterjc commented 5 years ago

If that is the consensus, we could make this a goal at the upcoming BOSC CoFest 2019, which also welcomes remote participants. @JoaoRodrigues is attending as am I, and a couple more people who mentioned Biopython - so we'd be able to have eyes on the code. https://www.open-bio.org/events/bosc/collaborationfest/

peterjc commented 5 years ago

A group of about five or six people discussed this on the second day at the BOSC2019 CoFest, including @sbliven and @hmenager (who else? @jgreener64 and @JoaoRodrigues were you in that discussion?), and went over the GenomeDiagram example on PR #2029.

I would summarise this most of us disliked something in the black style (often the tendency to use more lines than really necessary), but saw greater value in delegating the style convention to an external tool (and its authors), and making it much easier to bring contributions into compliance (just run the tool).

We agreed that the changes would need reviewing (mainly for comment placement, since we do not expect any functional changes), and that multiple pull request doing this for top level folders seemed reasonable. We could then start enforcing this via git pre-commit and continuous integration via our flake8 settings.

We agreed that future enhancements using GitHub automation would be welcome, but not required at this point (e.g. automatically run black on new pull requests?).

Update: To be clear, "we" in the above is my summary of the in person discussion. Not necessarily what "we" the Biopython community as a whole will agree to do.

peterjc commented 5 years ago

One of the changes black makes is to preferentially use double quotes for strings, which we can do on its own with a tool like unify, https://github.com/myint/unify

Note it doesn't support folders, so you can do it in stages, e.g.

$ unify --quote \" --in-place Bio/*.py BioSQL/*.py Tests/*.py
$ unify --quote \" --in-place Bio/*/*.py
...

It seems helpful to apply those changes first, as this would make the changes from black smaller and thus easier to review?

peterjc commented 5 years ago

As a step towards adopting black, I've made a proof of principle pull request doing just the double quote changes - it turned out to be quite monster! See #2192

This would in particular affect code from @mdehoon @brandoninvergo @bow @lijax (and others), who evidently once or still prefer single quotes. What do you think (of the quotes, but also what black would do to your code)?

brandoninvergo commented 5 years ago

@peterjc: I still have the completely irrelevant and inadvisable habit of putting single-quotes around single characters and double-quotes around longer strings...like C char vs char*. I'm probably even inconsistent about it. So, no qualms from me about the changes. Overall, I like what I see.

peterjc commented 5 years ago

Copying Tiago's comment from #2001

On an aestehic level, I would prefer the opposite direction.

But the important point is to have a convention and stick with it. If double quotes, then double quotes it is.

Points in favour of single quotes: Python's repr(...) uses them for strings. The official Python tutorial seems to prefer them https://docs.python.org/3/tutorial/

Points in favour of double quotes: Black uses this. It works for copy-and-paste from our PDF tutorial. It seems to be more common in our code base.

These grep searches can surely be improved, but if indicative the split is less dramatic than I expected:

$ cat Bio/*.py Bio/*/*.py Bio/*/*/*.py BioSQL/*.py | grep -c  '\".*\"'
30177
$ cat Bio/*.py Bio/*/*.py Bio/*/*/*.py BioSQL/*.py | grep -c  "\'.*\'"
19756
MarkusPiotrowski commented 5 years ago

Points in favour of single quotes: Python's repr(...) uses them for strings. The official Python tutorial seems to prefer them https://docs.python.org/3/tutorial/

That's what I don't like about Black, it's style looks very different from the code examples in the Python documentation...(including PEP8). Whereby quotation marks are still the lesser evil.

Another point in favour of double quotes: You can use the single quote within a string without escaping or changing the outer quotes.

peterjc commented 5 years ago

Personally I like the double quotes (probably because I learnt C earlier).

Note both black and unify allow switching quote style when you need to include the other quote without slash escaping it, so I don't see this as a point in favour of one or the other.

https://black.readthedocs.io/en/stable/the_black_code_style.html#strings

Interestingly black currently has two configurable settings (aside from Python version), the line length, and --skip-string-normalization to not touch the quotes (intended for legacy code bases only).

@MarkusPiotrowski would you be happier with us adopting black --skip-string-normalization (and leaving the current mixture of single and double quotes for another time)?

MarkusPiotrowski commented 5 years ago

I meant that it is more likely to have an apostrophe in a string and then, if you prefer single quotes, you have to escape it (looks ugly) or change the quotes.

Anyhow, I totally agree with the overall aim to get the style of the codebase more uniform and I don't have problems with using double quotes.

Personally I like the double quotes (probably because I learnt C earlier).

Funny, I like single quotes because I learnt Basic and Java earlier and Python was different...

widdowquinn commented 5 years ago

Note both black and unify allow switching quote style when you need to include the other quote without slash escaping it, so I don't see this as a point in favour of one or the other.

I've come across this in the context of regular expressions, where I've wanted to match a double-quote and the simplest option is to use single-quotes for the string. black doesn't complain or modify the code.

sbliven commented 5 years ago

Let's not get too bogged down in any particular style choice here. The central point as I see it is whether the conformity represented by black (both conformity within BioPython and conformity to the trends in the larger python community) is worth the inevitable places where manual formatting is superior to automatic formatting.

After the discussions at CoFest I support moving to black.

+1

peterjc commented 5 years ago

If there are no objections (e.g. @MarkusPiotrowski & @mdehoon?), I propose we start running black on a module by module level, with pull requests for human review - particularly for things like comment placement, and in rare cases adding the special comment lines to disable black for a region.

(Also we should start ignoring flake8 violation E203 pending a fix for https://github.com/PyCQA/pycodestyle/issues/373 to avoid false positives)

Once we've done the whole code base, we can start enforcing this with the continuous integration checks.

After that, perhaps there will be a good automation option available to help with automatically formatting pull requests.

peterjc commented 5 years ago

OK, requesting any objections by the end of August 2019, here or on the mailing list: https://mailman.open-bio.org/pipermail/biopython/2019-August/016661.html

svalqui commented 5 years ago

Is there any agreement in the max line length? is it going to be left as default 80?

blaiseli commented 5 years ago

I would just like to point out that black does not handle certain types of comments well: https://github.com/psf/black/issues/379#issuecomment-400958330

It is sometimes useful to be able to comment individual elements in a list/tuple/dict/..., and putting the items on separate lines makes this much easier. As far as I'm aware of, black seems to mess with this kind of commenting technique.

Also, writing container items on separate lines makes code diff easier to read (and maybe even simpler to handle by git). I'm not sure the kind of code "reflowing" shown in the linked issue is harmless in this regard.

peterjc commented 5 years ago

Yes, one of the key things we'll be looking out for in the manual reviews is comment placement. The rare case of commenting parts of data structures may justify temporarily turning off the black formatting with the special comment lines.

peterjc commented 5 years ago

I thought I replied about line length: I am assuming we'd use the black default of 88.

Currently we don't enforce E501 with flake8 as there are too many historic long lines - after running black would be a good time to review that. The flake8 max line length setting could be set higher, e.g. 120, if that was seen as sensible.

sbliven commented 5 years ago

Regarding the line length, here's some data!

Cumulative distribution of line lengths over all python files: linelengthscumulative

Table with some cut points we've been discussing:

Range Lines Percentage
(-1, 0] 32505 12.3
(0, 80] 212528 92.9
(80, 88] 7003 95.6
(88, 120] 7024 98.2
(120, 82749] 4627 100
Total 263687

I'm a little impressed that test_SeqIO_AbiIO.py has a line with 82'749 characters! I guess black will reformat that into an array of 10'000 lines, but to me that makes it a good example of black highlighting a flaw in our code design.

Gist with the analysis script


For my 2¢, I'd stick with the black default.

peterjc commented 5 years ago

Can you re-run the line lengths for just Bio/ and BioSQL/ please? That is probably more informative. If it would be painless to do this before and after running black, even better - that will highlight any long strings which black currently does not break up.

To be fair, the tests do have a lot of constants for expected values - but now that you've highlighted it, this case is excessive. It didn't look so horrible when I reviewed in GitHub, but most of those could be changed to check just the start of the data easily enough?

https://github.com/biopython/biopython/blob/biopython-174/Tests/test_SeqIO_AbiIO.py#L347

        self.assertEqual(record.annotations["abif_raw"]['DATA2'], (-22, -2, 6, 0, -3, -11, -10, -9, -17, -11, ...))
        self.assertEqual(record.annotations["abif_raw"]['DATA3'], (16, -8, 7, -21, 5, 0, 12, 4, -7, 6, -7,...))
        self.assertEqual(record.annotations["abif_raw"]['DATA4'], (-7, 4, -13, 8, -7, -1, -6, -6, -1, -6, ...))
sbliven commented 5 years ago

Here's everything split apart. I forgot about BioSQL in the last comment, so that was just including Bio and Test.

Bio

Range Lines Percentage
(-1, 0] 23173 15.6
(0, 80] 122418 97.7
(80, 88] 1909 99
(88, 120] 1219 99.8
(120, 2331] 250 100
total: 148969

BioSQL

Range Lines Percentage
(-1, 0] 444 15.1
(0, 80] 2467 99.2
(80, 88] 22 100
(88, 92] 1 100
total: 2934

Tests

Range Lines Percentage
(-1, 0] 9332 8.1
(0, 80] 90110 86.7
(80, 88] 5094 91.1
(88, 120] 5805 96.2
(120, 82749] 4377 100
total: 114718
sbliven commented 4 years ago

Have we reached a concensus on black? If so I'd be happy to do a PR for Bio.Pdb alá #2251.

peterjc commented 4 years ago

There have been no objections, so if you want to give a quick look at #2251, we can merge that, and then start working on individual modules.

peterjc commented 4 years ago

@sbliven #2251 has been merged, you want to do PDB now?

peterjc commented 4 years ago

For anyone wanting to help: Please try to use separate commits for the black automatic changes, and your own fixes (could be commits before running black like moving comments, or afterwards like changing break points in string literals).

e.g. https://github.com/biopython/biopython/pull/2264#issuecomment-530958184

sbliven commented 4 years ago

There's now a github bot in development called black_out. It will automatically run black on any PR that gets tagged 'black out'.

hwalinga commented 4 years ago

I would advise to use black as it is intended: That is only apply black on whole files and not work with files that are partially black formatted and partially not.

However, if there is ever a situation you just want to format a part of the file, there is https://github.com/wbolster/black-macchiato Just sharing my knowledge.

svalqui commented 4 years ago

I'll work on Bio.SearchIO.

svalqui commented 4 years ago

I'll work on Bio.Phylo

peterjc commented 4 years ago

One of the most common things requiring manual changes after running black is fixing implicit string concatenation within a line of code, see https://github.com/psf/black/issues/26

On #2359 in addition to running black on BioSQL/ and fixing the existing concatenation problems (both from black and pre-existing ones), I propose catching new concatenation problems from black by using a new flake8 plugin: https://github.com/keisheiled/flake8-implicit-str-concat

peterjc commented 4 years ago

For anyone working on this issue, I strongly recommend you install flake8-implicit-str-concat locally and run flake8 after black to catch any string literal problems - the most common thing we've had to hand-correct with the automated reformatting. As of #2359 this is checked on TravisCI.

peterjc commented 4 years ago

The black updates on the application wrappers have drawn my eye to the problematic use of triple-quote strings, something that the black reformatting made more obvious - see #2369

peterjc commented 4 years ago

Assuming pull request #2373 is merged, we will have applied black to all of BioSQL/, Scripts/ and Doc/examples/ and will start running the black check on those folders as part of the TravisCI checks.

Implementation wise, this uses flake8-black with the flake8 per-folder ignore settings to not require this on Bio/ and Tests/ (yet).

peterjc commented 4 years ago

I think we're over half-way now in terms of the files under Bio/ - which I find encouraging.

peterjc commented 4 years ago

With Emboss on #2470 and lots of miscellaneous files on #2498, the largest single easy group left is Bio.codonalign - any volunteers?

This will leave some auto-generated files and data heavy files which will need more human input.

Then finally, perhaps the hardest to review as it will be a lot of changes, everything under Tests/ - which might deserve its own issue as this one is getting rather long.

peterjc commented 4 years ago

Ah, already had #2467 for codonalign...

peterjc commented 4 years ago

I have opened issue #2552 for Tests/.

I think the currently open pull requests will take care of the rest of Bio/ at which point we can stop ignoring the flake8 black code for that directory, add something to the NEWS.rst file, and update the wording in https://github.com/biopython/biopython/blob/master/CONTRIBUTING.rst

peterjc commented 4 years ago

Thank you everyone - all of Biopython except the tests folder now follows the black style! 🎊

peterjc commented 4 years ago

I was reviewing GitHub actions for automatically applying black to pull requests (where typically the submitter ticks the box allowing edits from the maintainer, so this should be possible).

https://github.com/cclauss/autoblack https://github.com/lgeiger/black-action/pull/2 https://github.community/t5/GitHub-Actions/Can-t-push-to-forked-repository-on-the-original-repository-s/m-p/35916/highlight/true#M2372

It seems GitHub's security model block this between forks (and that's what we would need in Biopython). This kind of GitHub Action could still work nicely on a more literally shared repository.

This may mean we would instead have to use a more old fashioned bot, and give the bot sufficient write access to enable it to edit incoming pull request branches.

A helper script which Biopython maintainers can run locally as part of reviewing a pull request might be a low-tech solution (fetch the fork, checkout the branch, apply black, push the updated branch, updating the pull request).