apache / buildstream

BuildStream, the software integration tool
https://buildstream.build/
Apache License 2.0
84 stars 28 forks source link

Reconsider documentation style for private symbols #1321

Open BuildStream-Migration-Bot opened 3 years ago

BuildStream-Migration-Bot commented 3 years ago

See original issue on GitLab In GitLab by [Gitlab user @cs-shadow] on May 26, 2020, 21:47

Background

We currently do not document API private classes and functions using the standard docstrings, but in a special comment-based style. This means that they are unusable by IDEs or even the standard help() in a REPL (both of which I find quite useful personally).

Since Sphinx already does not include private symbols by default, I'm not sure if this buys us anything. Their documentation says that that the members option:

will document all non-private member functions and properties (that is, those whose name doesn’t start with _).

Isn't this exactly what we want? Or am I missing something obvious here?

I noticed this most recently in !1918, where there are proper docstrings for functions like buildstream.utils._make_protobuf_timestamp, and they do not appear in the output. You can preview the output at: https://buildstream.gitlab.io/-/buildstream/-/jobs/568430551/artifacts/public/buildstream.utils.html.

In my opinion, standard docstrings would help with the productivity of folks hacking on BuildStream, and I don't think we should underestimate the importance of that.

Curious to hear what others think about this.

Possible solution

I propose to phase out this comment style documentation in favor of standard docstrings. On a related note, I also think we can remove this custom logic from our makefile, since that's what Sphinx does by default anyway.

Implementation details

If we do decide that it's a good idea to do this, then there's also the matter of how to do the transition.

If we can automate this somehow, then ideally we'd run this as a one-off script. Everyone with exisiting branches will have to bite the bullet once, and do some potentially tedious rebases. But, if the script is idempotent then people could just run that script on their branches first, which would makes things much easier.

If the automated options seems unpractical, or isn't ready yet, I would propose a more incremental approach. A policy like "clean-up what you touch" might help do this gradually.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @cs-shadow] on May 26, 2020, 21:58

mentioned in merge request !1918

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @tristanvb] on May 27, 2020, 11:29

Originally, I had avoided any docstrings as they add to our memory footprint for what appears to be absolutely no reason.

One might argue that you can use the python command line and introspect things; but:

I do feel that there is value to keeping things the way they are, even if memory footprint is not a concern, the main value I see is that we very obviously treat public, plugin facing API differently from core API which is allowed to churn from release to release; this distinction of internal and external APIs I think is extremely valuable.

If we change this, we might want to be much more strict about ordering of APIs in classes and modules, ensuring that if we do a migration (which I would hope would be atomic and not incremental), we have a very clear picture of what is public and what is internal.

On a related note, I also think we can remove this custom logic from our makefile, since that's what Sphinx does by default anyway.

Actually it does not really.

As it stands, sphinx will document any public method in the documentation for a plugin; including the plugin implementations of abstract methods for instance; those abstract methods need to be documented on Plugin, Source and Element respectively, we should not redundantly document them in plugins (or bloat the documentation with empty function declarations there).

Further I'd like to clarify that the only public API surface of any plugin is it's YAML configuration surface.

This is why we usually don't bother to use _ to mark private instance variables for instance in plugins, because a plugin's python implementation details are all inherently private (they are like their own individual private little python programs, they can rest assured they live all alone in their own little namespace).

This has unfortunately not always been well understood, and bad things like https://gitlab.com/BuildStream/bst-plugins-experimental/-/issues/2 have happened as a result, where plugins have made incorrect assumptions that for instance; a git plugin on a source of one of a given element's dependencies is actually the plugin it expects; but there is no guarantee at all that a plugin named git is plugin you think it is, or will respond to an API you think it might.

Exposing any python API in plugin documentation I think would have the unfortunate result of making it even more likely for projects to think that it is okay to access plugin specific APIs: The only way to interact with a plugin in python is by using core defined APIs which those plugins may implement in their own way.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @cs-shadow] on May 27, 2020, 19:18

[Gitlab user @tristanvb] thanks for reading, some responses inline :)

  • BuildStream is not a library which gets important and used in this way

I'm talking about BuildStream developers here, not necessarily end users. As a developer, I absolutely find the docstrings useful where they are present (which is not a lot, but still).

  • Even if it were, I don't think we'd want to expose private docstrings

I'm not sure if I understand the downsides of having a docstring? Surely, if someone wants to look at the source, they can just read the comments as well. I definitely understand the desire for the private docstrings to be not inlcuded in the API reference, but I'm not sure how it hurts to have them.

I do feel that there is value to keeping things the way they are, even if memory footprint is not a concern, the main value I see is that we very obviously treat public, plugin facing API differently from core API which is allowed to churn from release to release; this distinction of internal and external APIs I think is extremely valuable.

I agree that we need to keep the public and private API separate, but why is the nomenclature of the symbols not enough? i.e. anything that starts with _ is considered private. I'm not sure why we also need to make the docstrings different.

Actually it does not really.

As it stands, sphinx will document any public method in the documentation for a plugin; including the plugin implementations of abstract methods for instance; those abstract methods need to be documented on Plugin, Source and Element respectively, we should not redundantly document them in plugins (or bloat the documentation with empty function declarations there).

Ok, fair, plugins are special I suppose where we don't want to document the actual classes. So, the makefile snippet will probably stay either way.

Exposing any python API in plugin documentation I think would have the unfortunate result of making it even more likely for projects to think that it is okay to access plugin specific APIs: The only way to interact with a plugin in python is by using core defined APIs which those plugins may implement in their own way.

By exposing, do you mean just having docstrings or having the docstrings rendered in the API reference? To be clear, I'm not saying we should expose all of this in the API reference. I'm just advocating to not prohibit docstrings. None of this would be visible in the API reference.

For example, take this bit of code from the utils module - https://gitlab.com/BuildStream/buildstream/-/blob/master/src/buildstream/utils.py#L128-154. It has proper docstrings for some of the private methods (probably by accident), and it doesn't appear anywhere in https://docs.buildstream.build/master/buildstream.utils.html.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @tristanvb] on May 27, 2020, 19:45

I'm talking about BuildStream developers here, not necessarily end users. As a developer, I absolutely find the docstrings useful where they are present (which is not a lot, but still).

...

I agree that we need to keep the public and private API separate, but why is the nomenclature of the symbols not enough? i.e. anything that starts with _ is considered private. I'm not sure why we also need to make the docstrings different.

Because in BuildStream, I think things are complex enough that reading the code, there are many parts which feel public as they don't need to start with an _, we try to avoid _ when we're already deep in the internals, where we use _ mostly to differentiate "what is private to this module" rather than "what is private to BuildStream".

I feel the docstring/comment distinction has real value to treat these API categories very differently, and I would hope that if we changed this, we would have other means to make it astoundingly clear that public APIs which "must never change at all but only be very carefully extended" are very very different from internally public symbols, right now the docstrings I think make it very clear.

In your previous quoted comment above, you say that you find the docstrings helpful as a developer... but not a lot... so my question is; do you think it is worth the tradeoff ?

By exposing, do you mean just having docstrings or having the docstrings rendered in the API reference?

No no, here I was trying to respond to your comment about dispensing with the custom Makefile logic which treats plugins specially, hiding any python API from user facing plugin API... this comment was not intended to be related to the docstrings and is orthogonal.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @cs-shadow] on May 27, 2020, 20:54

Because in BuildStream, I think things are complex enough that reading the code, there are many parts which feel public as they don't need to start with an _, we try to avoid _ when we're already deep in the internals, where we use _ mostly to differentiate "what is private to this module" rather than "what is private to BuildStream".

I feel the docstring/comment distinction has real value to treat these API categories very differently, and I would hope that if we changed this, we would have other means to make it astoundingly clear that public APIs which "must never change at all but only be very carefully extended" are very very different from internally public symbols, right now the docstrings I think make it very clear.

I'm not sure if I am understanding this correctly. Are you saying that we have private symbols where none of their parts begin with an underscore?

As far as I can tell, for each private symbol, either the module name begins with _ or the class/function name. For example, the full name of Stream would be _stream.Stream and hence won't be included in the reference documentation by default.

At least in my mind, the distinction between public and private API was always based on whether any parts of its full name begin with an _, i.e.:

I just assumed this was the case while starting out on BuildStream, but maybe we think this distinction is not clear enough?

No no, here I was trying to respond to your comment about dispensing with the custom Makefile logic which treats plugins specially, hiding any python API from user facing plugin API... this comment was not intended to be related to the docstrings and is orthogonal.

Thanks for clarifying.

I think we are getting close to the bottom of this, please bear with me :)

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @tristanvb] on May 28, 2020, 06:42

I'm not sure if I am understanding this correctly. Are you saying that we have private symbols where none of their parts begin with an underscore?

Not technically, but certainly apparently unless you pay close attention.

You make an example of _stream.Stream() in the toplevel directory, but there is also loader.Loader() in the _loader/ subdirectory, or queue.Queue() in the _scheduler/queues/ subdirectory for instance.

Granted, I'm quite intentionally not framing these things in the correct pedantic terms, and the above symbol name would really be _scheduler.queues.queue.Queue, but I think this is quite a subtle thing that is easy enough to lose track of while tunneling into writing code.

I just assumed this was the case while starting out on BuildStream, but maybe we think this distinction is not clear enough?

That is exactly my sentiment, it is nowhere near clear enough.

The subtlety of checking for a leading _ in a directory or parent directory or even filename of the class you happen to be editing right now is nowhere near as clear with regards to what is really public API...

With the docstring/comment distinction on the other hand, we remind the reader on a constant basis what is public vs internal, by dressing up public API like royalty (and I do consider the public API royalty in comparison, as it will become etched in stone so to speak for hopefully many, many happy backwards compatible years).

Of course; the audience of the code we write now is the developers who pickup the project many years later, I think we need to take every measure now to ensure that API doesn't accidentally break years after we have left the project.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @juergbi] on May 28, 2020, 09:36

Could it make sense to use docstrings for all symbols but insert a word (maybe INTERNAL or PRIVATE) or sentence at the top of the docstring for non-public symbols to make it clear to the reader? I think that would be clearer than using different documentation styles and still allow IDEs/help() to work. Maybe there is already a convention for this but I have no idea.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @BenjaminSchubert] on May 28, 2020, 09:36

Originally, I had avoided any docstrings as they add to our memory footprint for what appears to be absolutely no reason.

While I am worried about memory consumption, I think that if that is our memory hog, I'll be happy to live with it. In comparison to what the rest of BuildStream usually uses in terms of memory, this is not even .1%.

I also understand the advantage of having a separate style for documenting both public and private symbols in order to ensure the public API is treated differently.

However, I think that if we rely on the documentation style to guard us from breaking APIs, we are not doing the right thing, and I would rather see more strict unittests around the public APIs (even with enough mocking if needed) in order to ensure we don't break them unintentionally.

There is also another advantage of having real docstrings for everything there that I didn't see mentionned: we could start using a linter to ensure all our code is documented correctly with the correct parameters and such. It often happens that a parameter definition in the docstring is not in sync anymore because someone changed the code without updating the docstring, and we could remove that strain from the reviews by automating a lot of this.

And as Chandan said, I think one invaluable part of having normal docstrings everywhere is autocompletion and documentation in the IDEs for those who use one, as they would get much faster insight on what the method they are calling expects.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @BenjaminSchubert] on May 28, 2020, 09:38

I like this idea a lot, and we might even be able to automate a check that ensures all our Public/Private API is documented accordingly

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @tristanvb] on May 28, 2020, 11:05

Another idea which springs to mind...

Consider the symbol lists which are typical to cross platform C libraries (I think they are an artifact of days where not every compiler supported a hint in the source code declaring if a symbol should be exported, I think msvc was a culprit of this).

This creates a situation where adding a public API symbol to the code would not export the symbol on the shared library's symbol table, unless the symbol was listed in a file (which typically lists all symbols alphabetically and is human maintained, which the compiler uses when creating the shared library).

This annoying bureaucracy of maintaining the symbols file in itself helps us to be more responsible.

Maybe if we had such a thing, which documented all public symbols in a list, as well as their calling signatures, we could have a linter fail if ever the python code was modified in such a way where existing symbols mismatch signatures in the database, or if new symbols were added/removed without updating the database (simple text file).

Anyway, I am definitely in favor of leveraging linters in ways which help us ensure compatibility going forward - and I would be happy to drop the docstring/comment approach to differentiating public/private; I just think we need to replace it with something (and a linting approach sounds much better I agree).

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @cs-shadow] on May 29, 2020, 17:33

To summarize, it seems like we've narrowed down exactly why we have this distinction. Which is primarily to separate public and private APIs as much as possible.

So, a proposal for changing these should also come with an equally good, if not better, mechanism to keep this separation. (We may not have this fully thought through just yet, but this doesn't sound inconceivable.)

And there's not any technical limitations on Sphinx side as far as I can tell. With the exception of managing plugin documentations in a slightly custom manner as we already do.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @tristanvb] on May 29, 2020, 17:42

And there's not any technical limitations on Sphinx side as far as I can tell. With the exception of managing plugin documentations in a slightly custom manner as we already do.

For this part, I'm hoping to replace the Makefile fragment with a simple sphinx plugin to ship with BuildStream so that plugin authors can use this to generate consistently formatted plugin documentation.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @cs-shadow] on May 29, 2020, 17:48

That'll be pretty neat. For example, in some of our internal plugin repositories, we didn't want to involve Make and just call Sphinx directly. So, we currently maintain a .rst file per plugin, each of which has just one line saying :automodule: plugin-name.

If this was a sphinx plugin however, we could use it out of the box.