Closed by EricCousineau-TRI 4 years ago
Per f2f with @jamiesnape, a crude first approach would be to figure out what docstrings are not being used.
Can do this with the existing `documentation_pybind.h` and the existing source files.
Script for producing symbols from `pydrake.all`:
Output of `cat documentation_pybind.h | grep -i 'Symbol: '`:
Per Slack convo: The current code is just an initial rough estimate, comparing a certain subset of symbols in C++ to a certain subset of symbols in Python.
Next step is to do symbol-by-symbol correspondence.
To expand on (what I think) Jamie said with an example, the correspondence script may not even need to know Python symbols; instead, it will just look at a source line in the `*_py.cc` files, e.g.:
https://github.com/RobotLocomotion/drake/blob/c6f0525c9c9745a6e61461a176cd6a26a9582a74/bindings/pydrake/math_py.cc#L52
and note that, in context, `cls_doc.ctor.doc_0args` resolves to `pydrake_doc.drake.math.RigidTransform.ctor.doc_0args`.
This will add one usage counter to the corresponding line in `documentation_pybind.h`:
https://gist.github.com/EricCousineau-TRI/e03783a46ab214d4c99f9ad90acf5a81#file-documentation_pybind_snippet-h-L6
Perhaps the most robust way to do this (even for prototyping) is to hack `mkdoc.py` to emit the symbol names to a YAML file or something, which the secondary script can then load and use for counting symbol occurrences.
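A minimal sketch of that two-step idea (function names are hypothetical, and JSON stands in for the suggested YAML to stay within the stdlib):

```python
import json
import re

def dump_symbols(symbols, path):
    """Step 1: mkdoc.py would be hacked to emit the doc-symbol names it
    generates (e.g. every pydrake_doc...doc* constant) to a file."""
    with open(path, "w") as f:
        json.dump(sorted(symbols), f, indent=2)

def count_usages(symbol_path, binding_sources):
    """Step 2: the secondary script loads the symbols and counts how
    often each one occurs across the *_py.cc source texts."""
    with open(symbol_path) as f:
        symbols = json.load(f)
    counts = {s: 0 for s in symbols}
    for source_text in binding_sources:
        for symbol in symbols:
            # Plain substring counting; real code would first expand
            # aliases like cls_doc before counting.
            counts[symbol] += len(re.findall(re.escape(symbol), source_text))
    return counts
```

Symbols with a count of zero would then be the docstrings not used by any binding.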
Additionally, to be able to resolve docstring alias assignments (e.g. `constexpr auto& cls_doc`) and their usage (e.g. `cls_doc.doc`), I would recommend just scanning line by line: whenever you see an alias assignment, update a table with it; then, when you come across a docstring, expand it using that table.
EDIT: One caveat is that this won't completely cover instantiations of generic templates, e.g. `TypeIndex<Class>` or `Identifier<Class>`.
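The line-by-line alias scan could look roughly like this (the regexes are illustrative only, not the exact patterns Drake's sources would require):

```python
import re

# Matches e.g. `constexpr auto& cls_doc = pydrake_doc.drake.math.RigidTransform;`
ALIAS_RE = re.compile(r"constexpr auto&\s*(\w+)\s*=\s*([\w.]+);")
# Matches dotted docstring references, e.g. `cls_doc.ctor.doc_0args`.
DOC_RE = re.compile(r"\b(\w+)((?:\.\w+)+)")

def resolve_docstrings(lines):
    aliases = {}
    resolved = []
    for line in lines:
        m = ALIAS_RE.search(line)
        if m:
            # Record the alias in the table and move on.
            aliases[m.group(1)] = m.group(2)
            continue
        for name, rest in DOC_RE.findall(line):
            if name in aliases:
                # Expand cls_doc.ctor.doc_0args to its full pydrake_doc path.
                resolved.append(aliases[name] + rest)
    return resolved
```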
I could not get `libclang` to parse the `doc` name inside the `.def` function.
I tried `castXML`, but it could not do it either. Brad King said:
> castxml does not generate any information about function bodies, so it is unlikely to work for your purpose.
I tried `cppast` to no avail; it also cannot parse function bodies. (`cppast` claims to give more information about the AST than `libclang`.) Under "Missing features":
> Support for statements: currently function bodies aren't parsed at all;
I'm now working on modifying Pybind11 to get the mapping using the records it keeps.
> I could not get `libclang` to parse the `doc` name inside the `.def` function.
Can I ask if you've tried super simple text parsing? Per the suggestion above:
> [...] the correspondence script may not even need to know Python symbols [...] I would recommend just scanning line-by-line, and whenever you see an alias assignment [...]
This shouldn't require `libclang` or anything to scan the `*_py.cc` files.
Can you try this first, before delving further into `pybind11` internals?
Also, might it be possible to have a first rough-cut of this (using basic text parsing, not C++ AST / libclang / castXML / cppast / etc.) by this Friday (2019/07/19)?
Let's say you point it at `drake/math`, and we see how much of that is covered by `math_py.cc`, looking at each docstring? (This can be on a prototype branch; nothing has to be a PR / merged to master.)
> Also, might it be possible to have a first rough-cut of this (using basic text parsing, not C++ AST / libclang / castXML / cppast / etc.) by this Friday (2019/07/19)?
Sure.
Thanks! And I realized that my above statement may have had an imprecision there:
I just meant to use basic text parsing for the resolution of the docstring symbol, not for producing the symbols in the first place (already done by `mkdoc`).
Next steps: load the results into `pandas`, then look at stuff (percentages).
We made another attempt using the libclang tokenizer, here. (It has a CSV printed using `pandas`.)
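Computing a coverage percentage from such a CSV might look like this (the `symbol`/`uses` column names are my own invention, not the actual schema; the stdlib `csv` module stands in for `pandas` to keep the sketch self-contained):

```python
import csv
import io

def coverage_percentage(csv_text):
    """Percentage of doc symbols that are used at least once,
    given a CSV with one row per C++ doc symbol."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    covered = sum(1 for r in rows if int(r["uses"]) > 0)
    return 100.0 * covered / len(rows)
```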
Before figuring out how to polish towards `master`, we should define an initial blacklist for coverage that we don't necessarily care about (e.g. `//attic/...`, etc.) and internal API (e.g. `MultibodyTree`).
Per f2f, I will review symbols, and give suggestions on what patterns to start filtering out.
Here are the initial broad sweeping suggestions for symbols to ignore:
- `/attic/`
- `/automotive/`
- `dev` namespace
- `internal` namespace (should go away once we fully remove deprecated MBT bindings)
- `@exclude_from_pydrake_mkdoc` (which shouldn't be incorporated in this analysis, but just in case...)

Can you incorporate these exclusions in your analysis?
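These exclusions could be expressed as a small pattern filter, e.g. (the patterns are my own phrasing of the list above, not an official list):

```python
import re

# Substring/regex patterns for symbols or paths we don't want counted
# in the coverage metric.
EXCLUDE_PATTERNS = [
    r"/attic/",
    r"/automotive/",
    r"\bdev\b",       # dev namespace / dev directories
    r"\binternal\b",  # internal namespace
]

def is_excluded(symbol_or_path):
    return any(re.search(p, symbol_or_path) for p in EXCLUDE_PATTERNS)

def filter_symbols(items):
    return [x for x in items if not is_excluded(x)]
```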
For filtering `attic` symbols, we had to compare the header files in the source tree under the directory `attic/` with those in the install tree, because `mkdoc.py` seems to be using files in the install tree. We do so by matching the names of the files and then matching the contents (in case multiple files in the install tree have the same name as one in the source tree). The `attic` bindings are easy; we just don't parse `bindings/pydrake/attic`.

`attic/foo/bar/baz.h` should always be `foo/bar/baz.h` in the install tree. No need to compare file contents, if you use the package + basename instead of just the basename.
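The suggested mapping can be sketched in a few lines (helper name is hypothetical): strip the leading `attic/` and the remaining package-qualified path identifies the install-tree file, with no content comparison needed.

```python
def install_path(source_path):
    """Map a source-tree header path to its install-tree path:
    attic/foo/bar/baz.h -> foo/bar/baz.h; other paths are unchanged."""
    prefix = "attic/"
    if source_path.startswith(prefix):
        return source_path[len(prefix):]
    return source_path
```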
Per f2f, seems like current coverage is at 18% with this measurement step. However, this is only filtering `attic`; it does not filter `internal`, `dev`, `automotive`, or `maliput`.
This filters `internal`, `dev`, `automotive`, and `maliput`.
%age = 23.75
I think the next best step is to try more reasonable carvings of the current metrics, and a little more refining of the filters.
Can you try setting it up for doing queries based on types (if possible), like:
Additional things that would be nice to report:
Additionally, can you add the following directories to your ignore list?
- `common/proto`
- `common/yaml`
- `examples/` (we do have some bindings, but I'm not too concerned with these bindings at the moment)
After this, we should then define file markers / a file manifest that states whether a given file is geared towards pure C++ usage (e.g. `nice_type_name.h`, `hash.h`, `eigen_types.h`). However, that could possibly be filed as a follow-up issue.
EDIT: Er, but to answer when to start polishing - let's try to update the filter and metric carving first.
> Additionally, can you add the following directories to your ignore list?
Here %age = 25.60
Can I ask if you have an update for the additional coverage metrics?
I'm now using XPath to get the answers to the coverage questions: https://github.com/m-chaturvedi/drake/compare/master...m-chaturvedi:meeting_branch_1
Will be adding type-wise coverage as well. Edit: Added type-wise coverage.
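For illustration, a type-wise (per-kind) coverage query can be done with the stdlib's limited XPath support; the XML schema below is entirely hypothetical, not the one the actual tooling generates:

```python
import xml.etree.ElementTree as ET

# Hypothetical symbol dump: one <symbol> element per C++ doc symbol,
# tagged with its kind and whether a binding uses it.
EXAMPLE = """
<symbols>
  <symbol kind="class" documented="1"/>
  <symbol kind="class" documented="0"/>
  <symbol kind="function" documented="1"/>
  <symbol kind="function" documented="1"/>
</symbols>
"""

def coverage_by_kind(xml_text, kind):
    root = ET.fromstring(xml_text)
    # ElementTree supports simple attribute predicates like this.
    nodes = root.findall(f".//symbol[@kind='{kind}']")
    covered = [n for n in nodes if n.get("documented") == "1"]
    return 100.0 * len(covered) / len(nodes)
```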
Will you be able to have a rough-cut of ~~type-wise and~~ directory-wise coverage by this upcoming Tuesday EOD?
EDIT: Just saw edit for type-wise.
Sure.
Updated link after granting permissions.
Looks good! Two minor requests:
- For each directory (e.g. `drake/`, `drake/common/`, `drake/systems/primitives/`), can you also show overall directory coverage?

Thanks! Looks much better.
Some minor follow-up nits:
- `baobab` (Disk Usage Analyzer) on Ubuntu (relates #9650)

We can file these as follow-up issues or TODOs or whatever, just as long as the code that lands makes these things easy to achieve in the future (which hopefully it does?).
I think overall, this is almost ready to start polishing and land on master. Before we do that: @sammy-tri Can I ask if there are any other coverage aspects you think we might want to cover?
\cc @jwnimmer-tri @RussTedrake
Updated: https://docs.google.com/spreadsheets/d/1wlhPxnYbJ1YSUIXZUaasagBf8Nbt8QbiTnPht0J6K8k/edit?usp=sharing
I used the following dictionary for making those classes: https://gist.github.com/m-chaturvedi/2016db62767119af431b04a9adc45667#file-parse_pybind_xml-py-L32-L41
Minor readability request: When you add links in, rather than inlining them into entire statements, can you present them after a colon? e.g.
> Updated: <link>
> Using the following dictionary for those classes: <link>
Will check this out later today, thanks!
Sure, thanks! "I used the following dictionary for making those classes" was even factually incorrect; the link didn't follow the text :)
Just took a peek; looks generally good. A few more minor notes:
You may want to update your sorting based on file vs. directory; at present, you're just sorting lexically, which means you can have an ordering like:
```
a/b/c.txt
a/b/d/e.txt  <-- Directory in the middle.
a/b/f.txt
```
Probably using a simple sorting heuristic would fix that, I think?
Other than that, perhaps we can get a first version set up to PR against master?
Ah, but first, could you also do the breakdown on symbol coverage, esp. with nesting (e.g. in namespaces and in classes, etc.) - not just files?
> You may want to update your sorting based on file vs. directory:
I did that because then there would be too many `DirCoverage` values (1st column), which wouldn't make too much sense.
> Ah, but first, could you also do the breakdown on symbol coverage, esp. with nesting (e.g. in namespaces and in classes, etc.) - not just files?
Sorry had forgotten to update: https://docs.google.com/spreadsheets/d/1MQuzuluffw7CJ5tivJOmOKfE88bI9clTkZX7e9gZ8H0/edit?usp=sharing
> Other than that, perhaps we can get a first version set up to PR against master?
OK.
> I did that because then there would be too many `DirCoverage` values (1st column), which wouldn't make too much sense.
Per f2f, the resolution is that files should be clustered by their immediate directory (not that directories come first in the list, as you'd see in a file explorer).
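One way to sketch that clustering is a sort key of (immediate directory, filename), so files sharing a parent directory stay together instead of being interleaved with subdirectories (helper names are hypothetical):

```python
import os

def cluster_key(path):
    # Primary key: the immediate parent directory; secondary: filename.
    head, tail = os.path.split(path)
    return (head, tail)

def cluster_sort(paths):
    return sorted(paths, key=cluster_key)
```

On the earlier example, this yields `a/b/c.txt`, `a/b/f.txt`, then `a/b/d/e.txt`, keeping the `a/b/` files clustered.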
Target Date for initial PR is Friday, 2019-09-13.
Hey @EricCousineau-TRI! I want to know what the expected workflow for getting the plots / CSV files would be.
That is, would we be generating the coverage CSVs when running the `generate_pybind_documentation_header` command? (Currently, I generate an XML file needed for the remaining analysis using it.)
Or perhaps I should add another Bazel command that needs to be run for generating the CSV files, and then yet another one for opening a Jupyter notebook that shows them? (This new command would need to access the `.cc` files inside the `bindings/pydrake` directories.)
Sorry for the late response here; the next step is to get the script that outputs a CSV onto master via your WIP PR.
After that has landed on master, we can work on Jupyter notebook output / whatever else we use for presenting the results.
Mmanu has split up his PRs into simpler pieces, still reviewing but should land soon.
Since #12040 has merged, going to close this. If we need a public report (e.g. via Jenkins / CDash), then we can file a separate issue.
Per prior discussion: https://github.com/RobotLocomotion/drake/issues/7889#issuecomment-499510945
(I mis-interpreted this from before) Per @ggould-tri's suggestion (I think), we should have some way to show what bindings are missing from the public Drake API.
Can reassign once this is closer to being worked on.
\cc @jwnimmer-tri