colcon / colcon-core

Command line tool to build sets of software packages
http://colcon.readthedocs.io
Apache License 2.0

Implementation docs? #262

Open · mikepurvis opened this issue 4 years ago

mikepurvis commented 4 years ago

There's a design page in the Sphinx docs, but so far it's pretty thin: https://colcon.readthedocs.io/en/released/developer/design.html

I'm just getting started as a colcon user, but I'm also interested in what its capabilities and limitations are as a framework, for when I'm comfortable enough with it to begin hacking on extensions, etc. For example:

It would be great to have some high-level explanation of what the steps are that colcon is taking to execute a build, what is managing what in terms of actions and responsibilities.

dirk-thomas commented 4 years ago

There's a design page in the Sphinx docs, but so far it's pretty thin: https://colcon.readthedocs.io/en/released/developer/design.html

Agreed, it certainly could benefit from being extended. I will answer some questions inline below; for some I will reference PRs I created for the docs based on this ticket. If you have further questions like this, please keep them coming. I will try to either answer them directly or convert the information into docs.

What is the story with setup, local_setup, and _local_setup_util? Which of these suites of files are being generated by colcon core vs the build type plugin vs the package's build system?

All *setup* files in the install prefix path come from colcon and colcon alone. Whenever a build system tries to write similar setup files into the install prefix (e.g. catkin), the generation is suppressed (for catkin packages by passing -DCATKIN_INSTALL_INTO_PREFIX_ROOT=0).

Why is the build ordering logic located in _local_setup_util_sh.py, a file generated into the installspace?

The logic embedded in that Python file to order packages topologically is needed when setting up the environment of packages too: the environment setup of a package might depend on the environment of its dependencies already having been set up. Generating a statically ordered list of packages doesn't work when installing packages in multiple iterations into the same install space. Therefore the script first discovers the packages in the install space, orders them topologically and then sources the package-specific setup files in that order.
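
To illustrate the idea only (this is not the code colcon generates into that file), the "discover, order topologically, source" flow could be sketched roughly like this, with a hypothetical deps mapping standing in for the packages discovered in the install space:

    # Illustrative sketch only - not the logic generated into _local_setup_util_sh.py.
    # `deps` maps each package discovered in the install space to its dependencies.
    def topological_order(deps):
        """Yield package names so that every package comes after its dependencies."""
        visited = set()

        def visit(name):
            if name in visited:
                return
            visited.add(name)
            for dep in deps.get(name, ()):
                if dep in deps:  # only order packages present in this install space
                    yield from visit(dep)
            yield name

        for name in sorted(deps):
            yield from visit(name)

    deps = {'pkg_c': ['pkg_b'], 'pkg_b': ['pkg_a'], 'pkg_a': []}
    for pkg in topological_order(deps):
        print(f'source share/{pkg}/package.sh')  # hypothetical per-package setup file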

How is the environment being prepared for a particular package's build? In the non-merged (default) install case, how are the install trees of the build dependencies ending up merged together?

When a package is being built it gets the environment defined by all its recursive dependencies. For each of these packages the environment is set up in topological order. This only needs to consider the packages in the workspace since other packages (external or from an underlay) must have been sourced beforehand.

As a consequence, if a package doesn't declare a build dependency it should fail to find that dependency's resources (assuming a non-merged install space).

Does the merging work properly even if I --start-with a package partway through the build?

Yes. That being said, you must have built the dependency packages at least once before you can build packages on top of them. But that could have happened in a previous invocation.

The install space doesn't store any information specific to the set of processed packages. It just contains the artifacts of all previously installed packages, and sourcing it processes them in topological order.

What if I don't have the source for that package and only a cached/saved installspace that has local setup files? (these last two cases do not work properly with catkin_tools)

I am not sure I understand the question. I will try to answer anyway, but I may be missing the point; if so, please rephrase it so I can answer it properly.

While workspaces can be chained, each workspace is sourced in the order of the chaining: first all packages within an underlay, then all packages within an overlay.

If you install two different workspaces into the same install space, then sourcing the install space considers all packages in that install space and sources them in topological order (it doesn't matter how that install space was populated).

It looks like a bunch of the environment setup (prepending variables and so-on) is being handled by the files in share/[pkg]/hook/*.dsv. This isn't a catkin concept though; is it a colcon thing? It appears these are mostly generated from templates originating in colcon-core, but these seem pretty specific to the buildsystem in question. What's going on here?

Please see colcon/colcon.readthedocs.org#51 which describes the general environment setup as well as the role of .dsv files.
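
For a rough sense of the shape (this is only from memory and may not be exact; the referenced PR is authoritative): a .dsv file contains one declarative entry per line, roughly of the form operation;name;value, for example something like:

    prepend-non-duplicate;PYTHONPATH;lib/python3.8/site-packages
    source;share/my_pkg/hook/my_hook.sh

The intent, as I understand it, is to describe environment changes in a shell-agnostic, machine-readable way, which colcon then applies when setting up the environment.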

What is happening during the big pause at the beginning of a colcon build invocation? I'm assuming it's scanning the workspace to discover packages,

Correct. Recently a few changes have landed (but not yet released) which improve the performance of the involved code quite a bit (#256, #257, #259) but the time spent in this phase is still significant for larger workspaces.

but I'd love to know more about where the logic for that lives and how it works (and why it takes 10+ seconds for ~600 packages).

Similar to the previously referenced PRs, there is probably potential to reduce the overhead by improving data structures and overall flow. It would be best to profile such a use case (as was done for the past PRs) to identify the most significant bottlenecks - maybe persisting the parsed XML results could be a fruitful option (just speculating).
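
For anyone wanting to reproduce such a profile, a generic (not colcon-specific) approach is to run the code path under cProfile and inspect the cumulative timings; discover_packages below is just a placeholder for whatever is being investigated:

    # Generic profiling sketch; `discover_packages` is a placeholder function.
    import cProfile
    import pstats

    def discover_packages():
        pass  # the code path under investigation

    cProfile.run('discover_packages()', 'discovery.prof')
    stats = pstats.Stats('discovery.prof')
    stats.sort_stats('cumulative').print_stats(20)  # top 20 entries by cumulative time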

It would certainly help if that part were parallelized (as the actual build is).

Part of this is that it would be great to be able to run that once and cache it for successive invocations,

That doesn't sound trivial given the number of things which could change and would affect the result (most of all, any change in the file system in locations considered by the discovery step). If we wanted to go that route, we should make sure that this can be realized through an extension (for which the core would probably need to add a new extension point).

but I'm also curious about whether the discovery stuff could be extracted into a form usable by rosdistro, such that packages listed as source in a distribution.yaml wouldn't necessarily need a package.xml to be patched into them?

One of the slides from the recent ROSCon presentation briefly mentions the parts involved in this (see https://github.com/dirk-thomas/slides_roscon2019_colcon):

  1. Discovery: determine the directories to check for packages, commonly by crawling recursively.
  2. Identification: determine whether a directory contains a package, as well as its name and type (the latter decides the recipe for how it should be processed).
  3. Augmentation: add metadata to identified packages (e.g. additional dependencies which couldn't be extracted by the identification logic because they are not available in a machine-readable form).

Each of these steps is encapsulated by an extension point in colcon which is ultimately implemented by multiple extensions across different packages (e.g. colcon-ros provides an identification extension based on package.xml files). The general design goal in colcon is that all of this logic should be accessible through an API, so it should be feasible to reuse it from other Python code. (I am not sure if it would be desirable to do that from rosdistro and what the implications would be - e.g. the information only being available after cloning the package in full and running the Python logic.)
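
As a hedged sketch of what such an extension can look like (class and attribute names assumed from colcon-core and colcon-ros; check the sources for the exact interface), an identification extension for a hypothetical marker file might be:

    # Hedged sketch - assumes colcon-core's PackageIdentificationExtensionPoint
    # interface (an `identify(metadata)` method receiving a package descriptor
    # with `path`, `name` and `type` attributes); verify against the sources.
    from pathlib import Path

    from colcon_core.package_identification import PackageIdentificationExtensionPoint

    class MarkerFileIdentification(PackageIdentificationExtensionPoint):
        """Identify directories containing a hypothetical `mypkg.toml` marker file."""

        def identify(self, metadata):
            if metadata.type is not None and metadata.type != 'mypkg':
                return  # another extension already claimed this directory
            marker = Path(metadata.path) / 'mypkg.toml'
            if not marker.is_file():
                return
            metadata.type = 'mypkg'
            if metadata.name is None:
                metadata.name = Path(metadata.path).name  # derive the name from the directory

Extensions like this are registered through Python entry points so colcon can discover them; the colcon documentation on writing custom extensions covers that part.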

It would be great to have some high-level explanation of what the steps are that colcon is taking to execute a build, what is managing what in terms of actions and responsibilities.

Please see colcon/colcon.readthedocs.org#52 for the high-level program flow of the list and build verbs. Additionally, colcon/colcon.readthedocs.org#53 briefly enumerates the most important extension points. I hope that helps to give a better overview of how things fit together.

mikepurvis commented 4 years ago

Another aspect of colcon about which I was unable to learn anything from the existing docs is the Observer system. I'm working on a plugin which implements a new verb that has nothing to do with building and doesn't use the build space argument. It mostly works, but I get this angry message upon completion of my verb's executor run:

[0.608s] ERROR:colcon.colcon_core.event_reactor:Exception in event handler extension 'compile_commands': 'Namespace' object has no attribute 'build_base'
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/colcon_core/event_reactor.py", line 78, in _notify_observers
    retval = observer(event)
  File "/usr/lib/python3/dist-packages/colcon_cmake/event_handler/compile_commands.py", line 47, in __call__
    package_level_json_paths = self._get_package_level_json_paths()
  File "/usr/lib/python3/dist-packages/colcon_cmake/event_handler/compile_commands.py", line 97, in _get_package_level_json_paths
    json_path = self._get_path(package_name)
  File "/usr/lib/python3/dist-packages/colcon_cmake/event_handler/compile_commands.py", line 114, in _get_path
    path = Path(self.context.args.build_base)
AttributeError: 'Namespace' object has no attribute 'build_base'

I can silence this by patching in context.args.build_base = '', but I'd like to understand better where this observer thing came from and why it is here. Being in the colcon_cmake package, it appears connected to the build_type, but I can't see what I'm invoking that would have prompted this handler to exist or be observing, or even for the event to have occurred.

It appears that https://github.com/colcon/colcon-cmake/issues/81 raises a similar issue, but my immediate concern is more just about understanding the relationship between these event handlers and the rest of the system.

dirk-thomas commented 4 years ago

Another aspect of colcon about which I was unable to learn anything from the existing docs is the Observer system. I'm working on a plugin which implements a new verb that has nothing to do with building and doesn't use the build space argument.

Anything happening during an invocation of colcon is represented as an event and pushed into a queue, e.g. a package starts/ends, a subprocess is invoked / returns, a subprocess generates stdout / stderr output, etc. Some event types can be found in colcon-core.

The event reactor (implementing the reactor pattern) processes each event from that queue and passes it to various event handlers (which implement the EventHandlerExtensionPoint). Those contain the actual logic to e.g. print the information of an event on the console or to a log file.
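
As a hedged sketch of such an event handler (modeled on the handlers in colcon-core; the exact contract, e.g. that handlers are called with a (data, job) tuple, should be checked against the sources):

    # Hedged sketch of an event handler; interface details assumed from colcon-core.
    from colcon_core.event.job import JobEnded
    from colcon_core.event_handler import EventHandlerExtensionPoint

    class JobLoggerEventHandler(EventHandlerExtensionPoint):
        """Print a line whenever a job (e.g. the processing of one package) ends."""

        def __call__(self, event):
            data = event[0]
            if isinstance(data, JobEnded):
                # `identifier` is typically the package name, `rc` the return code
                print(f'job {data.identifier} ended with return code {data.rc}')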

It mostly works, but I get this angry message upon completion of my verb's executor run:

The CompileCommandsEventHandler is the one raising the exception. It is used to build a compile_commands.json file after an invocation. Conceptually that only needs to happen after a build, but unfortunately the event handler doesn't have the necessary context to skip the action in other cases - either for test as described in colcon/colcon-cmake#81 (which is just unnecessary resource usage) or, in your case, for a custom verb (where it fails since there is no build_base argument).

So far there wasn't a strong reason to resolve the issue since all other verbs had a build_base, so besides the extra resource usage this wasn't a problem. Your temporary workaround is certainly one way to get around the error message. Another temporary solution would be to disable the extension in question by setting the environment variable COLCON_EXTENSION_BLOCKLIST=colcon_core.event_handler.compile_commands.

Anyway, I just created colcon/colcon-cmake#88 which should resolve the original issue and hopefully also addresses the problem with your custom verb. It would be great if you could give it a try and comment on whether that patch works for your use case.

mikepurvis commented 4 years ago

That is helpful to understand, and thanks for putting up the fix so promptly.

Relatedly, I have a situation where I would like to be able to plug in some extra per-package pre/post build functionality to colcon (think analytics, reporting, etc, but could also be heavier-weight stuff like packaging-related tasks). I don't want to have to individually wrap each build task to insert this extra functionality, nor do I want to have to hack on the build verb.

Is listening for JobStarted/JobEnded events and just inserting the work there a reasonable thing to do? Or is there a better way to handle this that exists (or is planned)? All the event handlers discussed here seem to be super lightweight and related to logging/reporting. Perhaps there is an opportunity for a TaskAugmentationExtensionPoint which permits arbitrary code to be run pre-Task and post-Task, in a way that is properly reflected in the status bar, generates log files, etc?

dirk-thomas commented 4 years ago

Is listening for JobStarted/JobEnded events and just inserting the work there a reasonable thing to do?

It probably depends on what exactly you want to do. For light stuff which can happen asynchronously to the ongoing build that sounds fine to me.

For heavier-weight stuff the issue is that all events are processed sequentially, so a long-running, blocking event handler would negatively affect / delay all following events. Maybe you could offload the work into a thread started by the event handler and make sure to join the thread when the EventReactorShutdown event is handled.
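
A hedged sketch of that pattern (class and module names assumed from colcon-core; verify before relying on them):

    # Hedged sketch: offload heavier per-package work to threads and join them
    # when the EventReactorShutdown event arrives. Names assumed from colcon-core.
    import threading

    from colcon_core.event.job import JobEnded
    from colcon_core.event_handler import EventHandlerExtensionPoint
    from colcon_core.event_reactor import EventReactorShutdown

    class BackgroundWorkEventHandler(EventHandlerExtensionPoint):

        def __init__(self):
            super().__init__()
            self._threads = []

        def __call__(self, event):
            data = event[0]
            if isinstance(data, JobEnded):
                thread = threading.Thread(target=self._do_work, args=(data.identifier,))
                thread.start()
                self._threads.append(thread)
            elif isinstance(data, EventReactorShutdown):
                for thread in self._threads:
                    thread.join()  # ensure all background work finished before exiting

        def _do_work(self, pkg_name):
            pass  # heavier per-package post-processing goes here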

One big downside of the whole approach is that there is no clear way to interact with the user since you can't easily generate / post events from that logic. So it feels more like the wrong way of doing it.

Or is there a better way to handle this that exists (or is planned)?

At the moment there is no plan to modify / extend anything in that area. But with enough detailed use cases we could come up with a plan for how they could best be handled and what would need to be changed / added to enable them. Maybe you can describe the desired tasks in some detail so we get a better idea of the desired features?

All the event handlers discussed here seem to be super lightweight and related to logging/reporting.

graphviz_anim is probably the most extreme one since it might take several seconds (if not longer), but since it happens at the very end of the invocation it isn't too harmful.

Perhaps there is an opportunity for a TaskAugmentationExtensionPoint which permits arbitrary code to be run pre-Task and post-Task, in a way that is properly reflected in the status bar, generates log files, etc?

As a kind of cross-cutting aspect - that sounds like an interesting thought. You would probably have to balance what kind of work should be done in those since they would need to be applicable to any kind of task (build vs. test, cmake / python / ament / ros pkgs).

mikepurvis commented 4 years ago

You would probably have to balance what kind of work should be done in those since they would need to be applicable to any kind of task (build vs. test, cmake / python / ament / ros pkgs).

It would be important for the augmenting function/plugin to have some insight into what the verb/builder/task is; that could be as simple as passing the information through and letting the plugin return early if its functionality isn't applicable. Or there could be a mechanism to pre-filter, which could be as simple as a regex against the name or something. Certainly, to be useful it would need to have access to whatever the TaskContext is for the associated task.
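
Purely as a strawman for that idea (nothing like this exists in colcon today), such an extension point might look roughly like:

    # Purely hypothetical - a sketch of the proposed extension point, not part of colcon.
    class TaskAugmentationExtensionPoint:
        """Run arbitrary code before and after a task (build, test, ...)."""

        def matches(self, verb_name, task_context):
            """Return True if this augmentation applies to the given verb / task."""
            return True

        async def pre_task(self, task_context):
            """Called before the task runs; may inspect the package, args, etc."""

        async def post_task(self, task_context, return_code):
            """Called after the task finished, with access to its return code."""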

I still don't have as clear a sense of how all this works in colcon-land. The catkin_tools approach is very declarative in this regard, with explicit, named CommandStage/FunctionStage instances that have to be defined upfront, whereas it seems that fully embracing the coroutine model has meant colcon can be more flexible, basically just running arbitrary Python code to drive the build and yielding whenever a subprocess or anything else happens.
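
To make the comparison concrete, here is a generic asyncio illustration of that coroutine-driven style (not colcon's actual task API): arbitrary Python code drives the work and simply awaits whenever a subprocess or anything else is in flight.

    # Generic asyncio illustration of the coroutine model - not colcon's task API.
    import asyncio

    async def build_one_package(name):
        print(f'configuring {name}')  # arbitrary Python logic can run here...
        proc = await asyncio.create_subprocess_exec('cmake', '--version')
        await proc.wait()  # ...and the coroutine yields while the subprocess runs
        print(f'{name} finished with return code {proc.returncode}')

    async def main():
        # the executor can interleave many such coroutines concurrently
        await asyncio.gather(*(build_one_package(p) for p in ('pkg_a', 'pkg_b')))

    asyncio.run(main())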

As for my use case, I'm experimenting with tarring up the installspaces after a build and sending them off to a cache from which they can be pulled in to pre-cook the next build. The overall operations are:

Currently I'm doing this in a separate verb invoked when the build is complete; however, there are several reasons to want to do it in an interleaved manner:

So yeah, I'm not sure what other use cases might justify the addition of a formal extension point here, but if you think it might have legs, I can open a separate issue to discuss further. In the meantime, I'll maybe wander down the path of kicking this work onto a separate background thread that manages its own logging and so on.
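
For what it's worth, the per-package archiving step itself is the easy part; a rough sketch (the paths and the cache transport are placeholders for illustration):

    # Rough sketch of archiving one package's install space for a cache.
    import tarfile
    from pathlib import Path

    def archive_install_space(install_base, pkg_name, cache_dir):
        """Create <cache_dir>/<pkg_name>.tar.gz from <install_base>/<pkg_name>."""
        src = Path(install_base) / pkg_name
        dst = Path(cache_dir) / f'{pkg_name}.tar.gz'
        dst.parent.mkdir(parents=True, exist_ok=True)
        with tarfile.open(dst, 'w:gz') as tar:
            tar.add(src, arcname=pkg_name)
        return dst

    # e.g. archive_install_space('install', 'my_pkg', '/tmp/colcon_cache')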

MaximilienNaveau commented 3 years ago

Hi,

Additional documentation on using --symlink-install would be welcome. I created a pure CMake (no ament nor catkin) package and I am installing some Python scripts from the source folder. I would like these to be installed as symlinks when using colcon build --symlink-install, but the information in the documentation is very limited.

Thanks for your help.

dirk-thomas commented 3 years ago

@MaximilienNaveau --symlink-install requires support from the build system. For a pure CMake package that isn't the case, and therefore the installation will copy files. In order to support this feature for your pure CMake package, colcon-cmake would first need to pass that flag somehow (likely through an environment variable) and then your CMake code would need to override the install() behavior. That is one of the features ament_cmake provides.