Instagram / LibCST

A concrete syntax tree parser and serializer library for Python that preserves many aspects of Python's abstract syntax tree
https://libcst.readthedocs.io/
Other
1.56k stars 192 forks source link

[Windows, Mac] Can't use call_if_inside/call_if_not_inside matcher decorators with multiprocessing #1181

Closed kiri11 closed 2 months ago

kiri11 commented 3 months ago

This issue has been reported before (https://github.com/Instagram/LibCST/issues/848, https://github.com/Instagram/LibCST/issues/629, https://github.com/Instagram/LibCST/issues/435), but I debugged it and found the root cause, so decided to create another one with more information.

Example traceback:

Traceback (most recent call last):
  File "libcst/codemod/_cli.py", line 285, in _execute_transform
    output_tree = transformer.transform_module(input_tree)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "libcst/codemod/_command.py", line 71, in transform_module
    tree = super().transform_module(tree)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "libcst/codemod/_codemod.py", line 108, in transform_module
    return self.transform_module_impl(tree_with_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "libcst/codemod/_visitor.py", line 32, in transform_module_impl
    return tree.visit(self)
           ^^^^^^^^^^^^^^^^
  File "libcst/_nodes/module.py", line 89, in visit
    result = super(Module, self).visit(visitor)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "libcst/_nodes/base.py", line 227, in visit
    _CSTNodeSelfT, self._visit_and_replace_children(visitor)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "libcst/_nodes/statement.py", line 1814, in _visit_and_replace_children
    body=visit_required(self, "body", self.body, visitor),
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "libcst/_nodes/internal.py", line 81, in visit_required
    result = node.visit(visitor)
             ^^^^^^^^^^^^^^^^^^^
  File "libcst/_nodes/base.py", line 236, in visit
    leave_result = visitor.on_leave(self, with_updated_children)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "libcst/matchers/_visitors.py", line 515, in on_leave
    if _should_allow_visit(
       ^^^^^^^^^^^^^^^^^^^^
  File "libcst/matchers/_visitors.py", line 425, in _should_allow_visit
    return _all_positive_matchers_true(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "libcst/matchers/_visitors.py", line 404, in _all_positive_matchers_true
    if all_matchers[matcher] is None:
       ~~~~~~~~~~~~^^^^^^^^^
KeyError: FunctionDef

Root cause

The matchers are dataclasses with __hash__ method returning their id. Their id changes in every process, so we can't compare their instances created in different processes.

Even though the same matcher was added to all_matchers, this was when transformer/visitor initialized in the main process. The check causing exception happens in the child process, which has its own copy of all matchers with another set of ids.

This might work on Linux because it uses fork() for multiprocessing by default. But Windows and MacOS use spawn() with a fresh Python interpreter in each child process. And it will break even on Linux in Python 3.14+ when Python changes the default way to start child processes in Linux.

The issue does not happen with leave/visit matcher decorators, because they are only used with matches, never compared directly.

Potential solutions

  1. _cli.py can re-initialize _matchers of the MatcherDecoratable transformer/visitor in each child process. Most straightforward way and easy to implement. MatcherDecoratable class can have a function just for that. But every other caller must know to do that in the multiprocessing context.

  2. Make MatcherDecoratableTransformer/MatcherDecoratableVisitor multiprocessing-safe, but how? Maybe delayed initialization of _matchers? But it could still be accessed before the spawn. Or maybe it could save the pid with which matchers were initialized and re-initialize them if pid changes? Seems complicated.

  3. Another way that I don't see? Open for discussion.

kiri11 commented 2 months ago

I decided to go with the second approach (delayed initialization). Matchers are populated right before we need them, so they will end up in the same process.