bloomberg / memray

Memray is a memory profiler for Python
https://bloomberg.github.io/memray/
Apache License 2.0

Move the runner into a separate package with minimal dependencies #557

Open kshpytsya opened 6 months ago

kshpytsya commented 6 months ago

Is there an existing proposal for this?

Is your feature request related to a problem?

The system I am currently working on involves some confidential customer data to which regular developers do not have access. Due to the complex nature of the ML algorithms and libraries involved, we sometimes encounter anomalous memory usage which we cannot easily reproduce with our test data set. I am considering integrating some memory profiling mechanism, possibly involving memray, which would be selectively enabled in the production environment to help us get insights into problematic cases without requiring access to customer data.

For multiple reasons (security, reliability, docker image size to name a few) we try to minimize the number of our dependencies. Memray has a number of dependencies which are unnecessary for trace collection:

https://github.com/bloomberg/memray/blob/578e02d20bc63bddc0fc25c28bf7569519b89838/setup.py#L91-L96

and those have their own dependencies as well.

Describe the solution you'd like

It would be great to split memray into two packages, tentatively named memray-core and memray. The former would only be able to perform collection, and the latter would do everything memray does today.

Alternatives you considered

Another possibility would be to put all the dependencies not related to memray.Tracker under "extras", but that would probably be a nuisance for most ordinary users, who would want a simple pip install memray to install everything.

godlygeek commented 6 months ago

Hm. It's true that Jinja2, Rich, and Textual aren't needed for collecting profile data, but they are needed for generating reports from that data, and as a general rule we expect the reporters to be run on the same system where the capture was generated... In the case of --native mode reports, it can be extremely difficult to reproduce, on a different machine, an environment where the instruction pointers and shared library names stored in the capture file can be interpreted correctly. Because of that, the idea of a memray-core package with no support for interpreting the capture file seems problematic: it encourages a workflow that's likely to cause a lot of friction at the very least, and that may prevent certain types of reports from being generated from the capture file at all.

kshpytsya commented 6 months ago

I do realize it may need extra refactoring, but I do believe it would be possible to implement the collection phase (including extracting the required information from the running environment, e.g. from shared libraries) without those libraries, and to make the resulting files completely portable. Of course, the code is yours, but to me this would seem a better architecture.

To clarify, I envision roughly the following flow:

  1. with memray_core.Tracker: ... behaves exactly as it does today. memray.Tracker would just be a re-export of memray_core.Tracker.
  2. memray_core.augment(<list of files produced by Tracker>, ...) (tentative name) either produces one or more files containing all the additional data required to make the files from step 1 portable, or aggregates everything into a single file containing both the data from step 1 and the additional data.
  3. memray flamegraph|table|tree|parse|summary|stats <data from step 2> can be executed on a different machine (possibly running a different OS and Python version).

I am unsure whether a CLI for steps 1 and 2 would be desired/useful.
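
For illustration, here is a rough Python sketch of that two-phase flow. The memray_core module name and the augment() function are hypothetical, taken from the tentative names in the list above; run_ml_pipeline() is just a placeholder for the profiled workload:

# Phase 1 (production machine): capture with the minimal-dependency package.
import memray_core  # hypothetical split-off package

with memray_core.Tracker("run.bin", native_traces=True):
    run_ml_pipeline()  # placeholder for the workload being profiled

# Phase 2 (still on the production machine): bundle the symbol / shared-library
# information needed to interpret the capture elsewhere.
memray_core.augment(["run.bin"], output="run.portable.bin")  # hypothetical API

# Phase 3 (developer machine, possibly a different OS / Python version):
# generate reports with the full memray package, e.g.
#   memray flamegraph run.portable.bin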

pablogsal commented 6 months ago

Of course, the code is yours, but to me this would seem a better architecture.

It's better in the sense that it's more modular and allows doing what you are suggesting, but it has other significant drawbacks:

In any case, we will discuss this, as there is some potential gain here, but my guess is that we will unfortunately need to reject this suggestion: it complicates maintenance and releases, and it also entails some extra development to support a workflow that is not the most common one.

pablogsal commented 6 months ago

Update: we agreed that this makes sense to do, and we will investigate ways to handle the complexity. Not sure how long it will take, but we will probably do some form of this 👍

sarahmonod commented 4 months ago

After talking about it, we decided we would take the following approach:

  1. Make an empty memray package that contains NO code, and that depends upon memray_core[all] of the exact same version (so memray==a.b.c depends on memray_core==a.b.c)
  2. Make a new package called memray_core that contains all of the code, but with optional dependencies, so that we can separate dependencies into multiple groups: html-reporters, textual-reporters, and all (that third one containing the union of the two kinds of reporters)
  3. Change memray to be able to run with no reporters at all (so only the tracking API + CLI), with only one group of reporters, or with all of them.

That last one means we also need to do the following (a rough sketch of both points appears after this list):

  1. Change the listing of all reporters (by the CLI) to be dynamic, on a plugin-like system that can handle all combinations of reporters being installed
  2. Change the live reporter to exit and fail with a meaningful error message if the textual-reporters extra isn't installed
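
As a rough illustration of those two points (not the actual implementation; the module paths and class names below are only assumptions), the CLI could build its list of subcommands from whichever reporters import successfully, and the live command could fail with an actionable message:

import importlib

# Illustrative mapping of CLI subcommands to reporter classes.
_CANDIDATE_REPORTERS = {
    "flamegraph": ("memray.reporters.flamegraph", "FlameGraphReporter"),
    "table": ("memray.reporters.table", "TableReporter"),
    "live": ("memray.reporters.live", "LiveReporter"),
}

def available_reporters():
    """Return only the reporters whose optional dependencies are installed."""
    found = {}
    for name, (module, cls) in _CANDIDATE_REPORTERS.items():
        try:
            found[name] = getattr(importlib.import_module(module), cls)
        except ImportError:
            continue  # e.g. the textual-reporters extra is not installed
    return found

def run_live_command(args):
    reporters = available_reporters()
    if "live" not in reporters:
        raise SystemExit(
            "The live reporter requires the textual reporters; "
            "install them with: pip install 'memray_core[textual-reporters]'"
        )
    ...
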
bollwyvl commented 2 months ago

Cheers to this effort! :tada:

Wearing my reproducibility hat: [extras] are... pretty disappointing as a long-term strategy, compounded when extras invoke other (downstream) extras. These constraints can be ignored on successive pip installs, which could lead to undesirable behavior, and they aren't checked by pip check for satisfiability. This leaves the dreaded "more packages" alternative, but these can be relatively painless with modern PEP 517 backends... flit in particular excels at making publishable packages from one-hand-after-fourth-of-july-countable files (e.g. the_module.py, LICENSE, and README).

So in the "more packages" architecture, a pure-python memray-interfaces could be at the intersection of all dependencies (core and options). This could describe:

Ignoring one vs. many packages, as for plugins: there are really very few approaches as simple and robust as declarative entry points, and using this inside a package really feels quite good:

[project.entry-points."memray.v0.reporter"]
console = "memray.reporter.console:ConsoleReporter"
html = "memray.reporter.html:HTMLReporter"

Where the reporter is a well-typed class description of the minimum it needs to do the thing, e.g. Reporter.cli_parser() -> ArgumentParser, Reporter.report(data) -> str. Some basic sanity checking when importing these pretty much handles composable behavior in the face of optional dependencies, either by failing to import (with or without warning) and/or having a Reporter.check_dependencies() -> bool.
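
A minimal sketch of how memray could consume that entry point group, assuming the Reporter contract and the memray.v0.reporter group name suggested above (the group= keyword of entry_points() needs Python 3.10+ or the importlib_metadata backport):

from argparse import ArgumentParser
from importlib.metadata import entry_points
from typing import Any, Dict, Protocol

class Reporter(Protocol):
    """The minimal contract a reporter plugin is assumed to satisfy."""
    def cli_parser(self) -> ArgumentParser: ...
    def report(self, data: Any) -> str: ...
    def check_dependencies(self) -> bool: ...

def load_reporters() -> Dict[str, Reporter]:
    """Discover installed reporters registered under the suggested group."""
    reporters: Dict[str, Reporter] = {}
    for ep in entry_points(group="memray.v0.reporter"):
        try:
            reporter_cls = ep.load()
        except ImportError:
            continue  # the plugin's optional dependencies are missing
        reporter = reporter_cls()
        if reporter.check_dependencies():
            reporters[ep.name] = reporter
    return reporters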

This has the follow-on of allowing downstreams to also participate as first class citizens, e.g. for a speedscope reporter:

[project.entry-points."memray.v0.reporter"]
speedscope-json = "memray_speedscope:SpeedScopeJSONReporter"
speedscope-html = "memray_speedscope:SpeedScopeHTMLReporter"
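
The downstream package itself would then only need a class honouring that contract; an entirely hypothetical memray_speedscope module might look roughly like this:

# memray_speedscope.py -- hypothetical third-party reporter plugin
from argparse import ArgumentParser

class SpeedScopeJSONReporter:
    def cli_parser(self) -> ArgumentParser:
        parser = ArgumentParser(prog="memray speedscope-json")
        parser.add_argument("results", help="capture file produced by the tracker")
        return parser

    def report(self, data) -> str:
        # A real implementation would convert memray's allocation records
        # into speedscope's JSON document format.
        raise NotImplementedError

    def check_dependencies(self) -> bool:
        return True  # pure Python, no optional dependencies to check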

Good luck, looking forward to this!

godlygeek commented 2 months ago

Really, what we want is almost the opposite of extras. In a perfect world, we wouldn't have opt-in dependencies, we'd have opt-out dependencies. pip install memray would give you all of the first party reporters, and if for some reason you didn't want one of those, you could do pip install memray[-textual] or some such (just spitballing some syntax) to say that you want memray, but not its reporters that require Textual. But that's not the tooling we've got to work with today...

agriyakhetarpal commented 2 months ago

See https://github.com/tiangolo/typer/blob/master/pdm_build.py and https://github.com/tiangolo/typer/blob/master/pyproject.toml#L62-L105 as an example of a real-world package that implements "negative optional dependencies" (FastAPI and others do similarly). There is no documentation about this since it is an internal configuration, but following in their footsteps could be a viable option.

agriyakhetarpal commented 2 months ago

xref for the much-requested (and awaited) setuptools feature request on their issue tracker: https://github.com/pypa/setuptools/issues/1139

bollwyvl commented 2 months ago

real-world package

As a downstream repackager of some of those packages... yeah, please don't. If one were to pip install whatever-slim, then whatever, and then uninstall whatever-slim, they would get... a broken whatever. On conda-forge we do some... terrible things to ensure that doesn't happen and that pip check stays appeased, but I really can't encourage that direction.

I humbly submit that multiple lightweight packages with real pins (hard == ones on the core package, sensible, tested ones on third-party dependencies) will beat PEP 517 build MAGIC_ENV_VAR trickery every time. Then, [extras] can be used to refer to those, but they'll never go pear-shaped on subsequent solves, as the real pins will tie the room together.