census-instrumentation / opencensus-erlang

A stats collection and distributed tracing framework
https://opencensus.io
Apache License 2.0

Potential Memory Leak #168

Open TylerPachal opened 4 years ago

TylerPachal commented 4 years ago

I have an Elixir project where I am using opencensus-erlang for the first time. I have a helper function for wrapping pieces of my code in spans:

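def span(name, attributes, func) do
  # Remember the current span so it can be restored afterwards.
  parent_span_ctx = :ocp.current_span_ctx()

  # Start a child of the current span.
  new_span_ctx =
    :oc_trace.start_span(name, parent_span_ctx, %{
      :attributes => attributes
    })

  # Make the child the current span while func runs.
  :ocp.with_span_ctx(new_span_ctx)

  try do
    func.()
  after
    # Finish the child and restore the parent, even if func raises.
    :oc_trace.finish_span(new_span_ctx)
    :ocp.with_span_ctx(parent_span_ctx)
  end
end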

To talk to my collector I am using the opencensus_service library.

In my application code, I have a layout that is something like this, with a parent span that has some children (which may have children themselves):

span("span_a", attributes, fn ->
  span("span_b", attributes, fn ->
    span("span_c", attributes, fn -> :ok end)
    span("span_d", attributes, fn -> :ok end)
  end)
end)

When running this code, my application runs out of memory very quickly, and the offending process is oc_reporter. I was able to replicate this locally and watch the memory usage with Observer. The memory usage is mainly binary:

[Screenshots: Observer "Memory Usage" and "Processes" views]

I thought I might be sending too much data, but after poking around in Observer I don't see the ETS tables growing unbounded or queue sizes increasing, so I don't think that is the cause. When running locally I had a Collector set up that was just writing to a file. In production our Collector publishes to Kafka.

I also tried forcing some garbage collections and hibernating the process to see if I could reclaim any memory that way, but that didn't help either.
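
For anyone wanting to repeat the garbage-collection check, here is a minimal sketch. `BinaryLeakProbe` is a hypothetical helper module, not part of opencensus-erlang. It relies on `Process.info(pid, :binary)`, whose return shape is an undocumented implementation detail of the VM and may change between OTP releases.

```elixir
defmodule BinaryLeakProbe do
  # Hypothetical helper for illustration; not part of opencensus-erlang.

  # Sums the sizes (in bytes) of the reference-counted binaries that
  # `pid` currently holds a reference to.
  def binary_bytes(pid) do
    {:binary, refs} = Process.info(pid, :binary)

    refs
    # The same binary can be listed more than once; count it once.
    |> Enum.uniq_by(fn {id, _size, _refcount} -> id end)
    |> Enum.map(fn {_id, size, _refcount} -> size end)
    |> Enum.sum()
  end

  # Forces a garbage collection of `pid` and returns
  # {bytes_before, bytes_after}.
  def gc_and_compare(pid) do
    before = binary_bytes(pid)
    true = :erlang.garbage_collect(pid)
    {before, binary_bytes(pid)}
  end
end
```

Assuming the reporter is registered locally as `:oc_reporter`, this could be called as `BinaryLeakProbe.gc_and_compare(Process.whereis(:oc_reporter))`. If the two numbers are close, the process still references those binaries and garbage collection alone cannot release them.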

I am now stuck and don't really know where to go from here:

  1. Either I am using this library incorrectly, or
  2. There is a memory leak somewhere that occurs under heavy usage, triggered by some component slowing down.

tsloughter commented 4 years ago

Hm, I'm surprised it isn't helped by manually running garbage collection since it is a binary leak.

Another hack you could try is to kill the reporter and see if the garbage is cleaned up.
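
A rough sketch of that experiment follows. `ReporterRestartProbe` is a hypothetical helper, not part of opencensus-erlang, and it assumes the calling process is not linked to the process being killed (true for a plain IEx shell).

```elixir
defmodule ReporterRestartProbe do
  # Hypothetical helper for illustration; not part of opencensus-erlang.

  # Kills `pid`, waits for it to go down, and returns the VM-wide
  # binary memory (in bytes) as {before_kill, after_kill}.
  def kill_and_compare(pid, timeout \\ 5_000) do
    before = :erlang.memory(:binary)

    ref = Process.monitor(pid)
    Process.exit(pid, :kill)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
    after
      timeout -> raise "process did not exit within #{timeout} ms"
    end

    {before, :erlang.memory(:binary)}
  end
end
```

Assuming the reporter is registered as `:oc_reporter` and its supervisor restarts it, `ReporterRestartProbe.kill_and_compare(Process.whereis(:oc_reporter))` would show whether the binaries were held only by that process. The second figure can lag slightly, so it may be worth reading `:erlang.memory(:binary)` again a moment later.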

If you are only using tracing then my suggestion would be to move to OpenTelemetry https://github.com/open-telemetry/opentelemetry-erlang -- I reworked how the reporter works in such a way that garbage should be less likely to accumulate because a new process is used for each report run and then discarded.

Note, though, that there are 2 open PRs that will change how it works. We are nearing GA and will stop breaking the API soon, and those PRs don't change how the Elixir with_span macro works. I think you would want to use that macro, since it is essentially the same as the function you wrote to wrap OpenCensus.

If the obstacle to switching is that you already run the OpenCensus collector in your environment and can't simply change to the OpenTelemetry collector, then I can port a version of the Erlang OC reporter over to OpenTelemetry.

TylerPachal commented 4 years ago

Thanks for the quick comment!

... but they don't change how the Elixir with_span macro works which I think you would want to use ...

Where is that coming from? Is there a different dependency I should be using in my Elixir project?

tsloughter commented 4 years ago

I was referring to OpenTelemetry, not OpenCensus. The OpenTelemetry API includes an Elixir API: https://github.com/open-telemetry/opentelemetry-erlang/blob/master/apps/opentelemetry_api/lib/open_telemetry/tracer.ex
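
As a rough illustration, the nested layout from the original report might look like this with that macro. This is a sketch that assumes the `opentelemetry_api` package is a dependency and that `with_span` has the shape found in its 1.x releases; the pre-GA API discussed in this thread may differ. `MyApp.Tracing` is a hypothetical module name.

```elixir
defmodule MyApp.Tracing do
  # Hypothetical module; assumes the opentelemetry_api dependency.
  # with_span is a macro, so the tracer module must be required.
  require OpenTelemetry.Tracer, as: Tracer

  def run do
    # Each with_span starts a child of the current span, makes it
    # current for the duration of the block, and ends it afterwards.
    Tracer.with_span "span_a" do
      Tracer.with_span "span_b" do
        Tracer.with_span "span_c" do
          :ok
        end

        Tracer.with_span "span_d" do
          :ok
        end
      end
    end
  end
end
```

The macro takes over the bookkeeping that the hand-written `span/3` helper does with `:ocp.with_span_ctx/1` and `:oc_trace.finish_span/1`.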