Capture VCS information

graphaelli commented 4 years ago

To enable code deeper code integration a way to identify the exact version of service/file/dependency would be very useful. #130 introduces some dependency-specific concepts. The proposal here is to capture version control information for all first party/non-dependency code, though a single, resuable definition of fields probably makes sense. https://github.com/elastic/ecs/issues/560 may be useful as well.

Note that service.version may also contain useful information but isn't purpose built for integration purposes - it commonly contains user-defined values like v7, 2019-09-16-001, etc.

I propose we capture the following fields under vcs:

field	definition	example value
`repository`	source code location	`https://github.com/elastic/apm.git`
`revision`	revision identifier, eg git ref	`f009776789e91de1909ca59d8415df64c8db16bb`
`type`	vcs type	`git`
`url`	related web URL	`https://github.com/elastic/apm`

That group would be captured under service initially, and considered for inclusion in each stacktrace.frame later. The result would be service.vcs.revision, etc.

Agents would collect this information automatically if possible, eg if a .git directory is present, and otherwise provide a mechanism for users to specify vcs information.

axw commented 4 years ago

What's the difference between vcs.repository and vcs.url? I mean obviously one has .git on the end and one doesn't, but isn't that a matter of canonicalisation? You can clone with or without the trailing .git.

axw commented 4 years ago

To answer my own question (I think): vcs.repository is the VCS URL, and vcs.url is the related web URL.

graphaelli commented 4 years ago

Thanks for pointing that out @axw, just added descriptions for each field.

Qard commented 4 years ago

Github just happens to have very similar repo and web URLs, some other platforms have quite different URLs.

roncohen commented 4 years ago

I think we need to consider carefully if this should be under service. I think we might run the risk of adding all our fields under service because everything seems service related. For example, we have host.name but we could also easily have put it under service.host.name. We did it for service.node.name but I'm not even sure that was right :)

roncohen commented 4 years ago

happy to see this move forward btw @graphaelli. Thanks for getting it started. Will we support customers building a service from multiple repositories?

graphaelli commented 4 years ago

The nesting proposed is driven partially by multiple repository support, whether those are multiple first-party or from dependencies. What this proposal is after is a way to specify a default code repository and eventually add an override code repository. service was chosen to align with the current design, which calls for a service to repo mapping. overrides could be specified by adding vcs information to a stacktrace frame, profile event, etc at a later time.

This means that given a given filename and line number, code would resolve the repo using: 1. stack frame 2. service 3. kibana mapping. The ref would have to be specified with the frame and otherwise fall back to the service for that information. We can dig into details later, but I expect if repo is provided at a more specific level, ref must also be provided.

Initially these 3 frames would map to code as follows:

{
  "service": {
    "environment": "production",
    "name": "some-node-service",
    "vcs": {
      "revision": "abcd1234"
    }
  },
  "span": {
    "stacktrace": [
      {
        "abs_path": "/app/server.js",
        "filename": "server.js",
        "library_frame": false,
        "line": {
          "number": 101
        }
      },
      {
        "abs_path": "/app/vendor/handle.js",
        "filename": "handle.js",
        "library_frame": false,
        "line": {
          "number": 67
        },
        "vcs": {
          "repo": "handle.git",
          "revision": "xxx"
        }
      },
      {
        "abs_path": "/app/node_modules/express/lib/router/layer.js",
        "filename": "node_modules/express/lib/router/layer.js",
        "library_frame": true,
        "line": {
          "number": 95
        },
        "vcs": {
          "revision": "4.17.1"
        }
      }
    ]
  }
}

/app/server.js -> kibana-mapped repo @ abcd1234 (from service.vcs.revision) /app/vendor/handle.js -> handle.git @ xxx

Knowing the final frame is a library_frame provides an opportunity for a kibana side mapping for libraries, but initially code would not be resolved for that bit.

Open to ideas / counter proposals / etc. The goal is to open issues as soon as possible for apm-server to intake and persist this information and agents to provide it.

axw commented 4 years ago

@graphaelli that makes sense to me, but I'm a little concerned about increasing the size of stack traces at the API level, given that this is now adding to every frame from a third-party repo, which in some apps/languages may be many. It's already expensive as it is for some agents.

Also, if we add profile events it may be useful to separate the mapping so we can reuse existing profiling formats.

Maybe we could go with optionally adding vcs info to stack frames for now, and later introduce a mapping in the metadata based on module prefixes? If we were to do that, would it still make sense to have vcs nested under service?

graphaelli commented 4 years ago

I'm a little concerned about increasing the size of stack traces at the API level

Same here, that's why I originally proposed we save that for later (vcs under stack frame), at the expense of not supporting multiple repositories / library frames.

if we add profile events it may be useful to separate the mapping so we can reuse existing profiling formats.

Agreed. This makes me rethink the proposed resolution chain for which repo to use. Perhaps the kibana mapping should always win if present so that older events that include repo data could still be used, in case the repo is moved after the event occurs. I think that's a discussion for another issue.

Maybe we could go with optionally adding vcs info to stack frames for now, and later introduce a mapping in the metadata based on module prefixes? If we were to do that, would it still make sense to have vcs nested under service?

I'm less focused on the repo mapping then on the revision. service.vcs.revision would still be needed in case repository isn't included in the initial list of fields at all, assuming mapping is only possible via kibana.

The benefit of allowing repo via intake is that agents can automatically report it (in some cases, eg .git/ exists), and then no user invention would be required to show code, beyond configuring the kibana code app with that repository. Eventually it's probably possible to import the code without manual intervention too, for well known repos or in case the package is the path like is common in go.

axw commented 4 years ago

I'm less focused on the repo mapping then on the revision. service.vcs.revision would still be needed in case repository isn't included in the initial list of fields at all, assuming mapping is only possible via kibana.

I was thinking that the mapping in the intake metadata would include both the repo and the commit, mapping from module name -> {rep, commit}. This assumes a single version/commit if each dependency though, and some languages do support multiple, so it wouldn't always be sufficient.

I'm still left wondering if service.vcs makes sense in that case. Perhaps vcs.service instead, indicating the vcs info to apply to application stack frames, in lieu of a higher precedence mapping? That would also leave room for, say, vcs.module, which would be a mapping from module names/prefixes to vcs info.

graphaelli commented 4 years ago

Discussed with @axw, mostly around ways to communicate the vcs information from agents efficiently while still enabling the UI to reliably map the code for a given frame.

Expanding on the comment above, the idea is to send vcs info via event metadata to perform mapping per frame to repo & revision later, perhaps in apm-server. In that case, it may be that there is nothing special about mapping first party vs dependency code. It's unclear whether that works for all languages / agents. I'm not sure file prefixes are reliable either as that information is probably lost in some languages.

For discussion sake, this is what metadata might look like if vcs were included only there instead of also in stack frames:

{
  "metadata": {
    "service": {"name": "myservice"},
    "vcs": {
      "modules": [
        {"name":"express", "repository": "https://github.com/expressjs/express.git", "revision": "4.17.1"}
      ]
    }
  }
}

It does look like a vcs.service would be useful after all, for non-library frames. I'm planning to continue exploring this option, feedback welcome in the meantime.

felixbarny commented 4 years ago

For the lack of better alternatives, the Java agent sets the package name as the module in stacktraces. One repository has multiple packages, however. I currently also don't see how we could detect the vcs information for an arbitrary stack frame.

graphaelli commented 4 years ago

Thanks for raising @felixbarny. This is a concern we've discussed a bit but haven't explored deeply. Curious to hear from other @elastic/apm-agent-devs on feasibility.

axw commented 4 years ago

One repository has multiple packages, however

Is that problematic? We can map multiple modules to the same repo.

I currently also don't see how we could detect the vcs information for an arbitrary stack frame.

I think this will generally be pretty complicated to set up for Go devs, particularly as more people move towards using Go Modules where vcs info isn't necessarily locally available. If it's helpful I can try to describe what I think would be a more ideal approach for Go, but I don't want to distract the conversation.

elastic / apm

Capture VCS information #147