elastic / apm-data

apm-data holds definitions and code for manipulating Elastic APM data
Apache License 2.0
12 stars 25 forks source link

Fix NPE due to reslice only initializing new slice elements #327

Closed lahsivjar closed 1 month ago

lahsivjar commented 1 month ago

Fixes https://github.com/elastic/apm-data/issues/326

Run the test case in https://github.com/elastic/apm-data/issues/326#issuecomment-2245636455 to understand the root issue. Note that the test case is not added here since it is too specific for reproducing the bug and, IMO, will not be useful as a test for long-term maintenance.

kruskall commented 1 month ago

this might be similar to https://github.com/elastic/apm-data/pull/165

lahsivjar commented 1 month ago

@kruskall Yup, you are right they are doing the same thing. Do you want to take that forward instead? I think they are mostly same so we can merge either one... both should fix the NPE edge case.

lahsivjar commented 1 month ago

@kruskall While I like the other PR for removing the util method for populate nil, I think the behavior of Reslice would be different if there are elements in the slice already added. Since this is active now, let's go with this PR unless you have any concerns.

carsonip commented 1 week ago

✅ test-plan-ok

Tested manually by checking out to 8.15.0 tag and 8.15 branch

Adding the following test (supplied in original issue) to input/elasticapm/processor_test.go:


func TestBadReslice(t *testing.T) {
    metadata := `{"metadata": {"service": {"name": "1234_service-12a3","node": {"configured_name": "node-123"},"version": "5.1.3","environment": "staging","language": {"name": "ecmascript","version": "8"},"runtime": {"name": "node","version": "8.0.0"},"framework": {"name": "Express","version": "1.2.3"},"agent": {"name": "elastic-node","version": "3.14.0"}},"user": {"id": "123user", "username": "bar", "email": "bar@user.com"}, "labels": {"tag0": null, "tag1": "one", "tag2": 2}, "process": {"pid": 1234,"ppid": 6789,"title": "node","argv": ["node","server.js"]},"system": {"hostname": "prod1.example.com","architecture": "x64","platform": "darwin", "container": {"id": "container-id"}, "kubernetes": {"namespace": "namespace1", "pod": {"uid": "pod-uid", "name": "pod-name"}, "node": {"name": "node-name"}}},"cloud":{"account":{"id":"account_id","name":"account_name"},"availability_zone":"cloud_availability_zone","instance":{"id":"instance_id","name":"instance_name"},"machine":{"type":"machine_type"},"project":{"id":"project_id","name":"project_name"},"provider":"cloud_provider","region":"cloud_region","service":{"name":"lambda"}}}}`
    errstacktrace := `{"error": {"id": "cdefab0123456789", "trace_id": null, "timestamp": 1533826745999000,"exception": {"message": "Cannot read property 'baz' no defined", "stacktrace": [{"classname": "cl1", "abs_path": "abs_path1"}, {"classname": "cl2", "abs_path": "abs_path2"}, {"classname": "cl3", "abs_path": "abs_path3"}, {"classname": "cl4", "abs_path": "abs_path4"}, {"classname": "cl5", "abs_path": "abs_path5"}]}}}`

    batchProcessor := modelpb.ProcessBatchFunc(func(ctx context.Context, batch *modelpb.Batch) error {
        return nil
    })
    p := NewProcessor(Config{
        MaxEventSize: 100 * 1024,
        Semaphore:    semaphore.NewWeighted(1),
    })

    // Below, we try to emulate a case where all of the capacity of Error#Exception#Stacktrace slice
    // is not populated with non-nil values. In practical cases, this could happen with OTLP handling
    // of exceptions: https://github.com/elastic/apm-data/blob/cee5ac3fc5f2ee66f156133992882cae758e2b66/input/otlp/exceptions.go#L165
    event1 := modelpb.APMEventFromVTPool()
    event1.Error = modelpb.ErrorFromVTPool()
    event1.Error.Exception = modelpb.ExceptionFromVTPool()
    for i := 0; i < 3; i++ {
        // Add 3 so that reallocation makes the capacity 4
        // this will cause the 4th element to be nil
        event1.Error.Exception.Stacktrace = append(event1.Error.Exception.Stacktrace, modelpb.StacktraceFrameFromVTPool())
    }
    require.Equal(t, 4, cap(event1.Error.Exception.Stacktrace))

    event1.ReturnToVTPool()
    event2 := modelpb.APMEventFromVTPool()
    payload2 := strings.Join([]string{metadata, errstacktrace}, "\n")
    err := p.HandleStream(
        context.Background(), event2,
        strings.NewReader(payload2), 10, batchProcessor,
        &Result{},
    )
    require.NoError(t, err)
}

Confirmed that it fails in v1.9.0 and passes in v1.9.1.

On a side note, I cannot wrap my head around how this would manifest practically. After analyzing the code statically, there is no code in apm-server other than otlp code that would possibly insert nil to stacktrace slice, but otlp APMEvents are not pooled. But investigation into that is a separate matter. This PR is test-plan-ok as it defends against bad pooled slices containing nils.