Adding some additional pprof profiles. Here are two taken back to back, with memory jumping from ~3 GB to ~5 GB within a couple of seconds. Right after that the pod OOM crashed.
pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.014.pb.gz
pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.015.pb.gz
You can see resolve.squashFragments jump from 2.21 GB to 4 GB in about 1 second. Then the whole pod crashes.
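For reference, heap profiles like these can be grabbed back to back from the Go pprof endpoint that Dgraph Alpha exposes on its HTTP port. A minimal sketch in Go (assuming the default port 8080 and a locally reachable Alpha; the file names are illustrative):

package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
    "time"
)

// snapshot downloads one heap profile from the Alpha's pprof endpoint.
func snapshot(name string) error {
    resp, err := http.Get("http://localhost:8080/debug/pprof/heap")
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    f, err := os.Create(name)
    if err != nil {
        return err
    }
    defer f.Close()

    _, err = io.Copy(f, resp.Body)
    return err
}

func main() {
    // Take two snapshots a second apart to catch a jump like the one above.
    for i := 0; i < 2; i++ {
        if err := snapshot(fmt.Sprintf("heap.%03d.pb.gz", i)); err != nil {
            panic(err)
        }
        time.Sleep(time.Second)
    }
}

Each file can then be opened with go tool pprof to compare inuse_space between the two snapshots.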
Some additional information after seeing the following issue: https://github.com/dgraph-io/dgraph/issues/5315
I am running ~90 mutations per GraphQL mutation block. Each mutation is nested 2 levels deep.
Update: I have since dropped the mutation block size to 10 mutations per block, and it is still running through all of the memory.
Hi, here is most of my schema, plus a sample of the type of mutation I'm doing. The mutations are dynamic based on the streaming input data we get, so they could change quite a bit, but this is probably the most common one.
type Workspace {
workspaceId: String! @search(by: [hash]) @id
workspaceName: String
}
interface Id {
key: String! @search(by: [hash]) @id
onWorkspace: [Workspace]!
hasTraits: [Traits]
hasGroupTraits: [GroupTraits]
}
type AnonymousId implements Id {
email: [Email] @hasInverse(field: anonymousId)
userId: [UserId] @hasInverse(field: anonymousId)
hasExperiment: [Experiment]
}
type Email implements Id {
anonymousId: [AnonymousId] @hasInverse(field: email)
userId: [UserId] @hasInverse(field: email)
}
type UserId implements Id {
anonymousId: [AnonymousId] @hasInverse(field: userId)
email: [Email] @hasInverse(field: userId)
}
type Traits {
id: String! @search(by: [hash]) @id
traitBlob: String! @search(by: [regexp])
integration: String! @search(by: [term])
onWorkspace: [Workspace!]!
createdOn: DateTime!
}
Mutation:
upd2: updateAnonymousId(input: {
  filter: {key: {eq: "NewTest2"}},
  set: {
    onWorkspace: [{workspaceId: "testWorkspace"}],
    hasTraits: [{id: "%7B%22TEST%22%3A%22test%22%7D:CLIENT:testWorkspace", integration: "${integration}", traitBlob: "%7B%22TEST%22%3A%22test%22%7D", createdOn: "2020-05-11T17:37:40.664Z", onWorkspace: [{workspaceId: "testWorkspace"}]}],
    userId: [{
      key: "testUser",
      onWorkspace: [{workspaceId: "testWorkspace"}],
      hasTraits: [{id: "%7B%22TEST%22%3A%22test%22%7D:CLIENT:testWorkspace", integration: "${integration}", traitBlob: "%7B%22TEST%22%3A%22test%22%7D", createdOn: "2020-05-11T17:37:40.664Z", onWorkspace: [{workspaceId: "testWorkspace"}]}]
    }]
  }
}) {
  anonymousId {
    key
  }
}
Hi @guhan-v, I think I mentioned in Slack that we were trying to fix this for the 20.03.2 release. It didn't make it in because that was an urgent release. We'll try and get it out in 20.03.3 ASAP.
Gotcha @MichaelJCompton, thanks for the heads up. Do you happen to know if there is a schedule for that release? I just want to plan my next sprint appropriately.
I'll touch base about the release when it firms up. There are a couple of changes heading for 20.03.3.
This fix should be done by early next week.
Sounds good, thanks for the heads up.
Hey, sorry for the delay. We have merged a solution in the release/v20.03 branch. Upon running benchmarks in the new release branch, we can see significant improvement.
Would you be able to check if the solution works for you?
What version of Dgraph are you using?
20.03.1
Have you tried reproducing the issue with the latest release?
This is the latest release
What is the hardware spec (RAM, OS)?
Kubernetes running on 3 m5.xlarge nodes (4 vCPU / 16 GB RAM each)
Steps to reproduce the issue (command/config used to run Dgraph).
K8s yaml for cluster setup
Hitting the Alpha load balancer with ~25-75 mutations/sec to ingest data into the graph, or really any consistent flow of data into the Alpha nodes.
Expected behaviour and actual result.
Expected behaviour:
Alpha nodes handle a consistent mutation load with stable memory usage and no OOM restarts.
Actual result:
Below shows the Dgraph Alpha nodes spiking and failing (back when I set the memory limit at 6 GB; the same thing happens with 12 GB):
Continual cycle of OOM kills and CrashLoopBackOff restarts.
Image of the Pod cycling through OOM errors
The pprof profiles do not show the same memory consumption; they only show ~2 GB in use even when the container is dying. The inuse_objects profiles do show a rather high number of objects, though. I believe the issue lies with a lack of GC, or potentially a memory leak within the Alpha pods.
pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.009.pb.gz pprof.dgraph.samples.cpu.001.pb.gz
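One way to cross-check why pprof shows ~2 GB while the container is killed at its limit: the heap endpoint's debug=1 text output ends with a dump of runtime.MemStats, which separates heap actually in use from memory the Go runtime still holds but has not yet released back to the OS; the latter is part of what the cgroup limit and OOM killer count against the pod. A minimal sketch (same assumed Alpha address as above) that prints just those summary lines:

package main

import (
    "bufio"
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // debug=1 returns the heap profile as text; the trailing "# ..." lines
    // are runtime.MemStats fields (HeapInuse, HeapIdle, Sys, ...).
    resp, err := http.Get("http://localhost:8080/debug/pprof/heap?debug=1")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    sc := bufio.NewScanner(resp.Body)
    sc.Buffer(make([]byte, 1024*1024), 1024*1024) // profile text can be large
    for sc.Scan() {
        line := sc.Text()
        if strings.HasPrefix(line, "# Heap") || strings.HasPrefix(line, "# Sys") {
            fmt.Println(line)
        }
    }
    if err := sc.Err(); err != nil {
        panic(err)
    }
}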
This issue is blocking our team, so any help would be greatly appreciated.