@JimWen can you show me the contents of the p directory on which the crash happened? From the information in the crash, it looks like the builder was trying to build a table of around 4 GB (4172166119 bytes ≈ 4.0 GB). The usual size for tables is 70 MB, so this size is definitely odd. I want to see the sizes of the other SST files in this p directory.
@jarifibrahim I'm afraid I can't show you the contents of the p directory, because I had to clear all the alpha data and reimport it to solve this problem.
@JimWen Does the crash show up when you bulk load a lot of data like 5 billion nquads? I'll try to insert 5 billion nquads in dgraph and see how it goes.
No, it just happens during live load or online mutations through the API.
Got it. I'm running Dgraph on 2 billion RDFs. Let me see if it goes through.
@JimWen my 2-billion test didn't crash. Do you have any suggestions, or maybe specifics of what you did, so that I can reproduce the failure? I haven't tried using the Java API because that shouldn't have any effect on this crash.
1. If the schema is like `type User { follower: uid }` but the data actually written is `[uid]`, i.e. 1-N, would this just cause data loss, or could it also cause a crash?
@jarifibrahim It crashed again after writing online for 7 days using v20.03.3; the total data is about 7 billion edges. It was running in Ludicrous Mode, and the crash log is as follows:
```
panic: runtime error: slice bounds out of range [:4172166156] with capacity 1963281920

goroutine 21418 [running]:
github.com/dgraph-io/badger/v2/table.(*Builder).addBlockToIndex(...)
        /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/table/builder.go:296
github.com/dgraph-io/badger/v2/table.(*Builder).finishBlock(0xc1b382c180)
        /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/table/builder.go:290 +0x64d
github.com/dgraph-io/badger/v2/table.(*Builder).Add(0xc1b382c180, 0xc18ac559c0, 0x1e, 0x20, 0x440, 0x0, 0xc13621a025, 0x154, 0x13db, 0x0, ...)
        /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/table/builder.go:333 +0xf8
github.com/dgraph-io/badger/v2.(*levelsController).compactBuildTables(0xc0001c2070, 0x2, 0x1b81ee0, 0xc06930d040, 0xc00018e8a0, 0xc00018e900, 0xc21da10768, 0x1, 0x1, 0xc21da10770, ...)
        /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:613 +0x90b
github.com/dgraph-io/badger/v2.(*levelsController).runCompactDef(0xc0001c2070, 0x2, 0x1b81ee0, 0xc06930d040, 0xc00018e8a0, 0xc00018e900, 0xc21da10768, 0x1, 0x1, 0xc21da10770, ...)
        /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:862 +0xe1
github.com/dgraph-io/badger/v2.(*levelsController).doCompact(0xc0001c2070, 0x2, 0x3ff0303383000000, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:931 +0x49d
github.com/dgraph-io/badger/v2.(*levelsController).runWorker(0xc0001c2070, 0xc01e3893e0)
        /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:368 +0x2fd
created by github.com/dgraph-io/badger/v2.(*levelsController).startCompact
        /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:343 +0x88
```
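For context on what this panic means, here is a generic Go sketch (unrelated to Badger's internals): re-slicing beyond a slice's capacity produces exactly this error, so the table builder must have computed an end offset (4172166156) larger than its buffer's capacity (1963281920).

```go
package main

import "fmt"

func main() {
	buf := make([]byte, 0, 8) // capacity 8, standing in for the 1963281920-byte buffer
	end := 16                 // standing in for the oversized offset 4172166156

	// Recover so the demo prints the error instead of crashing.
	defer func() { fmt.Println("recovered:", recover()) }()

	_ = buf[:end] // panic: slice bounds out of range [:16] with capacity 8
}
```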
And I also checked the p directory; there is no table anywhere near 4 GB in size. The largest files are listed below:
```
-rw------- 1 work work 70M Jul 4 13:51 001918.vlog
-rw------- 1 work work 70M Jul 5 16:51 002277.vlog
-rw------- 1 work work 70M Jul 4 14:41 001940.vlog
-rw------- 1 work work 70M Jul 6 12:41 002565.vlog
-rw------- 1 work work 70M Jul 4 09:00 001796.vlog
-rw------- 1 work work 70M Jul 4 16:51 001994.vlog
-rw------- 1 work work 70M Jul 4 13:40 001914.vlog
-rw------- 1 work work 70M Jul 4 09:11 001800.vlog
-rw------- 1 work work 70M Jul 6 12:51 002569.vlog
-rw------- 1 work work 70M Jul 4 15:50 001970.vlog
-rw------- 1 work work 70M Jul 4 16:20 001982.vlog
-rw------- 1 work work 70M Jul 5 16:01 002260.vlog
-rw------- 1 work work 70M Jul 4 16:30 001986.vlog
-rw------- 1 work work 70M Jul 4 16:10 001978.vlog
-rw------- 1 work work 70M Jul 5 19:01 002314.vlog
-rw------- 1 work work 70M Jul 5 15:01 002239.vlog
-rw------- 1 work work 70M Jul 5 17:01 002280.vlog
-rw------- 1 work work 70M Jul 5 11:41 002178.vlog
-rw------- 1 work work 70M Jul 5 16:31 002270.vlog
-rw------- 1 work work 70M Jul 5 16:21 002267.vlog
-rw------- 1 work work 70M Jul 5 09:41 002133.vlog
-rw------- 1 work work 70M Jul 5 21:41 002352.vlog
-rw------- 1 work work 70M Jul 4 17:01 001998.vlog
-rw------- 1 work work 70M Jul 4 17:21 002005.vlog
-rw------- 1 work work 70M Jul 3 20:00 001714.vlog
-rw------- 1 work work 70M Jul 5 21:01 002342.vlog
-rw------- 1 work work 70M Jul 3 20:51 001726.vlog
-rw------- 1 work work 71M Jul 5 14:01 002218.vlog
-rw------- 1 work work 71M Jul 4 08:50 001793.vlog
-rw------- 1 work work 71M Jul 4 12:01 001885.vlog
-rw------- 1 work work 71M Jul 5 12:41 002195.vlog
-rw------- 1 work work 71M Jul 4 17:40 002012.vlog
-rw------- 1 work work 71M Jul 3 12:21 001499.vlog
-rw------- 1 work work 71M Jul 4 12:51 001899.vlog
-rw------- 1 work work 71M Jul 3 18:51 001696.vlog
-rw------- 1 work work 71M Jul 5 12:21 002190.vlog
-rw------- 1 work work 71M Jul 3 19:20 001704.vlog
-rw------- 1 work work 71M Jul 4 12:11 001888.vlog
-rw------- 1 work work 71M Jul 2 17:31 001216.vlog
-rw------- 1 work work 71M Jul 4 18:20 002024.vlog
-rw------- 1 work work 72M Jul 4 10:50 001851.vlog
-rw------- 1 work work 146M Jul 6 11:50 036406.sst
-rw------- 1 work work 202M Jul 6 12:01 036593.sst
-rw------- 1 work work 249M Jul 6 10:01 034726.sst
-rw------- 1 work work 327M Jul 5 22:42 032620.sst
-rw------- 1 work work 422M Jul 5 11:21 029259.sst
-rw------- 1 work work 459M Jul 5 14:40 030111.sst
-rw------- 1 work work 501M Jul 6 10:01 034725.sst
-rw------- 1 work work 719M Jul 6 10:56 035595.sst
-rw------- 1 work work 1.1G Jul 6 10:12 034913.sst
-rw------- 1 work work 1.3G Jul 6 12:02 036595.sst
```
And the zero state is as follows:
{"counter":"27638547","groups":{"1":{"members":{"1":{"id":"1","groupId":1,"addr":"172.21.46.35:7080","leader":true,"lastUpdate":"1593611550"}},"tablets":{"chl":{"groupId":1,"predicate":"chl","space":"4695775026"},"dgraph.graphql.schema":{"groupId":1,"predicate":"dgraph.graphql.schema"},"dgraph.graphql.xid":{"groupId":1,"predicate":"dgraph.graphql.xid"},"dgraph.type":{"groupId":1,"predicate":"dgraph.type","space":"41421216460"},"extmsg":{"groupId":1,"predicate":"extmsg"},"from":{"groupId":1,"predicate":"from","space":"223127171705"},"net":{"groupId":1,"predicate":"net","space":"4220898548"},"title":{"groupId":1,"predicate":"title","space":"12325089358"},"to":{"groupId":1,"predicate":"to","space":"13864305792"},"withExpect":{"groupId":1,"predicate":"withExpect","space":"10265741211"}},"snapshotTs":"54649802","checksum":"3759641937265695043"},"2":{"members":{"2":{"id":"2","groupId":2,"addr":"172.21.46.43:7080","leader":true,"lastUpdate":"1593611555"}},"tablets":{"aid":{"groupId":2,"predicate":"aid","space":"12185695440"},"gid_gps":{"groupId":2,"predicate":"gid_gps","space":"2246383293"},"gid_ip":{"groupId":2,"predicate":"gid_ip","space":"4275951255"},"gid_user":{"groupId":2,"predicate":"gid_user","space":"10186607933"},"src":{"groupId":2,"predicate":"src","space":"9066055157"}},"snapshotTs":"54848420","checksum":"16815000284943703647"},"3":{"members":{"3":{"id":"3","groupId":3,"addr":"172.21.46.44:7080","leader":true,"lastUpdate":"1593611557"}},"tablets":{"gid_device":{"groupId":3,"predicate":"gid_device","space":"5483695993"},"gid_did":{"groupId":3,"predicate":"gid_did","space":"304280668"},"gid_idfa":{"groupId":3,"predicate":"gid_idfa","space":"1199109308"},"gid_wifi":{"groupId":3,"predicate":"gid_wifi","space":"1457055681"},"loc":{"groupId":3,"predicate":"loc","space":"16879986572"},"status":{"groupId":3,"predicate":"status","space":"354027403"},"t":{"groupId":3,"predicate":"t","space":"10332890221"}},"snapshotTs":"55094804","checksum":"3247556968549744124"},"4":{"members":{"4":{"id":"4","groupId":4,"addr":"172.21.46.59:7080","leader":true,"lastUpdate":"1593611559"}},"tablets":{"gid_brand":{"groupId":4,"predicate":"gid_brand","moveTs":"39"},"gid_job":{"groupId":4,"predicate":"gid_job","space":"3165219418"},"gid_mac":{"groupId":4,"predicate":"gid_mac","space":"1763035672"},"msg":{"groupId":4,"predicate":"msg","space":"176492074"},"pl":{"groupId":4,"predicate":"pl","space":"9285853378"},"withDID":{"groupId":4,"predicate":"withDID","space":"114209649"},"withDeviceId":{"groupId":4,"predicate":"withDeviceId","space":"14719347936"},"withIdfa":{"groupId":4,"predicate":"withIdfa","moveTs":"24"},"withIp":{"groupId":4,"predicate":"withIp","space":"14052854988"},"withMac":{"groupId":4,"predicate":"withMac","space":"3683865987"}},"snapshotTs":"54637590","checksum":"876440663310665151"},"5":{"members":{"5":{"id":"5","groupId":5,"addr":"172.21.46.65:7080","leader":true,"lastUpdate":"1593611561"}},"tablets":{"bg":{"groupId":5,"predicate":"bg","space":"10036594964"},"gid_company":{"groupId":5,"predicate":"gid_company","moveTs":"40"},"gid_expect":{"groupId":5,"predicate":"gid_expect","space":"5837675755"},"gid_interview":{"groupId":5,"predicate":"gid_interview","space":"174559486"},"withBrand":{"groupId":5,"predicate":"withBrand","moveTs":"28"},"withCompany":{"groupId":5,"predicate":"withCompany","moveTs":"29"},"withGps":{"groupId":5,"predicate":"withGps","space":"7782805485"},"withInterview":{"groupId":5,"predicate":"withInterview","space":"221288458"},"withJob":{"groupId":5,"predicate":"withJob","space":"
10772862262"},"withWifi":{"groupId":5,"predicate":"withWifi","space":"4211920663"}},"snapshotTs":"54489876","checksum":"11427293922357654662"}},"zeros":{"1":{"id":"1","addr":"172.21.46.35:5080","leader":true}},"maxLeaseId":"15000000000000020000","maxTxnTs":"55300000","maxRaftId":"5","cid":"dca06408-70ba-4f46-ae60-7df73f943dd9","license":{"maxNodes":"18446744073709551615","expiryTs":"1596203542","enabled":true}}
And here is the script I use to start the processes:
```bash
if [ $# -lt 2 ]; then
    echo 'too less input, at least 2'
    exit 1
fi

type=$1
offset=$2

echo "type:$type, offset:$offset"

path=./log
if [ ! -d $path ]; then
    mkdir $path
fi

replica=1
zero_1=172.21.46.35
dgpath=/data/app/Dgraph/bin/dgraph
uipath=/data/app/Dgraph/bin/dgraph-ratel
cfpath=/data/app/Dgraph/config/config.json

local_host=`hostname --fqdn`
local_ip=`/sbin/ifconfig -a|grep inet|grep -v 127.0.0.1|grep -v inet6|awk '{print $2}'|tr -d "addr:"`
echo "ip:$local_ip"

case $type in
zero1)
    index=`expr $offset + 1`
    port=`expr $offset + 5080`
    nohup $dgpath zero --enable_sentry=false --ludicrous_mode --my=$local_ip:$port -o $offset --idx $index --replicas $replica --config $cfpath --cwd zero$index --rebalance_interval 12000h > ./log/zero$index.log 2>&1 &
    ;;
zero)
    index=`expr $offset + 1`
    port=`expr $offset + 5080`
    nohup $dgpath zero --enable_sentry=false --ludicrous_mode --my=$local_ip:$port -o $offset --idx $index --replicas $replica --config $cfpath --cwd zero$index --peer $zero_1:5080 > ./log/zero$index.log 2>&1 &
    ;;
alpha)
    index=`expr $offset + 1`
    port=`expr $offset + 7080`
    nohup $dgpath alpha --enable_sentry=false --ludicrous_mode --snapshot_after=1000000 --abort_older_than 15m --my=$local_ip:$port -o $offset --zero=$zero_1:5080 --config $cfpath --cwd alpha$index > ./log/alpha$index.log 2>&1 &
    ;;
ratel)
    index=`expr $offset + 1`
    port=`expr $offset + 8000`
    nohup $uipath -addr "http://$local_ip:$port" -port $port > ./log/web$index.log 2>&1 &
    ;;
*)
    echo 'wrong type input'
    ;;
esac
```
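(For reference, with the script name assumed: invoking it as `sh start.sh alpha 0` would start an alpha with offset 0, i.e. serving on port 7080 and reporting to the zero at 172.21.46.35:5080, per the `alpha` case above.)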
@JimWen can you share the alpha logs as well?
Sorry, I have cleared all the data in order to run it without Ludicrous Mode. I noticed nothing abnormal except the crash log in that file.
@JimWen Can you save the big SST files and the logs next time it crashes? I'd like to look at the SST files that are over 500 MB. These files should be 65 MB by default. If the size gets anywhere near 4 GB, Badger will crash because of a uint32 overflow.
@jarifibrahim OK, I will save them next time.
@JimWen would you be able to share your RDFs? It might be possible that this crash is caused by some specific input. I don't know how plausible that is, but I'd like to explore that option as well. You can send them to Ibrahim[at]dgraph.io.
@jarifibrahim Sorry, I'm afraid I can't share this business data.
Hi @JimWen, we would like to set up a call with you to see if we can troubleshoot this problem further. Can I contact you by email to set up a session with @jarifibrahim?
@jarifibrahim Do we have any tools to help inspect the keys in an SST file? It also crashes when running without Ludicrous Mode. And is it normal for an SST file to be bigger than 65 MB?
@JimWen The `dgraph debug` tool can show the keys and their metadata. It's not normal to have an SST file larger than 65 MB. SSTs grow beyond 65 MB only if you have too many duplicate keys; to get a ~3 GB SST file, you would need a very large number of duplicate keys, which doesn't happen under normal usage.
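For a rough idea of what such an inspection does under the hood, here is a minimal sketch (not the `dgraph debug` implementation) that opens a Badger directory read-only and counts the versions per key to spot heavy duplication. It assumes the directory can be opened with `OpenManaged`, since Dgraph runs Badger in managed mode, and the path is hypothetical; holding all counts in a map is only practical for small stores.

```go
package main

import (
	"fmt"
	"math"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	// Open the posting directory read-only ("./p" is an assumption).
	db, err := badger.OpenManaged(badger.DefaultOptions("./p").WithReadOnly(true))
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Read at the highest possible timestamp so every key is visible.
	txn := db.NewTransactionAt(math.MaxUint64, false)
	defer txn.Discard()

	iopts := badger.DefaultIteratorOptions
	iopts.AllVersions = true // count every version, not just the latest
	it := txn.NewIterator(iopts)
	defer it.Close()

	// Count versions per key; only practical on small stores.
	counts := map[string]int{}
	for it.Rewind(); it.Valid(); it.Next() {
		counts[string(it.Item().Key())]++
	}
	for k, n := range counts {
		if n > 1000 { // arbitrary threshold for "too many duplicates"
			fmt.Printf("key %x has %d versions\n", k, n)
		}
	}
}
```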
Badger uses uint32 in multiple places, and a value bigger than 4 GB will cause the uint32 to overflow and cause a crash.
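To illustrate the overflow mechanism, here is a generic Go sketch (not Badger's actual code): converting a size beyond what uint32 can hold silently wraps modulo 2^32, which is how an offset can end up inconsistent with a buffer's real size.

```go
package main

import "fmt"

func main() {
	size := uint64(5) << 30          // a 5 GiB table, larger than uint32 can hold
	wrapped := uint32(size)          // silently truncated modulo 2^32
	fmt.Println(size, "->", wrapped) // 5368709120 -> 1073741824
}
```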
What version of Dgraph are you using?
```
Dgraph version   : v20.03.1
Dgraph SHA-256   : 6a40b1e084205ae9e29336780b3458a3869db45c0b96b916190881c16d705ba8
Commit SHA-1     : c201611d6
Commit timestamp : 2020-04-24 13:53:41 -0700
Branch           : HEAD
Go version       : go1.14.1
```
Have you tried reproducing the issue with the latest release?
Yes.
What is the hardware spec (RAM, OS)?
128 GB RAM & 1.8 TB SSD

```
Linux version 3.10.0-1062.9.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019
```
Steps to reproduce the issue (command/config used to run Dgraph).
Write data online through the Java API or live load. It works fine for the first few days but crashes after some days.
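For reference, the Go equivalent of such an online mutation looks roughly like this (a sketch using the dgo v2 client; the reporter used the Java client, and the endpoint and predicate here are assumptions):

```go
package main

import (
	"context"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Connect to an alpha's gRPC endpoint (address is an assumption).
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// One small online mutation; a write-heavy workload issues many of
	// these concurrently over several days.
	mu := &api.Mutation{
		SetNquads: []byte(`_:a <from> _:b .`),
		CommitNow: true,
	}
	if _, err := dg.NewTxn().Mutate(context.Background(), mu); err != nil {
		panic(err)
	}
}
```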
Expected behaviour and actual result.
The crash log is shown above; it seems that this is related to Badger.