cloudfoundry-community / splunk-firehose-nozzle

Send CF component metrics, CF app logs, and CF app metrics to Splunk
Apache License 2.0

Occasional crash #24

Closed · hanswurscht closed this issue 7 years ago

hanswurscht commented 7 years ago

Hi,

occasionally we see splunk-firehose-nozzle crash. This is the stacktrace:

fatal error: bad pointer in write barrier
            /tmp/cache/go1.7.4/go/src/runtime/mbarrier.go:168 +0x36
runtime.systemstack(0xc42001e000)
runtime.mstart()
goroutine 1 [running]:
            /tmp/cache/go1.7.4/go/src/runtime/asm_amd64.s:252 fp=0xc420237798 sp=0xc420237790
runtime.heapBitsBulkBarrier(0xc420778048, 0x8)
            /tmp/cache/go1.7.4/go/src/runtime/mbitmap.go:618 +0xcf fp=0xc420237818 sp=0xc4202377b8
runtime.typedmemmove(0x7c9d20, 0xc420778048, 0xa74ec0)
            /tmp/cache/go1.7.4/go/src/runtime/mbarrier.go:183 +0x65 fp=0xc420237850 sp=0xc420237818
            /tmp/cache/go1.7.4/go/src/runtime/slice.go:136 +0x1b3 fp=0xc4202378e0 sp=0xc420237850
github.com/cloudfoundry-community/splunk-firehose-nozzle/firehoseclient.(*FirehoseNozzle).routeEvent(0xc420237ea0, 0x86fb28, 0xc420014a40)
main.main()
            /tmp/tmp.UQ8nIP77aU/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:132 +0xa0f fp=0xc420237f48 sp=0xc420237ca8
runtime.main()
            /tmp/cache/go1.7.4/go/src/runtime/proc.go:183 +0x1f4 fp=0xc420237fa0 sp=0xc420237f48
            /tmp/cache/go1.7.4/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc420237fa8 sp=0xc420237fa0
runtime.goexit()
            /tmp/cache/go1.7.4/go/src/runtime/asm_amd64.s:2086 +0x1
github.com/cloudfoundry-community/splunk-firehose-nozzle/drain.(*LoggingSplunk).consume(0xc420014a40)
created by github.com/cloudfoundry-community/splunk-firehose-nozzle/drain.(*LoggingSplunk).Connect
github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/github.com/cloudfoundry/noaa/consumer.(*Consumer).listenForMessages(0xc42029e1e0, 0xc4202a72e0, 0xc4202aa560, 0x0, 0x0)
            /tmp/tmp.UQ8nIP77aU/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/github.com/cloudfoundry/noaa/consumer/async.go:197 +0xbe
            /tmp/tmp.UQ8nIP77aU/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/github.com/cloudfoundry/noaa/consumer/async.go:243 +0x112
            /tmp/tmp.UQ8nIP77aU/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/github.com/cloudfoundry/noaa/consumer/async.go:179 +0xf7
chenziliang commented 7 years ago

This should be fixed by #20. Could you please try the latest develop branch? Thank you!

hanswurscht commented 7 years ago

Hi,

the current develop branch crashes after only a few minutes:

2017-06-08T10:06:50.507+02:00 [APP/PROC/WEB/1] [ERR] [2017-06-08 08:06:50.507202958 +0000 UTC] Exception occurred! Message: Normal Websocket Closure: %v Details: websocket: close 1000 (normal)
2017-06-08T10:06:50.507+02:00 [APP/PROC/WEB/1] [ERR] [2017-06-08 08:06:50.507309452 +0000 UTC] Exception occurred! Message: Closing connection with traffic controller due to %v Details: websocket: close 1000 (normal)
2017-06-08T10:06:50.507+02:00 [APP/PROC/WEB/1] [OUT] {"timestamp":"1496909210.507362366","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed connecting to Firehose","log_level":3,"data":{"error":"websocket: close 1000 (normal)","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc420018780, 0x845bab, 0x1d, 0xa331a0, 0xc4203d40c0, 0x0, 0x0, 0x0)\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:131 +0xc7\nmain.main()\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9\n"}}
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] panic(0x7c4660, 0xc4203d40c0)
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] /tmp/cache/go1.7.5/go/src/runtime/panic.go:500 +0x1a1
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] panic: websocket: close 1000 (normal)
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] goroutine 1 [running]:
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc420018780, 0x845bab, 0x1d, 0xa331a0, 0xc4203d40c0, 0x0, 0x0, 0x0)
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:152 +0x41c
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] main.main()
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9
2017-06-08T10:06:50.535+02:00 [APP/PROC/WEB/1] [OUT] Exit status 2
2017-06-08T10:06:50.535+02:00 [APP/PROC/WEB/0] [ERR] [2017-06-08 08:06:50.535729455 +0000 UTC] Exception occurred! Message: Normal Websocket Closure: %v Details: websocket: close 1000 (normal)
2017-06-08T10:06:50.535+02:00 [APP/PROC/WEB/0] [ERR] [2017-06-08 08:06:50.535810809 +0000 UTC] Exception occurred! Message: Closing connection with traffic controller due to %v Details: websocket: close 1000 (normal)
2017-06-08T10:06:50.538+02:00 [APP/PROC/WEB/0] [OUT] {"timestamp":"1496909210.535969973","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Posting 50 events","log_level":1,"data":{}}
2017-06-08T10:06:50.538+02:00 [APP/PROC/WEB/0] [OUT] {"timestamp":"1496909210.535849810","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed connecting to Firehose","log_level":3,"data":{"error":"websocket: close 1000 (normal)","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc42006e720, 0x845bab, 0x1d, 0xa331a0, 0xc4203fc300, 0x0, 0x0, 0x0)\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:131 +0xc7\nmain.main()\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9\n"}}
2017-06-08T10:06:50.547+02:00 [CELL/1] [OUT] Exit status 0
2017-06-08T10:06:50.550+02:00 [CELL/1] [OUT] Destroying container
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] panic: websocket: close 1000 (normal)
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] panic(0x7c4660, 0xc4203fc300)
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc42006e720, 0x845bab, 0x1d, 0xa331a0, 0xc4203fc300, 0x0, 0x0, 0x0)
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] goroutine 1 [running]:
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] /tmp/cache/go1.7.5/go/src/runtime/panic.go:500 +0x1a1
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:152 +0x41c
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] main.main()

Thanks!

chenziliang commented 7 years ago

It retried for some time and gave up after failing to connect to the firehose ("splunk-nozzle-logger.Failed connecting to Firehose").

Please make sure the API_ENDPOINT is set correctly.
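
For context on why the process exits instead of retrying forever: the trace above ends in lager's Fatal call from main.main, which logs the error and then panics. A minimal sketch of that failure mode, assuming the vendored code.cloudfoundry.org/lager API; the error value here is illustrative:

```go
package main

import (
	"errors"
	"os"

	"code.cloudfoundry.org/lager"
)

func main() {
	// Same logger source name that appears in the JSON log lines above.
	logger := lager.NewLogger("splunk-nozzle-logger")
	logger.RegisterSink(lager.NewWriterSink(os.Stdout, lager.INFO))

	// Illustrative error standing in for the websocket/firehose failure.
	err := errors.New("websocket: close 1000 (normal)")

	// lager's Fatal logs the error at log_level 3 and then panics, which
	// is why the app log shows the JSON error line followed by a Go panic
	// and "Exit status 2".
	logger.Fatal("Failed connecting to Firehose", err)
}
```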

chenziliang commented 7 years ago

@hanswurscht are you still seeing the "Failed connecting to Firehose" error?

hanswurscht commented 7 years ago

Hi,

yes :( Today at 00:04 there was a similar error:

{"timestamp":"1497168265.662001848","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed connecting to Firehose","log_level":3,"data":{"error":"read tcp 10.10.10.10:12345-\u003e11.11.11.11:443: read: connection reset by peer","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc42006e720, 0x845bab, 0x1d, 0xa33220, 0xc4203db4e0, 0x0, 0x0, 0x0)\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:131 +0xc7\nmain.main()\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9\n"}}

Is this related to our PCF environment, or is it a problem within splunk-firehose-nozzle?

Thanks!

chenziliang commented 7 years ago

I haven't seen this error in our lab with the dev branch.

May I ask how you run the nozzle? Can you telnet to the API endpoint? What is the output of the "env" command? And does the nozzle on the master branch work?

Thanks !

hanswurscht commented 7 years ago

Hi,

in the last 24 hours there were 20 similar crashes related to that error message. In between, it runs just fine.

Thanks!

dtregonning commented 7 years ago

"read: connection reset by peer" in the error message suggests that at some point the sockets between the splunk-firehose-nozzle and the PCF environment get into a bad state, i.e. the nozzle is writing to a socket that has been closed at the other end.

If you could respond with the answers to Ken's questions, that would be great.
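
One way to survive such resets is to reconnect with backoff instead of exiting. A minimal sketch of that pattern, where connectAndConsume is a hypothetical stand-in for the nozzle's connect-and-drain logic, not a function in this repository:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// connectAndConsume is a hypothetical placeholder: it should open the
// firehose connection and block while draining events, returning an error
// when the connection drops (e.g. "read: connection reset by peer").
func connectAndConsume() error {
	return errors.New("read: connection reset by peer") // simulated drop
}

func main() {
	backoff := time.Second
	const maxBackoff = 2 * time.Minute

	for {
		err := connectAndConsume()
		if err == nil {
			backoff = time.Second // clean return: reset the backoff
			continue
		}

		// Instead of a fatal exit, log the error and retry with
		// exponential backoff so transient resets don't kill the app.
		log.Printf("firehose connection dropped: %v; retrying in %s", err, backoff)
		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```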

hanswurscht commented 7 years ago

Hi,

It seems indeed that the client connection somehow gets terminated.

2017-06-19T09:21:18.827+02:00 [APP/PROC/WEB/1] [OUT] {"timestamp":"1497856878.827364445","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed connecting to Firehose","log_level":3,"data":{"error":"read tcp 10.10.10.10:59150-\u003e11.11.11.11:443: read: connection reset by peer","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc420018780, 0x845bab, 0x1d, 0xa33220, 0xc4203b2340, 0x0, 0x0, 0x0)\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:131 +0xc7\nmain.main()\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9\n"}}
2017-06-19T09:21:18.827+02:00 [APP/PROC/WEB/1] [ERR] [2017-06-19 07:21:18.827285262 +0000 UTC] Exception occurred! Message: Closing connection with traffic controller due to %v Details: read tcp 10.10.10.10:59150->11.11.11.11:443: read: connection reset by peer
2017-06-19T09:21:18.827+02:00 [APP/PROC/WEB/1] [ERR] [2017-06-19 07:21:18.827141318 +0000 UTC] Exception occurred! Message: Error while reading from the firehose: %v Details: read tcp 10.10.1.10:59150->11.11.11.11:443: read: connection reset by peer
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] goroutine 1 [running]:
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] panic: read tcp 10.10.10.10:59150->11.11.11.11:443: read: connection reset by peer
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] panic(0x7d85e0, 0xc4203b2340)
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:152 +0x41c
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] /tmp/cache/go1.7.5/go/src/runtime/panic.go:500 +0x1a1
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc420018780, 0x845bab, 0x1d, 0xa33220, 0xc4203b2340, 0x0, 0x0, 0x0)
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] main.main()
  • How spaced out are the crashes?
  • Do they happen about an hour apart or are they sporadic?

The crashes do not follow any time pattern; sometimes there are hours in between, and other times only 10 minutes...

  • May I ask how you run the nozzle?

It runs as a Cloud Foundry app within the CF platform:

name:              splunk-firehose-nozzle
requested state:   started
instances:         2/2
usage:             256M x 2 instances
routes:
last uploaded:     Thu 08 Jun 08:21:51 CEST 2017
stack:             cflinuxfs2
buildpack:         go_buildpack

     state     since                  cpu    memory          disk          details
#0   running   2017-06-19T07:42:18Z   7.1%   9.9M of 256M    48.8M of 1G
#1   running   2017-06-19T07:21:19Z   6.8%   14.9M of 256M   48.8M of 1G
  • Can you telnet to the API endpoint?

Yes, I can telnet to the API endpoint from inside the CF splunk-firehose-nozzle app.

  • What is the output of the "env" command?

I have removed all sensitive information:

vcap@0ac7f550-25cc-48dc-736b-a7cbb479239e:~$ env
SPLUNK_INDEX=cf_dev
CF_INSTANCE_ADDR=**.**.**.**:60506
TERM=xterm
VCAP_APP_PORT=8080
USER=vcap
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow
=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01
;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31
:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=0
1;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.
svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=0
1;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.f
li=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*
.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;
36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:
CF_INSTANCE_INTERNAL_IP=10.11.12.13
VCAP_APPLICATION={"cf_api":"https://api...","limits":{"fds":16384,"mem":256,"disk":1024},"application_name":"splunk-firehos
e-nozzle","application_uris":[],"name":"splunk-firehose-nozzle","space_name":"splunk-firehose-nozzle","space_id":"5505b9b7-cbd3-41fe-b39a-5a
4c69f17f63","uris":[],"application_id":"bbc2c9e5-05a5-40d8-b1ed-e22dff7191ae","version":"1fb49a8f-a46f-45cf-8114-9a63da2e7b5e","application_
version":"1fb49a8f-a46f-45cf-8114-9a63da2e7b5e"}
CF_INSTANCE_GUID=0ac7f550-25cc-48dc-736b-a7cbb479239e
SPLUNK_TOKEN=<valid Splunk token>
SKIP_SSL_VALIDATION=false
API_ENDPOINT=https://<valid api endpoint>
PATH=/bin:/usr/bin
API_PASSWORD=<valid password>
PWD=/home/vcap
LANG=en_US.UTF-8
CF_INSTANCE_PORT=60506
VCAP_SERVICES={}
CF_INSTANCE_IP=***.***.***.***
CF_INSTANCE_INDEX=0
FOUNDATION=DEV
SHLVL=1
HOME=/home/vcap
API_USER=splunk-firehose
GOPACKAGENAME=github.com/cloudfoundry-community/splunk-firehose-nozzle
SPLUNK_HOST=https://<Splunk endpoint>
CF_INSTANCE_PORTS=[{"external":60506,"internal":8080},{"external":60507,"internal":2222}]
LESSOPEN=| /bin/lesspipe %s
INSTANCE_INDEX=0
PORT=8080
INSTANCE_GUID=0ac7f550-25cc-48dc-736b-a7cbb479239e
EVENTS=ValueMetric,CounterEvent,Error,LogMessage,HttpStartStop,ContainerMetric
ADD_APP_INFO=true
LESSCLOSE=/bin/lesspipe %s %s
MEMORY_LIMIT=256m
_=/usr/bin/env
  • And does the nozzle on the master branch work?

It works better, but it still gets into trouble from time to time. It looks like there is sometimes a chunk of data that is too big for the nozzle to forward, and it fails due to insufficient memory. On average it fails every four to five days and is not able to recover. The dev branch fails more often, but it is able to recover and keep sending data after a crash, whereas master cannot.
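
That out-of-memory symptom is what an unbounded in-memory event buffer would produce under a burst. A minimal sketch of bounding the buffer with a fixed-capacity channel so bursts are dropped and counted instead of exhausting memory; the Event type and post function are hypothetical stand-ins, not the nozzle's real types:

```go
package main

import (
	"fmt"
	"time"
)

// Event is a hypothetical stand-in for a firehose envelope.
type Event struct{ Payload string }

// post is a hypothetical stand-in for the Splunk HEC batch POST.
func post(batch []Event) { fmt.Printf("posted %d events\n", len(batch)) }

func main() {
	// A fixed-capacity channel bounds memory: a burst larger than the
	// buffer is dropped (and counted) instead of growing the heap.
	queue := make(chan Event, 10000)
	dropped := 0

	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		batch := make([]Event, 0, 1000)
		for {
			select {
			case ev := <-queue:
				batch = append(batch, ev)
				if len(batch) == cap(batch) {
					post(batch)
					batch = batch[:0]
				}
			case <-ticker.C:
				if len(batch) > 0 {
					post(batch)
					batch = batch[:0]
				}
			}
		}
	}()

	// Producer side: a non-blocking send so a slow Splunk endpoint
	// cannot back the nozzle up into an out-of-memory crash.
	for i := 0; i < 50000; i++ {
		select {
		case queue <- Event{Payload: "msg"}:
		default:
			dropped++
		}
	}

	time.Sleep(3 * time.Second)
	fmt.Println("dropped:", dropped)
}
```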

Kind regards...

chenziliang commented 7 years ago

May I ask what EPS (events per second) the firehose publishes in your prod environment? And what is your deployment architecture? For example: 2 Splunk nozzles -> HFW (how many) -> indexers (how many)?

hanswurscht commented 7 years ago

Hi,

typically around 2000 EPS, with peaks during the day of up to 4200.

The deployment architecture is as follows: 2 Splunk nozzles -> 1 HW load balancer (F5) -> 15 indexers.

Thanks!

chenziliang commented 7 years ago

@hanswurscht Could you please try the issue42-scale-up-nozzle branch? https://github.com/cloudfoundry-community/splunk-firehose-nozzle/tree/feature/issue42-scale-up-nozzle

hanswurscht commented 7 years ago

Hi,

the issue42-scale-up-nozzle branch has now been running for 24 hours without any crash. Looking good so far!

Thanks!

chenziliang commented 7 years ago

@hanswurscht Thanks for verifying this. We have merged it into the dev branch.