hanswurscht closed this issue 7 years ago
This should be fixed in #20. Could you please try the latest develop branch? Thank you!
Hi,
the current develop branch crashes within a few minutes:
2017-06-08T10:06:50.507+02:00 [APP/PROC/WEB/1] [ERR] [2017-06-08 08:06:50.507202958 +0000 UTC] Exception occurred! Message: Normal Websocket Closure: %v Details: websocket: close 1000 (normal)
2017-06-08T10:06:50.507+02:00 [APP/PROC/WEB/1] [ERR] [2017-06-08 08:06:50.507309452 +0000 UTC] Exception occurred! Message: Closing connection with traffic controller due to %v Details: websocket: close 1000 (normal)
2017-06-08T10:06:50.507+02:00 [APP/PROC/WEB/1] [OUT] {"timestamp":"1496909210.507362366","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed connecting to Firehose","log_level":3,"data":{"error":"websocket: close 1000 (normal)","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc420018780, 0x845bab, 0x1d, 0xa331a0, 0xc4203d40c0, 0x0, 0x0, 0x0)\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:131 +0xc7\nmain.main()\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9\n"}}
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] panic(0x7c4660, 0xc4203d40c0)
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] /tmp/cache/go1.7.5/go/src/runtime/panic.go:500 +0x1a1
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] panic: websocket: close 1000 (normal)
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] goroutine 1 [running]:
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc420018780, 0x845bab, 0x1d, 0xa331a0, 0xc4203d40c0, 0x0, 0x0, 0x0)
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:152 +0x41c
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] main.main()
2017-06-08T10:06:50.533+02:00 [APP/PROC/WEB/1] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9
2017-06-08T10:06:50.535+02:00 [APP/PROC/WEB/1] [OUT] Exit status 2
2017-06-08T10:06:50.535+02:00 [APP/PROC/WEB/0] [ERR] [2017-06-08 08:06:50.535729455 +0000 UTC] Exception occurred! Message: Normal Websocket Closure: %v Details: websocket: close 1000 (normal)
2017-06-08T10:06:50.535+02:00 [APP/PROC/WEB/0] [ERR] [2017-06-08 08:06:50.535810809 +0000 UTC] Exception occurred! Message: Closing connection with traffic controller due to %v Details: websocket: close 1000 (normal)
2017-06-08T10:06:50.538+02:00 [APP/PROC/WEB/0] [OUT] {"timestamp":"1496909210.535969973","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Posting 50 events","log_level":1,"data":{}}
2017-06-08T10:06:50.538+02:00 [APP/PROC/WEB/0] [OUT] {"timestamp":"1496909210.535849810","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed connecting to Firehose","log_level":3,"data":{"error":"websocket: close 1000 (normal)","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc42006e720, 0x845bab, 0x1d, 0xa331a0, 0xc4203fc300, 0x0, 0x0, 0x0)\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:131 +0xc7\nmain.main()\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9\n"}}
2017-06-08T10:06:50.547+02:00 [CELL/1] [OUT] Exit status 0
2017-06-08T10:06:50.550+02:00 [CELL/1] [OUT] Destroying container
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] panic: websocket: close 1000 (normal)
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] panic(0x7c4660, 0xc4203fc300)
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc42006e720, 0x845bab, 0x1d, 0xa331a0, 0xc4203fc300, 0x0, 0x0, 0x0)
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] goroutine 1 [running]:
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] /tmp/cache/go1.7.5/go/src/runtime/panic.go:500 +0x1a1
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:152 +0x41c
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9
2017-06-08T10:06:50.566+02:00 [APP/PROC/WEB/0] [ERR] main.main()
Thanks!
It retried for a while and gave up after it failed to connect to the firehose ("splunk-nozzle-logger.Failed connecting to Firehose").
Please make sure the API_ENDPOINT is set correctly.
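As a quick sanity check of that setting, here is a minimal Go sketch that derives a host:port from the API_ENDPOINT environment variable and dials it over TLS. The helper name `hostPort`, the default port 443, and the 5-second timeout are assumptions for illustration; none of this is the nozzle's actual code.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"net/url"
	"os"
	"time"
)

// hostPort derives a dialable "host:port" from an API_ENDPOINT value
// such as "https://api.example.com", defaulting to port 443.
func hostPort(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil || u.Host == "" {
		return "", fmt.Errorf("invalid API_ENDPOINT %q", endpoint)
	}
	host := u.Host
	if _, _, splitErr := net.SplitHostPort(host); splitErr != nil {
		// No explicit port in the URL: assume HTTPS.
		host = net.JoinHostPort(host, "443")
	}
	return host, nil
}

func main() {
	addr, err := hostPort(os.Getenv("API_ENDPOINT"))
	if err != nil {
		fmt.Println(err)
		return
	}
	// Dial with a short timeout; a failure here points at the
	// environment (DNS, routing, firewall) rather than at the nozzle.
	conn, err := tls.DialWithDialer(&net.Dialer{Timeout: 5 * time.Second}, "tcp", addr, nil)
	if err != nil {
		fmt.Println("cannot reach", addr, "->", err)
		return
	}
	conn.Close()
	fmt.Println("reachable:", addr)
}
```

Run from inside the app container (where the env output below was captured) so the check sees the same network path the nozzle uses.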
@hanswurscht are you still seeing the "Failed connecting to Firehose" error?
Hi,
yes :( Today at 00:04 there was a similar error:
{"timestamp":"1497168265.662001848","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed connecting to Firehose","log_level":3,"data":{"error":"read tcp 10.10.10.10:12345-\u003e11.11.11.11:443: read: connection reset by peer","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc42006e720, 0x845bab, 0x1d, 0xa33220, 0xc4203db4e0, 0x0, 0x0, 0x0)\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:131 +0xc7\nmain.main()\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9\n"}}
Is this related to our PCF environment, or is it a problem within splunk-firehose-nozzle?
Thanks!
I didn't see this error in our lab for the dev branch.
May I ask how you run the nozzle? Can you telnet to the API endpoint? What is the output of the "env" command? And does the nozzle in the master branch work?
Thanks!
Hi,
in the last 24 hours there were 20 similar crashes with that error message. In between, it runs just fine.
Thanks!
The "read: connection reset by peer" in the error message suggests that at some point the sockets between the splunk-firehose-nozzle and the PCF environment get into a bad state, i.e. the nozzle is reading from a socket that has been closed at the other end.
If you could respond with the answers to Ken's questions, that would be great.
Hi,
It does indeed seem that the client connection is somehow being terminated.
2017-06-19T09:21:18.827+02:00 [APP/PROC/WEB/1] [OUT] {"timestamp":"1497856878.827364445","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed connecting to Firehose","log_level":3,"data":{"error":"read tcp 10.10.10.10:59150-\u003e11.11.11.11:443: read: connection reset by peer","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc420018780, 0x845bab, 0x1d, 0xa33220, 0xc4203b2340, 0x0, 0x0, 0x0)\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:131 +0xc7\nmain.main()\n\t/tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9\n"}}
2017-06-19T09:21:18.827+02:00 [APP/PROC/WEB/1] [ERR] [2017-06-19 07:21:18.827285262 +0000 UTC] Exception occurred! Message: Closing connection with traffic controller due to %v Details: read tcp 10.10.10.10:59150->11.11.11.11:443: read: connection reset by peer
2017-06-19T09:21:18.827+02:00 [APP/PROC/WEB/1] [ERR] [2017-06-19 07:21:18.827141318 +0000 UTC] Exception occurred! Message: Error while reading from the firehose: %v Details: read tcp 10.10.1.10:59150->11.11.11.11:443: read: connection reset by peer
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] goroutine 1 [running]:
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] panic: read tcp 10.10.10.10:59150->11.11.11.11:443: read: connection reset by peer
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] panic(0x7d85e0, 0xc4203b2340)
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager/logger.go:152 +0x41c
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] /tmp/tmp.8ikMzDxZC4/.go/src/github.com/cloudfoundry-community/splunk-firehose-nozzle/main.go:146 +0xcf9
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] /tmp/cache/go1.7.5/go/src/runtime/panic.go:500 +0x1a1
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] github.com/cloudfoundry-community/splunk-firehose-nozzle/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc420018780, 0x845bab, 0x1d, 0xa33220, 0xc4203b2340, 0x0, 0x0, 0x0)
2017-06-19T09:21:18.854+02:00 [APP/PROC/WEB/1] [ERR] main.main()
- How spaced out are the crashes?
- Do they happen about an hour apart or are they sporadic?
The crashes do not follow any time pattern; sometimes there are hours in between, and other times only 10 minutes...
- May I ask how you run the nozzle?
It runs as a Cloud Foundry app within the CF platform:
name: splunk-firehose-nozzle
requested state: started
instances: 2/2
usage: 256M x 2 instances
routes:
last uploaded: Thu 08 Jun 08:21:51 CEST 2017
stack: cflinuxfs2
buildpack: go_buildpack
state since cpu memory disk details
#0 running 2017-06-19T07:42:18Z 7.1% 9.9M of 256M 48.8M of 1G
#1 running 2017-06-19T07:21:19Z 6.8% 14.9M of 256M 48.8M of 1G
- Can you telnet to the API endpoint?
Yes, I can telnet to the API endpoint from inside the CF Splunk firehose app.
- What is the output of the "env" command?
I removed all sensitive information:
vcap@0ac7f550-25cc-48dc-736b-a7cbb479239e:~$ env
SPLUNK_INDEX=cf_dev
CF_INSTANCE_ADDR=**.**.**.**:60506
TERM=xterm
VCAP_APP_PORT=8080
USER=vcap
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow
=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01
;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31
:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=0
1;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.
svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=0
1;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.f
li=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*
.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;
36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:
CF_INSTANCE_INTERNAL_IP=10.11.12.13
VCAP_APPLICATION={"cf_api":"https://api...","limits":{"fds":16384,"mem":256,"disk":1024},"application_name":"splunk-firehos
e-nozzle","application_uris":[],"name":"splunk-firehose-nozzle","space_name":"splunk-firehose-nozzle","space_id":"5505b9b7-cbd3-41fe-b39a-5a
4c69f17f63","uris":[],"application_id":"bbc2c9e5-05a5-40d8-b1ed-e22dff7191ae","version":"1fb49a8f-a46f-45cf-8114-9a63da2e7b5e","application_
version":"1fb49a8f-a46f-45cf-8114-9a63da2e7b5e"}
CF_INSTANCE_GUID=0ac7f550-25cc-48dc-736b-a7cbb479239e
SPLUNK_TOKEN=<valid Splunk token>
SKIP_SSL_VALIDATION=false
API_ENDPOINT=https://<valid api endpoint>
PATH=/bin:/usr/bin
API_PASSWORD=<valid password>
PWD=/home/vcap
LANG=en_US.UTF-8
CF_INSTANCE_PORT=60506
VCAP_SERVICES={}
CF_INSTANCE_IP=***.***.***.***
CF_INSTANCE_INDEX=0
FOUNDATION=DEV
SHLVL=1
HOME=/home/vcap
API_USER=splunk-firehose
GOPACKAGENAME=github.com/cloudfoundry-community/splunk-firehose-nozzle
SPLUNK_HOST=https://<Splunk endpoint>
CF_INSTANCE_PORTS=[{"external":60506,"internal":8080},{"external":60507,"internal":2222}]
LESSOPEN=| /bin/lesspipe %s
INSTANCE_INDEX=0
PORT=8080
INSTANCE_GUID=0ac7f550-25cc-48dc-736b-a7cbb479239e
EVENTS=ValueMetric,CounterEvent,Error,LogMessage,HttpStartStop,ContainerMetric
ADD_APP_INFO=true
LESSCLOSE=/bin/lesspipe %s %s
MEMORY_LIMIT=256m
_=/usr/bin/env
- And does the nozzle in the master branch work?
It works better, but it seems to get into trouble from time to time. It looks like there is sometimes a chunk of data too big for the nozzle to forward, and it fails due to insufficient memory. On average it fails every four to five days and is not able to recover. The dev branch fails more often but can recover and keep sending data after a crash, whereas the master branch cannot.
Kind regards...
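On the memory symptom just described: one common mitigation is to cap how many events go into a single Splunk HEC post, so a sudden burst never builds one huge in-memory payload. A sketch of the idea under that assumption; `splitBatch` and the limit of 1000 are illustrative, not taken from the nozzle's code:

```go
package main

import "fmt"

// splitBatch slices a burst of events into fixed-size batches so each
// HTTP post to Splunk HEC stays bounded, regardless of burst size.
func splitBatch(events []string, max int) [][]string {
	var batches [][]string
	for len(events) > max {
		batches = append(batches, events[:max])
		events = events[max:]
	}
	if len(events) > 0 {
		batches = append(batches, events)
	}
	return batches
}

func main() {
	// Simulate a burst of 2500 events with a per-post cap of 1000.
	events := make([]string, 2500)
	for i := range events {
		events[i] = fmt.Sprintf("event-%d", i)
	}
	for i, b := range splitBatch(events, 1000) {
		fmt.Printf("batch %d: %d events\n", i, len(b))
	}
}
```

With a cap like this, peak memory scales with the batch size rather than with the burst size, which matches the 256M memory limit shown in the app details above.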
May I ask what EPS (events per second) the firehose publishes in your prod? And what is your deployment architecture? For example: 2 Splunk nozzles -> HFW (how many) -> indexers (how many)?
Hi,
typically around 2000 EPS, with peaks during the day of up to 4200.
The deployment architecture is as follows: 2 Splunk nozzles -> 1 HW load balancer (F5) -> 15 indexers.
Thanks!
@hanswurscht Could you please try the issue42-scale-up-nozzle branch: https://github.com/cloudfoundry-community/splunk-firehose-nozzle/tree/feature/issue42-scale-up-nozzle
Hi,
the issue42-scale-up-nozzle branch has now been running for 24 hours without any crash. Looking good so far!
Thanks!
@hanswurscht Thanks for verifying this. We have merged it into the dev branch.
Hi,
occasionally we see the splunk-firehose-nozzle crash. This is the stack trace: