Closed nickumia-reisys closed 8 months ago
both prod
and staging
use the logstack-shipper app in the management
space, setting up one for staging only in the management-staging
space for testing.
cf-drain-cli plugin
was deprecated. to install drains
plugin, download from https://github.com/cloudfoundry/cf-drain-cli/releases/tag/v2.0.0 and install it from binary.
pushed to management-staging
, got message to logstack-shipper
, but with following error:
x_cf_routererror:"endpoint_failure (tls: failed to verify certificate: x509: certificate is not valid for any names
By setting up the space-drain in development, all apps log from development can be sent to New Relic now.
Dev/Staging/Prod currently use one logstash shipper from management
space.
Ideally we would like use separate shipper from each environments, but we got setup error like below:
OUT Using bundled JDK: /home/vcap/app/logstash-7.16.3/jdk
ERR OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
ERR ERROR: File not found for: file:///home/vcap/app/plugins.zip, message: Can't file local file /home/vcap/app/plugins.zip
OUT Installing Cloud Foundry root CA certificate...
OUT Installing certificates: /etc/cf-system-certificates/*
keytool error: **java.io.FileNotFoundException: /etc/cf-system-certificates/*** (No such file or directory)
OUT Invoking start command.
OUT Using bundled JDK: /home/vcap/app/logstash-7.16.3/jdk
ERR OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
ERR warning: thread "Converge PipelineAction::Create main " terminated with exception (report_on_exception is true):
ERR LogStash::Error: **Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`**
ERR create at org/logstash/execution/ConvergeResultExt.java:135
ERR add at org/logstash/execution/ConvergeResultExt.java:60
ERR converge_state at **/home/vcap/app/logstash-7.16.3/logstash-core/lib/logstash/agent.rb**:396
[ERROR][logstash.agent ] An exception happened when converging configuration {:exception=>LogStash::Error, :message=>"Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`"}
OUT [2023-11-22T17:59:27,912][FATAL][logstash.runner ] An unexpected error occurred! {:error=>#<LogStash::Error: Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`>, :backtrace=>["org/logstash/execution/ConvergeResultExt.java:135:in `create'", "org/logstash/execution/ConvergeResultExt.java:60:in `add'", "/home/vcap/app/logstash-7.16.3/logstash-core/lib/logstash/agent.rb:396:in `block in converge_state'"]}
OUT [2023-11-22T17:59:27,918][FATAL][org.logstash.Logstash ] Logstash stopped processing because of an error: (SystemExit) exit
OUT org.jruby.exceptions.SystemExit: (SystemExit) exit
OUT at org.jruby.RubyKernel.exit(org/jruby/RubyKernel.java:747) ~[jruby-complete-9.2.20.1.jar:?]
OUT at org.jruby.RubyKernel.exit(org/jruby/RubyKernel.java:710) ~[jruby-complete-9.2.20.1.jar:?]
OUT at home.vcap.app.logstash_minus_7_dot_16_dot_3.lib.bootstrap.environment.<main>(/home/vcap/app/logstash-7.16.3/lib/bootstrap/environment.rb:94) ~[?:?]
OUT Exit status 1
some related topics to parse log using Grok: https://newrelic.com/blog/how-to-relic/how-to-use-grok-log-parsing https://docs.newrelic.com/docs/logs/ui-data/parsing/
After fixed plugins and new_relic key etc issues, now set up separate instances of the logstack-shipper app in each environment (development-ssb management-staging management
). Currently in the process of testing a Grok rule within the development environment.
the original message before any New Relic parsing rules:
<14>1 2023-12-14T20:06:39.954251+00:00 gsa-datagov.development.catalog-proxy 0a6741c4-167a-43e3-bc3a-fda36e6a1bef [APP/PROC/WEB/0] - [tags@47450 app_id="0a6741c4-167a-43e3-bc3a-fda36e6a1bef" app_name="catalog-proxy" deployment="cf-production" index="e9142d48-5d18-4f21-a8ee-91d48a62cc84" instance_id="0" ip="10.10.1.9" job="diego-cell" organization_id="90047c5d-337f-4802-bd48-2149a4265040" organization_name="gsa-datagov" origin="rep" process_id="1913b784-daf0-4e83-9d33-e89b4c7b70c8" process_instance_id="d809bf52-9ed4-497d-5d0b-361d" process_type="web" source_id="0a6741c4-167a-43e3-bc3a-fda36e6a1bef" source_type="APP/PROC/WEB" space_id="eab3d327-7d9f-423b-9838-753c26fdb5a0" space_name="development"] NginxLog "POST /'https://catalog.data.gov'/%3Cz3 HTTP/1.1" 500 141
Grok rule:
<%{INT:num}>%{POSINT:ver} %{TIMESTAMP_ISO8601:timestamp} %{DATA:host} %{UUID:proc_id} \[%{DATA:instance_info}\] - \[tags@%{INT:tag_id} app_id="%{UUID:app_id}" app_name="%{DATA:app_name}" deployment="%{DATA:deployment}" index="%{DATA:index}" instance_id="%{INT:instance_id}" ip="%{IP:ip}" job="%{DATA:job}" organization_id="%{UUID:organization_id}" organization_name="%{DATA:organization_name}" origin="%{DATA:origin}" process_id="%{UUID:process_id}" process_instance_id="%{DATA:process_instance_id}" process_type="%{DATA:process_type}" source_id="%{UUID:source_id}" source_type="%{DATA:source_type}" space_id="%{UUID:space_id}" space_name="%{DATA:space_name}"\] %{GREEDYDATA:log_data}
Grok output:
{
"num": 14,
"ver": 1,
"timestamp": "2023-12-14T20:06:39.954251+00:00",
"host": "gsa-datagov.development.catalog-proxy",
"proc_id": "0a6741c4-167a-43e3-bc3a-fda36e6a1bef",
"instance_info": "APP/PROC/WEB/0",
"temp": 47450,
"app_id": "0a6741c4-167a-43e3-bc3a-fda36e6a1bef",
"app_name": "catalog-proxy",
"deployment": "cf-production",
"index": "e9142d48-5d18-4f21-a8ee-91d48a62cc84",
"instance_id": 0,
"ip": "10.10.1.9",
"job": "diego-cell",
"organization_id": "90047c5d-337f-4802-bd48-2149a4265040",
"organization_name": "gsa-datagov",
"origin": "rep",
"process_id": "1913b784-daf0-4e83-9d33-e89b4c7b70c8",
"process_instance_id": "d809bf52-9ed4-497d-5d0b-361d",
"process_type": "web",
"source_id": "0a6741c4-167a-43e3-bc3a-fda36e6a1bef",
"source_type": "APP/PROC/WEB",
"space_id": "eab3d327-7d9f-423b-9838-753c26fdb5a0",
"space_name": "development",
"log_data": "NginxLog \"POST /'https://catalog.data.gov'/%3Cz3 HTTP/1.1\" 500 141"
}
Required fields:
{
"timestamp": "2023-12-14T20:03:41.65677+00:00",
"app_name": "catalog-proxy",
"ip": "10.10.1.9",
"space_name":"development",
"instance_id": "APP/PROC/WEB/0",
"log_data": 'NginxLog "POST /'https://catalog.data.gov'/%3Cz3 HTTP/1.1" 500 141'
}
discussed with Fuhu, for CKAN log message, there are more required field need to be extracted:
Format one:
catalog-dev.data.gov - [2023-12-15T21:11:49.508628938Z] "GET /0000000 HTTP/1.1" 404 0 21445 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36" "127.0.0.1:47846" "10.10.2.10:61092" x_forwarded_for:"108.28.249.216, 64.252.66.176, 127.0.0.1" x_forwarded_proto:"https" vcap_request_id:"f31245e2-7980-4a82-77e9-3c4ca21703ae" response_time:0.037609 gorouter_time:0.000121 app_id:"0a6741c4-167a-43e3-bc3a-fda36e6a1bef" app_index:"0" instance_id:"f410370c-fc91-424b-451b-e47e" x_cf_routererror:"-" x_b3_traceid:"f31245e279804a8277e93c4ca21703ae" x_b3_spanid:"77e93c4ca21703ae" x_b3_parentspanid:"-" b3:"f31245e279804a8277e93c4ca21703ae-77e93c4ca21703ae"
**Fields need to be extracted:**
catalog-dev.data.gov
[2023-12-15T21:11:49.508628938Z]
"GET /0000000 HTTP/1.1" 404 0 21445
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
x_forwarded_for:"108.28.249.216, 64.252.66.176, 127.0.0.1"
Format two:
2023-12-14 20:31:53,839 INFO [ckan.config.middleware.flask_app] 404 /dataset/aaaabbbb render time 0.023 seconds
**Fields need to be extracted:**
2023-12-14 20:31:53,839
INFO
[ckan.config.middleware.flask_app]
404
/dataset/aaaabbbb
render time 0.023 seconds
Those extra fields show up in development space. Will review it with @FuhuXia to see if we need more.
@FuhuXia I added grok rule to separate the request, please check on development space:
Filter the data based on HTTP status codes and generate a pie chart
Double checked on development, those 404 error are all irrelevant traffic.
Discussed with Fuhu, we will implement the following changes:
The Nginx log messages were excluded and are no longer present in the New Relic logs. Additionally, the log from logstack-shipper only contains logs associated with non-200 response codes.
PR: https://github.com/GSA/datagov-logstack/pull/45
We can deploy the changes on prod and find the new crawler agent(s) which is responsible for the recent increased tracking traffic. Previously it can only be done via lengthy cloudfront log processing.
couple of changes need to modified from the grok rules:
two separate fields: http_referer, http_user_agent
change some fields name: http_status -> status request_uri -> request request_size -> bytes_sent response_size -> bytes_received
separate forward ips to two fields: real_ip, forward_ips
Also, only keep fields raw_message_content, log_data in development space for debug purpose.
Reference for the field name: https://djangocas.dev/blog/nginx/nginx-access-log-with-real-x-forwarded-for-ip-instead-of-proxy-ip/
Another big benefit from this grok change is that we can set grok rule to ignore certain white-noise logs. As of now 78% of all logs (based of December 2023 data, 958,729,889 of 1,216,483,493) are logstack-shipper POST/200 logs that offer 0 value to us. NR log query will be faster without them.
logstash-prod-datagov.app.cloud.gov - [2023-12-01T15:12:55.421915605Z]
"POST /?drain-type=all HTTP/1.1" 200 761 2 "-" "fasthttp"
...
All modified fields are in development now:
Also added logic to remove fields raw_message_content, log_data in other non-development envs. It only shows up in development for debug purpose.
Fixed the deploy related issues and have PR https://github.com/GSA/datagov-logstack/pull/48 here. @FuhuXia
confirmed from New Relic, all changes are in prod and other environments now.
User Story
In order to have better control of log streams in NR, the Data.gov Response Team wants to implement application-specific routing rules in our datagov-logstack app.
Acceptance Criteria
Background
Security Considerations (required)
This is supposed to help with log review processes.
Sketch