GSA / data.gov

Main repository for the data.gov service
https://data.gov

Implement Grok Rules in logstack application #4234

Closed · nickumia-reisys closed 8 months ago

nickumia-reisys commented 1 year ago

User Story

In order to have better control of log streams in New Relic (NR), the Data.gov Response Team wants to implement application-specific routing rules in our datagov-logstack app.

Acceptance Criteria

Background

Security Considerations (required)

This is supposed to help with log review processes.

Sketch

Jin-Sun-tts commented 10 months ago

Both prod and staging use the logstack-shipper app in the management space; setting up one for staging only in the management-staging space for testing.

The cf-drain-cli plugin was deprecated. To install the drains plugin, download the release from https://github.com/cloudfoundry/cf-drain-cli/releases/tag/v2.0.0 and install it from the binary.
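
For reference, a minimal sketch of the install-from-binary flow, assuming the command names documented in the cf-drain-cli v2.0.0 README (the binary path and drain URL below are placeholders):

# Download the release binary for your platform from the tag above, then
# install it as a CF CLI plugin (-f skips the confirmation prompt).
cf install-plugin ./cf-drain-cli -f

# Bind a syslog drain to a single app (URL is a placeholder):
cf drain APP_NAME https://logstack-shipper.example.cloud.gov?drain-type=all

# Or drain every app in the current space; check `cf drain-space --help`
# for the exact flags in v2.0.0.
cf drain-space https://logstack-shipper.example.cloud.gov?drain-type=all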

Jin-Sun-tts commented 10 months ago

Pushed to management-staging and got the message to logstack-shipper, but with the following error: x_cf_routererror:"endpoint_failure (tls: failed to verify certificate: x509: certificate is not valid for any names

Jin-Sun-tts commented 10 months ago

By setting up the space drain in development, all app logs from development can now be sent to New Relic.

Dev/Staging/Prod currently use one logstash shipper from the management space. Ideally we would like to use a separate shipper for each environment, but we got a setup error like the one below:

OUT Using bundled JDK: /home/vcap/app/logstash-7.16.3/jdk
ERR OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
ERR ERROR: File not found for: file:///home/vcap/app/plugins.zip, message: Can't file local file /home/vcap/app/plugins.zip
OUT Installing Cloud Foundry root CA certificate...
OUT Installing certificates: /etc/cf-system-certificates/*
keytool error: java.io.FileNotFoundException: /etc/cf-system-certificates/* (No such file or directory)
OUT Invoking start command.
OUT Using bundled JDK: /home/vcap/app/logstash-7.16.3/jdk
ERR OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.

ERR warning: thread "Converge PipelineAction::Create<main>" terminated with exception (report_on_exception is true):
ERR LogStash::Error: Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`
ERR create at org/logstash/execution/ConvergeResultExt.java:135
ERR add at org/logstash/execution/ConvergeResultExt.java:60
ERR converge_state at /home/vcap/app/logstash-7.16.3/logstash-core/lib/logstash/agent.rb:396
[ERROR][logstash.agent           ] An exception happened when converging configuration {:exception=>LogStash::Error, :message=>"Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`"}
OUT [2023-11-22T17:59:27,912][FATAL][logstash.runner          ] An unexpected error occurred! {:error=>#<LogStash::Error: Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`>, :backtrace=>["org/logstash/execution/ConvergeResultExt.java:135:in `create'", "org/logstash/execution/ConvergeResultExt.java:60:in `add'", "/home/vcap/app/logstash-7.16.3/logstash-core/lib/logstash/agent.rb:396:in `block in converge_state'"]}
OUT [2023-11-22T17:59:27,918][FATAL][org.logstash.Logstash    ] Logstash stopped processing because of an error: (SystemExit) exit
OUT org.jruby.exceptions.SystemExit: (SystemExit) exit
OUT at org.jruby.RubyKernel.exit(org/jruby/RubyKernel.java:747) ~[jruby-complete-9.2.20.1.jar:?]
OUT at org.jruby.RubyKernel.exit(org/jruby/RubyKernel.java:710) ~[jruby-complete-9.2.20.1.jar:?]
OUT at home.vcap.app.logstash_minus_7_dot_16_dot_3.lib.bootstrap.environment.<main>(/home/vcap/app/logstash-7.16.3/lib/bootstrap/environment.rb:94) ~[?:?]
OUT Exit status 1

Jin-Sun-tts commented 9 months ago

Some related resources on parsing logs using Grok:
https://newrelic.com/blog/how-to-relic/how-to-use-grok-log-parsing
https://docs.newrelic.com/docs/logs/ui-data/parsing/

Jin-Sun-tts commented 9 months ago

After fixing the plugins and New Relic key issues, separate instances of the logstack-shipper app are now set up in each environment (development-ssb, management-staging, management). Currently in the process of testing a Grok rule within the development environment.

Jin-Sun-tts commented 9 months ago

the original message before any New Relic parsing rules: <14>1 2023-12-14T20:06:39.954251+00:00 gsa-datagov.development.catalog-proxy 0a6741c4-167a-43e3-bc3a-fda36e6a1bef [APP/PROC/WEB/0] - [tags@47450 app_id="0a6741c4-167a-43e3-bc3a-fda36e6a1bef" app_name="catalog-proxy" deployment="cf-production" index="e9142d48-5d18-4f21-a8ee-91d48a62cc84" instance_id="0" ip="10.10.1.9" job="diego-cell" organization_id="90047c5d-337f-4802-bd48-2149a4265040" organization_name="gsa-datagov" origin="rep" process_id="1913b784-daf0-4e83-9d33-e89b4c7b70c8" process_instance_id="d809bf52-9ed4-497d-5d0b-361d" process_type="web" source_id="0a6741c4-167a-43e3-bc3a-fda36e6a1bef" source_type="APP/PROC/WEB" space_id="eab3d327-7d9f-423b-9838-753c26fdb5a0" space_name="development"] NginxLog "POST /'https://catalog.data.gov'/%3Cz3 HTTP/1.1" 500 141

Grok rule: <%{INT:num}>%{POSINT:ver} %{TIMESTAMP_ISO8601:timestamp} %{DATA:host} %{UUID:proc_id} \[%{DATA:instance_info}\] - \[tags@%{INT:tag_id} app_id="%{UUID:app_id}" app_name="%{DATA:app_name}" deployment="%{DATA:deployment}" index="%{DATA:index}" instance_id="%{INT:instance_id}" ip="%{IP:ip}" job="%{DATA:job}" organization_id="%{UUID:organization_id}" organization_name="%{DATA:organization_name}" origin="%{DATA:origin}" process_id="%{UUID:process_id}" process_instance_id="%{DATA:process_instance_id}" process_type="%{DATA:process_type}" source_id="%{UUID:source_id}" source_type="%{DATA:source_type}" space_id="%{UUID:space_id}" space_name="%{DATA:space_name}"\] %{GREEDYDATA:log_data}

Grok output:

{
    "num": 14,
    "ver": 1,
    "timestamp": "2023-12-14T20:06:39.954251+00:00",
    "host": "gsa-datagov.development.catalog-proxy",
    "proc_id": "0a6741c4-167a-43e3-bc3a-fda36e6a1bef",
    "instance_info": "APP/PROC/WEB/0",
    "temp": 47450,
    "app_id": "0a6741c4-167a-43e3-bc3a-fda36e6a1bef",
    "app_name": "catalog-proxy",
    "deployment": "cf-production",
    "index": "e9142d48-5d18-4f21-a8ee-91d48a62cc84",
    "instance_id": 0,
    "ip": "10.10.1.9",
    "job": "diego-cell",
    "organization_id": "90047c5d-337f-4802-bd48-2149a4265040",
    "organization_name": "gsa-datagov",
    "origin": "rep",
    "process_id": "1913b784-daf0-4e83-9d33-e89b4c7b70c8",
    "process_instance_id": "d809bf52-9ed4-497d-5d0b-361d",
    "process_type": "web",
    "source_id": "0a6741c4-167a-43e3-bc3a-fda36e6a1bef",
    "source_type": "APP/PROC/WEB",
    "space_id": "eab3d327-7d9f-423b-9838-753c26fdb5a0",
    "space_name": "development",
    "log_data": "NginxLog \"POST /'https://catalog.data.gov'/%3Cz3 HTTP/1.1\" 500 141"
  }

Required fields:

{
    "timestamp": "2023-12-14T20:03:41.65677+00:00",
    "app_name": "catalog-proxy",
    "ip": "10.10.1.9",
    "space_name":"development",
    "instance_id": "APP/PROC/WEB/0",
    "log_data": 'NginxLog "POST /'https://catalog.data.gov'/%3Cz3 HTTP/1.1" 500 141'
}
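
For context, a minimal sketch of how this could sit in the logstack-shipper Logstash pipeline, assuming a standard grok + prune filter chain (the match pattern is the full rule above, abbreviated here with "..."; prune ships as the logstash-filter-prune plugin):

filter {
  grok {
    # Full syslog pattern as above, abbreviated for readability
    match => { "message" => "<%{INT:num}>%{POSINT:ver} %{TIMESTAMP_ISO8601:timestamp} %{DATA:host} ... %{GREEDYDATA:log_data}" }
  }
  # Keep only the fields we actually need to ship to New Relic
  prune {
    whitelist_names => ["timestamp", "app_name", "ip", "space_name", "instance_id", "log_data"]
  }
}
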
Jin-Sun-tts commented 9 months ago

Discussed with Fuhu: for CKAN log messages, there are more required fields that need to be extracted.

Format one: catalog-dev.data.gov - [2023-12-15T21:11:49.508628938Z] "GET /0000000 HTTP/1.1" 404 0 21445 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36" "127.0.0.1:47846" "10.10.2.10:61092" x_forwarded_for:"108.28.249.216, 64.252.66.176, 127.0.0.1" x_forwarded_proto:"https" vcap_request_id:"f31245e2-7980-4a82-77e9-3c4ca21703ae" response_time:0.037609 gorouter_time:0.000121 app_id:"0a6741c4-167a-43e3-bc3a-fda36e6a1bef" app_index:"0" instance_id:"f410370c-fc91-424b-451b-e47e" x_cf_routererror:"-" x_b3_traceid:"f31245e279804a8277e93c4ca21703ae" x_b3_spanid:"77e93c4ca21703ae" x_b3_parentspanid:"-" b3:"f31245e279804a8277e93c4ca21703ae-77e93c4ca21703ae"

**Fields to be extracted** (a candidate pattern is sketched after this list):

catalog-dev.data.gov
[2023-12-15T21:11:49.508628938Z]
"GET /0000000 HTTP/1.1" 404 0 21445 
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
x_forwarded_for:"108.28.249.216, 64.252.66.176, 127.0.0.1" 
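
A candidate grok pattern for format one, sketched against the sample line above (field names and the trailing %{GREEDYDATA} catch-all are assumptions, not the final rule):

%{HOSTNAME:host} - \[%{TIMESTAMP_ISO8601:timestamp}\] "%{WORD:method} %{NOTSPACE:path} HTTP/%{NUMBER:http_version}" %{INT:status} %{INT:bytes_received} %{INT:bytes_sent} "%{DATA:referrer}" "%{DATA:user_agent}" "%{DATA:remote_addr}" "%{DATA:backend_addr}" x_forwarded_for:"%{DATA:x_forwarded_for}" %{GREEDYDATA:rest}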

Format two: 2023-12-14 20:31:53,839 INFO [ckan.config.middleware.flask_app] 404 /dataset/aaaabbbb render time 0.023 seconds

**Fields to be extracted** (again, a candidate pattern follows the list):

2023-12-14 20:31:53,839 
INFO  
[ckan.config.middleware.flask_app]  
404 
/dataset/aaaabbbb 
render time 0.023 seconds
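
A candidate pattern for format two (again a sketch; TIMESTAMP_ISO8601 accepts the comma-separated milliseconds in this timestamp):

%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:logger}\] %{INT:status} %{NOTSPACE:path} render time %{NUMBER:render_time} seconds
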
Jin-Sun-tts commented 9 months ago

[Screenshots: extra parsed fields shown in the development space]

Those extra fields show up in the development space. Will review it with @FuhuXia to see if we need more.

Jin-Sun-tts commented 8 months ago

@FuhuXia I added a grok rule to separate the request; please check in the development space:

[Image: request fields separated by the new grok rule]

Jin-Sun-tts commented 8 months ago

Filtered the data based on HTTP status codes and generated a pie chart:

[Image: pie chart of HTTP status codes]

Jin-Sun-tts commented 8 months ago

Double-checked on development; those 404 errors are all irrelevant traffic.

[Screenshot: 404 traffic in development]

Jin-Sun-tts commented 8 months ago

Discussed with Fuhu; we will implement the following changes:

Jin-Sun-tts commented 8 months ago

The Nginx log messages were excluded and are no longer present in the New Relic logs. Additionally, the logs from logstack-shipper only contain entries associated with non-200 response codes.
PR: https://github.com/GSA/datagov-logstack/pull/45
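
For illustration, a minimal sketch of the kind of pipeline conditionals that produce this behavior, assuming a status field extracted by the access-log pattern above (the actual rule lives in the PR):

filter {
  # Drop Nginx access-log lines entirely
  if [log_data] =~ /^NginxLog / {
    drop { }
  }
  # Ship only non-200 responses (grok captures are strings, hence "200")
  if [status] == "200" {
    drop { }
  }
}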

FuhuXia commented 8 months ago

We can deploy the changes on prod and find the new crawler agent(s) responsible for the recent increase in tracking traffic. Previously this could only be done via lengthy CloudFront log processing.

Jin-Sun-tts commented 8 months ago

A couple of changes need to be made to the grok rules:

Also, only keep the fields raw_message_content and log_data in the development space, for debugging purposes.

Reference for the field name: https://djangocas.dev/blog/nginx/nginx-access-log-with-real-x-forwarded-for-ip-instead-of-proxy-ip/

FuhuXia commented 8 months ago

Another big benefit of this grok change is that we can set a grok rule to ignore certain white-noise logs. As of now, 78% of all logs (based on December 2023 data: 958,729,889 of 1,216,483,493) are logstack-shipper POST/200 logs that offer zero value to us. NR log queries will be faster without them.

logstash-prod-datagov.app.cloud.gov - [2023-12-01T15:12:55.421915605Z] 
"POST /?drain-type=all HTTP/1.1" 200 761 2 "-" "fasthttp" 
...
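
A sketch of what such an ignore rule could look like in the pipeline, assuming the host/method/status fields from the gorouter pattern above (the match criteria here are an illustration, not the deployed rule):

filter {
  # Drop the shipper's own "POST /?drain-type=all" 200s (~78% of volume)
  if [host] =~ /^logstash-/ and [method] == "POST" and [status] == "200" {
    drop { }
  }
}
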
Jin-Sun-tts commented 8 months ago

All modified fields are in development now:

[Screenshot: modified fields in the development space]

Also added logic to remove the fields raw_message_content and log_data in non-development environments; they only show up in development for debugging purposes. A sketch of that logic follows.
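
A minimal sketch of that removal logic, assuming it keys off the grok-extracted space_name (the real implementation may use an environment variable instead):

filter {
  # Debug-only fields: keep them in development, strip them everywhere else
  if [space_name] != "development" {
    mutate {
      remove_field => ["raw_message_content", "log_data"]
    }
  }
}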

Jin-Sun-tts commented 8 months ago

Fixed the deploy-related issues and opened PR https://github.com/GSA/datagov-logstack/pull/48. @FuhuXia

Jin-Sun-tts commented 8 months ago

Confirmed in New Relic: all changes are in prod and the other environments now.