fluent-plugins-nursery / fluent-plugin-systemd

This is a fluentd input plugin. It reads logs from the systemd journal.
Apache License 2.0
153 stars 43 forks source link

Log messages are split into 2048 byte chunks #52

Closed spacecowboy closed 6 years ago

spacecowboy commented 6 years ago

Our system is printing JSON-formatted logs and in some cases the message inside of the JSON object can be relatively long. In that case, fluentd-systemd appears to be splitting the long entry into 2048 byte pieces.

The same thing was the case for syslog but they added an option it: https://docs.fluentd.org/v1.0/articles/in_syslog#message_length_limit

errm commented 6 years ago

It seems the default size of a systemd journal field should be 64K https://github.com/ledbettj/systemd-journal/blob/master/lib/systemd/journal.rb#L175

Seems I need to dig a little deeper ... I am going to try and reproduce ...

@spacecowboy what exactly is the behaviour you are seeing... is it truncation of a particular systemd field at 2K or truncation of the outputted fluentd messages?

spacecowboy commented 6 years ago

fluentd config

so this is the fluentd config we have added (other config files are default)

# All logs from systemd
<source>
    @type systemd
    path /run/log/journal
    pos_file /var/log/gcp-journald-all.pos
    read_from_head true
    tag systemd
    logName systemd
</source>

# Retag worker logs
<match systemd>
  @type rewrite_tag_filter
  rewriterule1 _SYSTEMD_UNIT neo4j-backup-worker systemd.neo4j-backup-worker
</match>

# Parse JSON logs
<filter systemd.neo4j-backup-worker>
    @type parser
    format json
    key_name MESSAGE
    reserve_data false
</filter>

# Attach environment property to everything
<filter **>
    @type record_transformer
    <record>
        neo4jcloud.environment "#{ENV['ENVIRONMENT']}"
        neo4jcloud.role "#{ENV['ROLE']}"
        neo4jcloud.shard "#{ENV['SHARD']}"
    </record>
</filter>

# Rewrite the tag on systemd messages to include the unit name
<filter systemd.**>
    @type record_transformer
    enable_ruby true
    tag ${(record["_SYSTEMD_UNIT"] || "journald").gsub("@","_at_").gsub(".service","")}
    logName ${(record["_SYSTEMD_UNIT"] || "journald").gsub("@","_at_").gsub(".service","")}
</filter>

And that is getting processed for each split message. The jsonPayload in StackDriver looks like this:

Log entry 1

jsonPayload: {
  MESSAGE: "{"asctime": "2018-03-08 11:14:17,281", "name": "root", "pathname": "/usr/local/miniconda/lib/python3.6/site-packages/neo4jcloud_adminapi/shell.py", "lineno": 35, "message": "Doing full backup...\n2018-03-08 11:14:12.459+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.nodestore.db.labels\n2018-03-08 11:14:12.542+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.nodestore.db.labels 8.00 kB\n2018-03-08 11:14:12.542+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.nodestore.db\n2018-03-08 11:14:12.611+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.nodestore.db 8.00 kB\n2018-03-08 11:14:12.611+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.propertystore.db.index.keys\n2018-03-08 11:14:12.617+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.propertystore.db.index.keys 15.96 kB\n2018-03-08 11:14:12.617+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.propertystore.db.index\n2018-03-08 11:14:12.622+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.propertystore.db.index 8.00 kB\n2018-03-08 11:14:12.622+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.propertystore.db.strings\n2018-03-08 11:14:12.627+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.propertystore.db.strings 8.00 kB\n2018-03-08 11:14:12.627+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.propertystore.db.arrays\n2018-03-08 11:14:12.634+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.propertystore.db.arrays 24.00 kB\n2018-03-08 11:14:12.635+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.propertystore.db\n2018-03-08 11:14:12.639+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.propertystore.db 15.94 kB\n2018-03-08 11:14:12.640+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.relationshipstore.db\n2018-03-08 11:14:12.645+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.relationshipstore.db 15.94 kB\n2018-03-08 11:14:12.645+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.relationshiptypestore.db.names\n2018-03-08 11:14:12.653+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.relationshiptypes"   
  PRIORITY: "6"   
  SYSLOG_FACILITY: "3"   
  SYSLOG_IDENTIFIER: "neo4jcloud-async-worker"   
  _BOOT_ID: "e8a9f63a27b7400fa34905db6cff5910"   
  _CAP_EFFECTIVE: "0"   
  _CMDLINE: "python3.6 -u -m neo4jcloud_adminapi.database_worker"   
  _COMM: "python3.6"   
  _EXE: "/usr/local/miniconda/bin/python3.6"   
  _GID: "115"   
  _HOSTNAME: "ip-172-31-27-171"   
  _MACHINE_ID: "bb6f625fbb9c4c21b9e4c1d7aeb2d532"   
  _PID: "2548"   
  _SYSTEMD_CGROUP: "/system.slice/neo4j-backup-worker.service"   
  _SYSTEMD_INVOCATION_ID: "2d5fc99a86c0474d83727f650256cb5e"   
  _SYSTEMD_SLICE: "system.slice"   
  _SYSTEMD_UNIT: "neo4j-backup-worker.service"   
  _TRANSPORT: "stdout"   
  _UID: "111"   
  logName: "neo4j-backup-worker"   
  neo4jcloud.environment: "tc-xaezei0a"   
  neo4jcloud.role: "backup_worker"   
  neo4jcloud.shard: "0"   
 }

Log entry 2

 jsonPayload: {
  MESSAGE: "tore.db.names 15.96 kB\n2018-03-08 11:14:12.653+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.relationshiptypestore.db\n2018-03-08 11:14:12.658+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.relationshiptypestore.db 8.00 kB\n2018-03-08 11:14:12.659+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.labeltokenstore.db.names\n2018-03-08 11:14:12.664+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.labeltokenstore.db.names 15.96 kB\n2018-03-08 11:14:12.665+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.labeltokenstore.db\n2018-03-08 11:14:12.669+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.labeltokenstore.db 8.00 kB\n2018-03-08 11:14:12.672+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.schemastore.db\n2018-03-08 11:14:12.679+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.schemastore.db 8.00 kB\n2018-03-08 11:14:12.679+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.relationshipgroupstore.db\n2018-03-08 11:14:12.683+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.relationshipgroupstore.db 15.97 kB\n2018-03-08 11:14:12.683+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.counts.db.a\n2018-03-08 11:14:12.685+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.counts.db.a 8.00 kB\n2018-03-08 11:14:12.685+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.labelscanstore.db\n2018-03-08 11:14:12.692+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.labelscanstore.db 48.00 kB\n2018-03-08 11:14:12.692+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore\n2018-03-08 11:14:12.697+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore 23.99 kB\n2018-03-08 11:14:12.697+0000 INFO [o.n.c.s.StoreCopyClient] Done, copied 17 files\n2018-03-08 11:14:12.780+0000 INFO [o.n.b.BackupService] Start recovering store\n2018-03-08 11:14:16.121+0000 INFO [o.n.b.BackupService] Finish recovering store\nDoing consistency check...\n2018-03-08 11:14:16.400+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Selected RecordFormat:StandardV3_2[v0.A.8] record format from store /tmp/d5f0c631.2018-03-08T11"   
  PRIORITY: "6"   
  SYSLOG_FACILITY: "3"   
  SYSLOG_IDENTIFIER: "neo4jcloud-async-worker"   
  _BOOT_ID: "e8a9f63a27b7400fa34905db6cff5910"   
  _CAP_EFFECTIVE: "0"   
  _CMDLINE: "python3.6 -u -m neo4jcloud_adminapi.database_worker"   
  _COMM: "python3.6"   
  _EXE: "/usr/local/miniconda/bin/python3.6"   
  _GID: "115"   
  _HOSTNAME: "ip-172-31-27-171"   
  _MACHINE_ID: "bb6f625fbb9c4c21b9e4c1d7aeb2d532"   
  _PID: "2548"   
  _SYSTEMD_CGROUP: "/system.slice/neo4j-backup-worker.service"   
  _SYSTEMD_INVOCATION_ID: "2d5fc99a86c0474d83727f650256cb5e"   
  _SYSTEMD_SLICE: "system.slice"   
  _SYSTEMD_UNIT: "neo4j-backup-worker.service"   
  _TRANSPORT: "stdout"   
  _UID: "111"   
  logName: "neo4j-backup-worker"   
  neo4jcloud.environment: "tc-xaezei0a"   
  neo4jcloud.role: "backup_worker"   
  neo4jcloud.shard: "0"   
 }

Log entry 3

jsonPayload: {
  MESSAGE: ":14:08.959918+00:00\n2018-03-08 11:14:16.400+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Format not configured. Selected format from the store: RecordFormat:StandardV3_2[v0.A.8]\n2018-03-08 11:14:16.449+0000 INFO [o.n.m.MetricsExtension] Initiating metrics...\n....................  10%\n....................  20%\n....................  30%\n....................  40%\n....................  50%\n....................  60%\n....................  70%\n....................  80%\n....................  90%\n...................Checking node and relationship counts\n....................  10%\n....................  20%\n....................  30%\n....................  40%\n....................  50%\n....................  60%\n....................  70%\n....................  80%\n....................  90%\n.................... 100%\nBackup complete.\n", "exc_info": null, "severity": "INFO", "message_id": "47686498436282", "check_consistency": "True", "latency": "0.991261", "backup_id": "2018-03-08T11:14:08.959918+00:00", "backup_bucket": "backup.tc-xaezei0a.neo4j.io", "action": "backup", "dbid": "d5f0c631", "timestamp": "2018-03-08T11:14:09.155303Z"}"   
  PRIORITY: "6"   
  SYSLOG_FACILITY: "3"   
  SYSLOG_IDENTIFIER: "neo4jcloud-async-worker"   
  _BOOT_ID: "e8a9f63a27b7400fa34905db6cff5910"   
  _CAP_EFFECTIVE: "0"   
  _CMDLINE: "python3.6 -u -m neo4jcloud_adminapi.database_worker"   
  _COMM: "python3.6"   
  _EXE: "/usr/local/miniconda/bin/python3.6"   
  _GID: "115"   
  _HOSTNAME: "ip-172-31-27-171"   
  _MACHINE_ID: "bb6f625fbb9c4c21b9e4c1d7aeb2d532"   
  _PID: "2548"   
  _SYSTEMD_CGROUP: "/system.slice/neo4j-backup-worker.service"   
  _SYSTEMD_INVOCATION_ID: "2d5fc99a86c0474d83727f650256cb5e"   
  _SYSTEMD_SLICE: "system.slice"   
  _SYSTEMD_UNIT: "neo4j-backup-worker.service"   
  _TRANSPORT: "stdout"   
  _UID: "111"   
  logName: "neo4j-backup-worker"   
  neo4jcloud.environment: "tc-xaezei0a"   
  neo4jcloud.role: "backup_worker"   
  neo4jcloud.shard: "0"   
 }

Note that the MESSAGE attribute of each message (except maybe the last) is exactly 2048 bytes long (I copied it to a file and ran wc -m on it)

The output in this case is coming from a single logging.info call in Python (where we have a JSONFormater attached to the logging framework) and you can verify that it really is coming from a single logging call in python because the messages all make up a single JSON object.

The python process is running inside a systemd service.

Thanks for looking into it @errm . Let me know if I can provide additional details

errm commented 6 years ago

Hi @spacecowboy this seems quite strange since I don't think the truncation is happening in fluentd, but if you have the default configuration systemd shouldn't start truncating the message until it reaches 64K....

Do you see the same truncation looking at the same entries with another tool e.g. journalctl?

spacecowboy commented 6 years ago

Hmm. I re-ran it to reproduce, and strangely enough I DO see the chunking in the systemd journal for the service. So this is actually not a problem with this plugin.

Thanks for helping out though @errm , sorry for wasting your time.

errm commented 6 years ago

No problem... thanks for taking the time to report this anyway!