fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

GBK Test Data Handling Issue #9107

Open opencmit2 opened 1 month ago

opencmit2 commented 1 month ago

Bug Report

Describe the bug

Issue Content:

GBK Test Data Source

echo 小明 | iconv -f utf8 -t gbk  >> /tmp/test.log 

Fluent Bit Configuration File (Event_Format json)

[INPUT]
    Name         tail
    Tag          dummy.local
    Path         /tmp/test.log
[FILTER]
    Name wasm
    match *
    Event_Format json
    WASM_Path /data/flb312/etc/filter.wasm
    Function_Name go_filter
    accessible_paths .
[OUTPUT]
    Name  stdout
    Match *

Handling Code and Output with Event_Format Set to json

//export go_filter
func go_filter(tag *uint8, tag_len uint, time_sec uint, time_nsec uint, record *uint8, record_len uint) *uint8 {
        // View the record handed over by Fluent Bit as a byte slice (no copy).
        brecord := unsafe.Slice(record, record_len)
        fmt.Println(brecord) // [123 34 108 111 103 34 58 34 208 161 238 131 131 34 125]
        var p fastjson.Parser
        value, err := p.Parse(string(brecord))
        if err != nil {
                fmt.Printf("Error parsing JSON: %v\n", err)
                return nil
        }
        // Raw bytes of the "log" field as delivered in the JSON payload.
        logValue := value.GetStringBytes("log")
        fmt.Printf("%v\n", logValue) // [208 161 238 131 131]
        return nil
}

Fluent Bit Configuration File (Event_Format msgpack)

[INPUT]
    Name         tail
    Tag          dummy.local
    Path         /tmp/test.log
[FILTER]
    Name wasm
    match *
    Event_Format msgpack
    WASM_Path /data/flb312/etc/filter.wasm
    Function_Name go_filter
    accessible_paths .
[OUTPUT]
    Name  stdout
    Match *

Handling Code and Output with Event_Format Set to msgpack

//export go_filter
func go_filter(tag *uint8, tag_len uint, time_sec uint, time_nsec uint, record *uint8, record_len uint) *uint8 {
        // View the record handed over by Fluent Bit as a byte slice (no copy).
        brecord := unsafe.Slice(record, record_len)
        fmt.Println(brecord) // [129 163 108 111 103 164 208 161 195 247]
        var logData map[string]interface{}
        if err := msgpack.Unmarshal(brecord, &logData); err != nil {
                panic(err)
        }
        // Raw bytes of the "log" field as delivered in the MessagePack payload.
        if logStr, ok := logData["log"].(string); ok {
                fmt.Printf("%v\n", []byte(logStr)) // [208 161 195 247]
        }
        return nil
}

When Event_Format is set to JSON, the byte slice is [208 161 238 131 131].

When Event_Format is set to MessagePack, the byte slice is [208 161 195 247].

Only the byte slice [208 161 195 247] can be successfully transcoded from GBK to UTF-8. It matches the GBK bytes written by the echo command above (小 is 0xD0 0xA1, 明 is 0xC3 0xF7). I suspect that Fluent Bit performs additional processing on the record when Event_Format is set to JSON.
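
One way to compare the two slices is to check them against the UTF-8 rules. The following is a minimal sketch using only the Go standard library; `describeUTF8` is a hypothetical helper name chosen for illustration and is not part of Fluent Bit or the Wasm filter API. It only inspects the bytes reported above and makes no claim about how Fluent Bit produces them.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// describeUTF8 (hypothetical helper) reports whether b is valid UTF-8 and
// lists the code points Go's decoder yields for it. Each byte that cannot
// be decoded shows up as U+FFFD.
func describeUTF8(b []byte) (bool, string) {
	var cps []string
	for _, r := range string(b) {
		cps = append(cps, fmt.Sprintf("U+%04X", r))
	}
	return utf8.Valid(b), strings.Join(cps, " ")
}

func main() {
	samples := []struct {
		name string
		data []byte
	}{
		{"msgpack", []byte{208, 161, 195, 247}},   // bytes seen with Event_Format msgpack
		{"json", []byte{208, 161, 238, 131, 131}}, // bytes seen with Event_Format json
	}
	for _, s := range samples {
		valid, cps := describeUTF8(s.data)
		fmt.Printf("%-8s valid UTF-8: %-5v code points: %s\n", s.name, valid, cps)
	}
}
```

The msgpack bytes are not valid UTF-8 (195 247 is not a well-formed sequence), while the JSON bytes are: 238 131 131 decodes to the single private-use code point U+E0C3. This is consistent with the JSON path having rewritten the last two GBK bytes, which would explain why a GBK-to-UTF-8 conversion no longer works on that slice.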

Screenshots

[Screenshot: output with Event_Format set to JSON]

[Screenshot: output with Event_Format set to MessagePack]

cosmo0920 commented 1 month ago

Only the byte slice [208 161 195 247] can be successfully transcoded from GBK to UTF-8.

First of all, GBK is compatible with UTF-8 only in the ASCII range, just like cp932. GBK-encoded logs are therefore not compatible with the Wasm mechanism, which assumes UTF-8.

To create the msgpack payload, we pass the record through as-is and only add the additional metadata. To create the JSON payload, we escape the content so that it is valid JSON.

I also confirmed that handling the record as msgpack is not affected by the encoding. Currently, we don't support non-UTF-8 encodings. In the meantime, could you use the msgpack format for processing your non-UTF-8 payloads?
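
To illustrate what "as-is" means for the msgpack payload, here is a minimal sketch that pulls the raw bytes of the `log` key out of the record shown in the report, using only the Go standard library. `readStr` and `rawField` are hypothetical helpers written for this example; they handle only a fixmap whose keys and values are fixstr, str 8, or bin 8, and a real filter would more likely rely on a MessagePack library as the report does.

```go
package main

import (
	"errors"
	"fmt"
)

// readStr (hypothetical helper) reads one fixstr, str 8, or bin 8 value at
// the start of buf and returns its raw payload plus the remaining bytes.
func readStr(buf []byte) (payload, rest []byte, err error) {
	if len(buf) == 0 {
		return nil, nil, errors.New("unexpected end of input")
	}
	var n, hdr int
	switch b := buf[0]; {
	case b&0xe0 == 0xa0: // fixstr: 101xxxxx, length in the low 5 bits
		n, hdr = int(b&0x1f), 1
	case b == 0xd9 || b == 0xc4: // str 8 or bin 8: one length byte follows
		if len(buf) < 2 {
			return nil, nil, errors.New("truncated length")
		}
		n, hdr = int(buf[1]), 2
	default:
		return nil, nil, fmt.Errorf("unsupported type byte 0x%02x", b)
	}
	if len(buf) < hdr+n {
		return nil, nil, errors.New("truncated value")
	}
	return buf[hdr : hdr+n], buf[hdr+n:], nil
}

// rawField (hypothetical helper) walks a fixmap of string pairs and returns
// the undecoded bytes stored under key.
func rawField(record []byte, key string) ([]byte, error) {
	if len(record) == 0 || record[0]&0xf0 != 0x80 {
		return nil, errors.New("record is not a fixmap")
	}
	pairs := int(record[0] & 0x0f)
	rest := record[1:]
	for i := 0; i < pairs; i++ {
		k, afterKey, err := readStr(rest)
		if err != nil {
			return nil, err
		}
		v, afterVal, err := readStr(afterKey)
		if err != nil {
			return nil, err
		}
		if string(k) == key {
			return v, nil
		}
		rest = afterVal
	}
	return nil, fmt.Errorf("key %q not found", key)
}

func main() {
	// Record bytes printed by the filter with Event_Format msgpack.
	record := []byte{129, 163, 108, 111, 103, 164, 208, 161, 195, 247}
	raw, err := rawField(record, "log")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(raw) // the GBK bytes, untouched, ready for a GBK decoder
}
```

Because the value is copied out byte for byte, the filter can hand it to whatever GBK-to-UTF-8 conversion it prefers; no UTF-8 assumption is involved at this stage.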

duj4 commented 3 weeks ago

Hi @cosmo0920, is there any plan to add an encoding/decoding function to the INPUT plugin so that non-UTF-8 logs can be converted beforehand? Many of our applications support only GB-2312, and their log files have to be converted to UTF-8 before Fluent Bit can process them.

cosmo0920 commented 3 weeks ago

I'm still considering this kind of encoding conversion. My Windows environment mostly uses Shift-JIS (cp932), so I'm hitting this issue too, and it is one of the reasons the migration from Fluentd to Fluent Bit has not progressed much here. Fluentd provides a convenient way to convert from other encodings to UTF-8. This issue has turned out to be considerably larger than we expected.

duj4 commented 3 weeks ago

Thanks @cosmo0920 for the reply.

Yes, I found that Fluentd supports this as well, which is why I asked whether it could be brought over here. Good to know that it is not "abandoned" yet :D