djimenez / iconv-go

iconv support for Go
BSD 2-Clause "Simplified" License

Would you please add a func like this: DecodeLastRune? #3

Open hardPass opened 11 years ago

hardPass commented 11 years ago

This func (https://golang.org/pkg/unicode/utf8/#DecodeLastRune) is very convenient and useful. Would you please add to iconv a func like this: DecodeLastRuneByCharset(p []byte, charset string) (r rune, size int)

djimenez commented 11 years ago

Maybe I'm missing the use case, but that seems like a bad idea to me. The function you propose would be transcoding the source byte slice every call - isn't it better to transcode it once into utf-8 / utf-16 and then use the rune functions as normal?
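
For example, a sketch of the transcode-once approach (runesOf is just an illustrative name, not part of this package, and it assumes the usual import iconv "github.com/djimenez/iconv-go"):

// runesOf is illustrative only: transcode the whole slice once up front,
// then every standard rune operation works on the result as normal.
func runesOf(p []byte, charset string) ([]rune, error) {
    converted, err := iconv.ConvertString(string(p), charset, "utf-8")
    if err != nil {
        return nil, err
    }
    return []rune(converted), nil
}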

hardPass commented 11 years ago

It's common in log-analysis systems and some EDI applications. The p []byte can be very large, such as 1GB, coming from a big log file or a large EDI message, so our job-splitting server slices it into 100 small pieces of 10MB each, and our distributed application nodes then handle the small pieces concurrently and swiftly. A log commonly uses \n as its delimiter, and an EDI message has its own delimiter according to the business logic. To ensure the delimiter is the last rune of a small []byte, we could use a func like DecodeLastRune to check just a few runes at the tail, one by one, instead of transcoding the whole file. Sometimes we check from the tail, and sometimes from the head of the next segment, where a func like DecodeRune is useful. I hope you can understand my sloppy English.
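
To make the use case concrete, the tail check would be a sketch like this, written against the proposed DecodeLastRuneByCharset (which does not exist yet):

// trimToDelimiter is only a sketch against the *proposed* API: trim runes
// off the tail of a chunk until it ends exactly at the delimiter, so only a
// handful of runes are decoded instead of transcoding the whole chunk.
func trimToDelimiter(chunk []byte, charset string, delim rune) []byte {
    for len(chunk) > 0 {
        r, size := iconv.DecodeLastRuneByCharset(chunk, charset) // proposed, not yet real
        if r == delim {
            return chunk
        }
        chunk = chunk[:len(chunk)-size]
    }
    return nil // no delimiter anywhere in this chunk
}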

djimenez commented 11 years ago

I'd like to suggest an alternative solution to your problem. The bufio.Scanner type is designed for your use case (http://golang.org/pkg/bufio/#Scanner). You can use it to split the large data as a byte stream in the source encoding or you can chain it with iconv.Reader to transcode the data as you go and deal with it in your target encoding.

For splitting on newlines of a transcoded stream it would look similar to this (error handling omitted):

log := ... source io.Reader ...
converter, _ := iconv.NewConverter("cp1252", "utf-8")
scanner := bufio.NewScanner(iconv.NewReaderFromConverter(log, converter))

for scanner.Scan() {
    line := scanner.Text()

    // do something with line which is in utf-8 encoding
}
hardPass commented 11 years ago

This scanning approach still traverses every byte of the file; it is cheap in memory, but inefficient and unnecessary. Let me think through the whole logic for a while.

djimenez commented 11 years ago

You could use Seek or ReadFrom on the log if it's something like os.File, but to do this in a meaningful way I think you're going to have to look for your delimiters in their source encoding. So, if you had a very large file that you've already parsed through most of and have the position of the last delimiter parsed previously, it might look like:

delimiter := "\n"
logEncoding := "ebcdic"

log, _ := os.Open("my.log")
// seek to our previous position, or could be used to seek to a position relative to the end
log.Seek(12345, 0)

// convert our target delimiter to the source byte encoding
encodedDelimiter, _ := iconv.ConvertString(delimiter, "utf-8", logEncoding)
encodedDelimiterBytes := []byte(encodedDelimiter)

// read through the log ourselves looking for delimiters and buffering lines, or use bufio.Scanner
// and its ability to take a custom Split function, as sketched below
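
For the custom Split route, the sketch might continue like this (it reuses log and encodedDelimiterBytes from above, needs "bufio" and "bytes", and error handling is still omitted):

scanner := bufio.NewScanner(log)
scanner.Split(func(data []byte, atEOF bool) (advance int, token []byte, err error) {
    // split on the delimiter as it appears in the source encoding
    if i := bytes.Index(data, encodedDelimiterBytes); i >= 0 {
        return i + len(encodedDelimiterBytes), data[:i], nil
    }
    if atEOF && len(data) > 0 {
        return len(data), data, nil // final segment without a trailing delimiter
    }
    return 0, nil, nil // ask the Scanner to read more data
})

for scanner.Scan() {
    segment := scanner.Bytes()

    // do something with segment, which is still in the source encoding
}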

I'm not sure how much trouble you might run into with files that use shift encoding (I don't often deal with them myself).

Anyway, maybe I'm still missing the point - what do you think iconv.DecodeLastRune would look like? Simplest would be this:

func DecodeLastRune(p []byte, charset string) (rune, int) {
    return DecodeLastRuneInString(string(p), charset)
}

func DecodeLastRuneInString(p string, charset string) (rune, int) {
    // convert the whole input to a utf-8 string
    converted, err := ConvertString(p, charset, "utf-8")
    if err != nil {
        return utf8.RuneError, 1
    }
    return utf8.DecodeLastRuneInString(converted)
}

You wouldn't want to call that in a loop looking for a delimiter from a Reader, though - every call transcodes the entire input again.

hardPass commented 11 years ago

I do need to look for the last delimiter. And it's a really good idea:

encodedDelimiter, _ := iconv.ConvertString(delimiter, "utf-8", logEncoding)
encodedDelimiterBytes := []byte(encodedDelimiter)

Actually, some of our applications already run this way. But it still has a problem: it only works when the encoding is utf-8 or another encoding whose multi-byte sequences never contain bytes in the ASCII range. When the charset is GBK, there is a trap. Let me show you.


package main

import (
    "fmt"

    iconv "github.com/djimenez/iconv-go"
)

func main() {
    delimiter := '~'
    utf_str := "DD*\u4e8a*anything~"
    gbk_str, _ := iconv.ConvertString(utf_str, "utf-8", "GBK")
    fmt.Printf("delimiter ~ :%d \n", delimiter)
    fmt.Println("utf_str:", []byte(utf_str))
    fmt.Println("gbk_str:", []byte(gbk_str))
}

output:
delimiter ~ :126 
utf_str: [68 68 42 228 186 138 42 97 110 121 116 104 105 110 103 *126*]
gbk_str: [68 68 42 129 *126*  42 97 110 121 116 104 105 110 103 *126*]

It is very common to use ~ as the delimiter in X12 EDI messages, and you can see that the byte value of ~ is 126. But gbk_str contains two 126 bytes even though there is only one ~ in the string: the GBK encoding of \u4e8a is the two bytes 129 126 (0x81 0x7E), so its trail byte collides with the delimiter. Searching for ~ byte by byte in a GBK file is therefore unreliable. This trap cannot happen with utf-8 encoding, because every byte of a utf-8 multi-byte sequence is larger than 127.
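
For what it's worth, one way around the trap is to scan forward and skip the trail byte of every two-byte GBK character, so a 126 inside a character like \u4e8a is never mistaken for ~. A sketch (not part of iconv-go, and it assumes an ASCII delimiter):

// indexGBKDelimiter is a sketch: find a one-byte ASCII delimiter in GBK data
// without being fooled by trail bytes. A GBK lead byte is 0x81..0xFE and is
// followed by exactly one trail byte, which may itself fall in the ASCII
// range, as 0x7E does in \u4e8a (GBK 0x81 0x7E).
func indexGBKDelimiter(p []byte, delim byte) int {
    for i := 0; i < len(p); {
        switch {
        case p[i] == delim:
            return i
        case p[i] >= 0x81 && p[i] <= 0xFE && i+1 < len(p):
            i += 2 // two-byte GBK character: skip lead and trail byte
        default:
            i++ // single-byte (ASCII) character
        }
    }
    return -1
}

This works scanning forward because lead bytes are always 0x81 or higher; scanning backward from the tail stays ambiguous, which is exactly what makes a DecodeLastRune for GBK tricky.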