apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[BUG][CH] Corner case of unbase64 #7092

Open zhanglistar opened 2 weeks ago

zhanglistar commented 2 weeks ago

Backend

CH (ClickHouse)

Bug description

Vanilla Spark:
select unbase64('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA¬AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAyAAAAAAAEAAAAAAAAAAggAAAAAAAAAAQAsCQAAgMAAAJECQBUUEQBUAEFAEBMEAEEYRwQFAEFAEAARAACCAEQFEAVRwCAgAAqIAgAAAAALgAQIA7AIAAAABcBNCAQQYjAMAAACAgBptQxGSDAxcACBJBBQQhC5GnCBOVAVJAeQ==');


Gluten throws the same exception as ClickHouse:
bigo :) select base64Decode('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA¬AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAyAAAAAAAEAAAAAAAAAAggAAAAAAAAAAQAsCQAAgMAAAJECQBUUEQBUAEFAEBMEAEEYRwQFAEFAEAARAACCAEQFEAVRwCAgAAqIAgAAAAALgAQIA7AIAAAABcBNCAQQYjAMAAACAgBptQxGSDAxcACBJBBQQhC5GnCBOVAVJAeQ==') as b

zhanglistar commented 2 weeks ago

Invalid characters need to be ignored when decoding.

zhanglistar commented 2 weeks ago

Here is a minimal reproduction, using the same Java decoder that Apache Spark uses:

cat Base64DecodeToFile.java
import java.util.Base64;
import java.io.FileOutputStream;
import java.io.IOException;

public class Base64DecodeToFile {
    public static void main(String[] args) {
        String encodedString = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA¬AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAyAAAAAAAEAAAAAAAAAAggAAAAAAAAAAQAsCQAAgMAAAJECQBUUEQBUAEFAEBMEAEEYRwQFAEFAEAARAACCAEQFEAVRwCAgAAqIAgAAAAALgAQIA7AIAAAABcBNCAQQYjAMAAACAgBptQxGSDAxcACBJBBQQhC5GnCBOVAVJAeQ==";
        // The MIME (RFC 2045) decoder skips characters outside the base64 alphabet, such as '¬'.
        byte[] decodedBytes = Base64.getMimeDecoder().decode(encodedString);

        String filePath = "output.txt";

        try (FileOutputStream outputStream = new FileOutputStream(filePath)) {
            outputStream.write(decodedBytes);
            System.out.println("SUC: " + filePath);
        } catch (IOException e) {
            System.out.println("FAIL");
            e.printStackTrace();
        }
    }
}

Gluten uses https://github.com/aklomp/base64.git for decoding. If we do not ignore invalid characters, we get the same error as in ClickHouse.

But if we ignore invalid characters, as Apache Spark does, we get the same output as Spark.
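To illustrate the "ignore invalid characters" idea, here is a minimal Java sketch that drops every character outside the standard base64 alphabet and then hands the rest to a strict decoder. This is not the Gluten or ClickHouse code (which is C++); the class `LenientBase64` and the helpers `stripNonAlphabet` and `decodeIgnoringInvalid` are hypothetical names used only for this example, and the short input stands in for the long string from the report. It assumes an otherwise well-formed payload and does not try to mirror every edge case of Java's MIME decoder, such as data following the padding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class LenientBase64 {
    // Keep only characters of the standard base64 alphabet plus '=' padding.
    static String stripNonAlphabet(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean inAlphabet = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '+' || c == '/' || c == '=';
            if (inAlphabet) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Filter first, then decode with the strict (RFC 4648) decoder.
    static byte[] decodeIgnoringInvalid(String s) {
        return Base64.getDecoder().decode(stripNonAlphabet(s));
    }

    public static void main(String[] args) {
        // "aGVsbG8=" encodes "hello"; \u00AC is the '¬' character from the bug report.
        byte[] decoded = decodeIgnoringInvalid("aGVs\u00ACbG8=");
        System.out.println(new String(decoded, StandardCharsets.UTF_8)); // prints "hello"
    }
}
```

For this input the result matches what `Base64.getMimeDecoder()` returns, which is the behavior Spark's `unbase64` shows in the report.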
zhanglistar commented 2 weeks ago

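Things are not that simple, though. Base64 in Apache Spark follows RFC 2045 (MIME), whose decoders ignore characters outside the base64 alphabet, whereas ClickHouse follows RFC 4648 (as does StarRocks), which is more widely used and rejects such characters by default. It is unclear why Spark chose RFC 2045; it does not seem like a wise choice.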
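The difference between the two RFCs can be seen directly in the JDK, which ships both decoders. The sketch below is only an illustration: the class `Base64RfcComparison` and its helper names are made up for this example, and a short input replaces the long string from the report. `Base64.getMimeDecoder()` is the RFC 2045 decoder and `Base64.getDecoder()` is the basic RFC 4648 decoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64RfcComparison {
    // "aGVsbG8=" encodes "hello"; \u00AC ('¬') lies outside the base64 alphabet.
    static final String INPUT = "aGVs\u00ACbG8=";

    // RFC 2045 (MIME) decoding: characters outside the alphabet are skipped.
    static String decodeMime(String s) {
        return new String(Base64.getMimeDecoder().decode(s), StandardCharsets.UTF_8);
    }

    // RFC 4648 (basic) decoding: characters outside the alphabet are an error.
    static boolean strictRejects(String s) {
        try {
            Base64.getDecoder().decode(s);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeMime(INPUT));    // prints "hello"
        System.out.println(strictRejects(INPUT)); // prints "true"
    }
}
```

The strict path corresponds to the error seen in ClickHouse and Gluten, and the MIME path corresponds to the result from vanilla Spark.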

So we will leave this as is for now and see whether we need to implement RFC 2045 decoding in ClickHouse, which would be a lot of work.

Ref: https://zh.wikipedia.org/wiki/Base64