Open zhanglistar opened 2 weeks ago
Invalid characters need to be ignored when decoding Base64.
Here is the smallest code that reproduces the problem, using the Java code that Apache Spark uses:
cat Base64DecodeToFile.java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Base64;

public class Base64DecodeToFile {
    public static void main(String[] args) {
        // The input below contains an invalid '¬' character in the middle.
String encodedString = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA¬AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAyAAAAAAAEAAAAAAAAAAggAAAAAAAAAAQAsCQAAgMAAAJECQBUUEQBUAEFAEBMEAEEYRwQFAEFAEAARAACCAEQFEAVRwCAgAAqIAgAAAAALgAQIA7AIAAAABcBNCAQQYjAMAAACAgBptQxGSDAxcACBJBBQQhC5GnCBOVAVJAeQ==";
        // The MIME decoder skips characters outside the Base64 alphabet.
        byte[] decodedBytes = Base64.getMimeDecoder().decode(encodedString);
        String filePath = "output.txt";
        try (FileOutputStream outputStream = new FileOutputStream(filePath)) {
            outputStream.write(decodedBytes);
            System.out.println("SUC: " + filePath);
        } catch (IOException e) {
            System.out.println("FAIL");
            e.printStackTrace();
        }
    }
}
Gluten uses https://github.com/aklomp/base64.git for decoding. If we do not ignore invalid characters, we get the same error as in ClickHouse:
But if we ignore invalid characters, as Apache Spark does, we get the same output as Spark:
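However, things are not that simple. Base64 in Apache Spark follows RFC 2045 (MIME), which says characters outside the Base64 alphabet are to be ignored, while ClickHouse (and also StarRocks) follows RFC 4648, which is more widely used and requires such characters to be rejected unless stated otherwise. It is not clear why Spark chose RFC 2045; it does not look like a wise choice.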
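The difference between the two behaviours can be seen with the JDK alone. The following is only an illustrative sketch: the class name `DecoderComparison`, its helper methods, and the short sample input are made up for this example and are not part of Spark, Gluten, or ClickHouse. It uses `Base64.getMimeDecoder()` as the RFC 2045 style decoder and `Base64.getDecoder()` as the strict RFC 4648 style decoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class DecoderComparison {
    // RFC 2045 style: characters outside the Base64 alphabet are skipped.
    static String decodeMime(String input) {
        return new String(Base64.getMimeDecoder().decode(input), StandardCharsets.UTF_8);
    }

    // RFC 4648 style: returns true when the strict decoder rejects the input.
    static boolean strictRejects(String input) {
        try {
            Base64.getDecoder().decode(input);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: "Hello" encoded as Base64, with a stray '¬' (U+00AC) inserted.
        String withInvalidChar = "SGVs\u00ACbG8=";
        System.out.println("MIME decoder output: " + decodeMime(withInvalidChar));
        System.out.println("Strict decoder rejects input: " + strictRejects(withInvalidChar));
    }
}
```

The same input is accepted by one decoder and rejected by the other, which matches the mismatch described in this issue between vanilla Spark and the ClickHouse backend.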
So we are leaving this as is for now, until we decide whether RFC 2045 needs to be implemented in ClickHouse, which would be a lot of work.
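For discussion, one possible way to approximate the lenient behaviour without a full RFC 2045 implementation is to drop every character outside the Base64 alphabet before handing the input to a strict decoder. The sketch below shows the idea in Java only; `LenientBase64`, `stripNonBase64`, and `decodeLenient` are hypothetical names, and whether this matches Spark in every edge case (for example, unusual padding) is an assumption that would need to be checked.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class LenientBase64 {
    // Hypothetical helper: keep only Base64 alphabet characters and '=' padding.
    static String stripNonBase64(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean inAlphabet = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')
                    || c == '+' || c == '/' || c == '=';
            if (inAlphabet) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Filter first, then decode with the strict (RFC 4648 style) JDK decoder.
    static byte[] decodeLenient(String input) {
        return Base64.getDecoder().decode(stripNonBase64(input));
    }

    public static void main(String[] args) {
        // Hypothetical sample: "Hello" encoded as Base64, with a stray '¬' (U+00AC) inserted.
        byte[] decoded = decodeLenient("SGVs\u00ACbG8=");
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
    }
}
```

The filtering pass costs one extra scan of the input; a native implementation would more likely skip invalid bytes inside the decode loop itself.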
Backend
CH (ClickHouse)
Bug description
Vanilla Spark:
select unbase64('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA¬AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAyAAAAAAAEAAAAAAAAAAggAAAAAAAAAAQAsCQAAgMAAAJECQBUUEQBUAEFAEBMEAEEYRwQFAEFAEAARAACCAEQFEAVRwCAgAAqIAgAAAAALgAQIA7AIAAAABcBNCAQQYjAMAAACAgBptQxGSDAxcACBJBBQQhC5GnCBOVAVJAeQ==') . . . . . . . . . . . . . . . > ;
Gluten throws the same exception as ClickHouse:
bigo :) select base64Decode('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA¬AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAyAAAAAAAEAAAAAAAAAAggAAAAAAAAAAQAsCQAAgMAAAJECQBUUEQBUAEFAEBMEAEEYRwQFAEFAEAARAACCAEQFEAVRwCAgAAqIAgAAAAALgAQIA7AIAAAABcBNCAQQYjAMAAACAgBptQxGSDAxcACBJBBQQhC5GnCBOVAVJAeQ==') as b