Hi, I am running a POC to compare different data formats on serialization/deserialization speed, storage size, compatibility between different languages, etc. When I serialize a simple Java object to a Parquet file, it takes *6-7 seconds*, whereas serializing the same object to JSON takes 100 milliseconds.
Could you help me resolve this issue?
My configuration and code snippets:

Gradle dependencies:
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter'
    compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
    compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
    compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    compile group: 'joda-time', name: 'joda-time'
    compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
    compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

Code snippet:
// OVERWRITE is statically imported from ParquetFileWriter.Mode;
// parquetConfiguration is a Hadoop Configuration field of the enclosing class.
public <D> void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
    Path path = new Path("s3a://parquetpoc/data" + compressionCodecName + ".parquet");  // S3 target
    Path path1 = new Path("/Downloads/data" + compressionCodecName + ".parquet");       // local target
    Class<?> clazz = inputDataToSerialize.get(0).getClass();
    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}

Model used:
@Data
public class Employee {
    //private UUID id;
    private String name;
    private int age;
    private Address address;
}

@Data
public class Address {
    private String streetName;
    private String city;
    private Zip zip;
}

@Data
public class Zip {
    private int zip;
    private int ext;
}

private List<Employee> getInputDataToSerialize() {
    Address address = new Address();
    address.setStreetName("Murry Ridge Dr");
    address.setCity("Murrysville");
    Zip zip = new Zip();
    zip.setZip(15668);
    zip.setExt(1234);
    address.setZip(zip);
    List<Employee> employees = new ArrayList<>();
    // Build 100,000 employees that all share the same Address instance.
    IntStream.range(0, 100000).forEach(i -> {
        Employee employee = new Employee();
        // employee.setId(UUID.randomUUID());
        employee.setAge(20);
        employee.setName("Test" + i);
        employee.setAddress(address);
        employees.add(employee);
    });
    return employees;
}

Note: I have tried writing the data to the local file system as well as to AWS S3; both give the same result: very slow.
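
To narrow down where the 6-7 seconds go, the phases of the write (schema reflection, writer construction, the write loop, and close) could be timed separately. Below is a minimal sketch of such a helper using only the JDK. The names PhaseTimer, Step, and the phase labels are hypothetical and are not part of the code above or of the Parquet API; the workloads in main are stand-ins, not the real serialization calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not from the report): accumulates wall-clock time per named phase. */
final class PhaseTimer {

    /** A step that may throw a checked exception, e.g. IOException from a writer. */
    @FunctionalInterface
    public interface Step {
        void run() throws Exception;
    }

    // Insertion-ordered so phases are reported in the order they first ran.
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    /** Runs the step once and adds its elapsed time to the named phase, even if it throws. */
    public void time(String phase, Step step) throws Exception {
        long start = System.nanoTime();
        try {
            step.run();
        } finally {
            elapsedNanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total elapsed milliseconds for a phase, or 0 if the phase never ran. */
    public long millis(String phase) {
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos.getOrDefault(phase, 0L));
    }

    /** Copy of the per-phase totals in nanoseconds. */
    public Map<String, Long> snapshotNanos() {
        return new LinkedHashMap<>(elapsedNanos);
    }

    public static void main(String[] args) throws Exception {
        PhaseTimer timer = new PhaseTimer();
        // Stand-in workloads; in the real test these would wrap schema generation,
        // builder.build(), the writer.write(...) loop, and writer.close().
        timer.time("build", () -> Thread.sleep(5));
        timer.time("write", () -> Thread.sleep(10));
        for (Map.Entry<String, Long> e : timer.snapshotNanos().entrySet()) {
            System.out.println(e.getKey() + ": " + TimeUnit.NANOSECONDS.toMillis(e.getValue()) + " ms");
        }
    }
}
```

With a helper like this, the serialize method could wrap each stage in its own timer.time(...) call and be run more than once in the same JVM, so that one-time costs such as class loading show up separately from the steady-state write time.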
Reporter: Felix Kizhakkel Jose / @FelixKJose
Note: This issue was originally created as PARQUET-1680. Please see the migration documentation for further details.