apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Parquet Java Serialization is very slow #1569

Open asfimport opened 5 years ago

asfimport commented 5 years ago

Hi, I am doing a POC to compare different data formats and their performance in terms of serialization/deserialization speed, storage size, compatibility between languages, etc. When I serialize a simple Java object to a Parquet file, it takes **6-7 seconds**, whereas serializing the same object to JSON takes about 100 milliseconds.
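For context, the kind of wall-clock comparison described above can be sketched with a small stdlib-only harness. The `Serializer` interface and the dummy workload below are hypothetical stand-ins for the Parquet and JSON writers, not part of the original code:

```java
import java.util.concurrent.TimeUnit;

public class SerializationTimer {
    // Hypothetical stand-in for any serialization routine (Parquet, JSON, ...).
    interface Serializer {
        void serialize() throws Exception;
    }

    // Measures the wall-clock time of one serialization run in milliseconds.
    static long timeMillis(Serializer s) throws Exception {
        long start = System.nanoTime();
        s.serialize();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        // Dummy workload standing in for writing 100,000 records.
        long elapsed = timeMillis(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append("Test").append(i);
            }
        });
        System.out.println("elapsed ms: " + elapsed);
    }
}
```

Timing each format this way (whole write path, same input list) keeps the comparison apples-to-apples.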

Could you help me to resolve this issue?

My configuration and code snippet:

Gradle dependencies:

```gradle
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter'
    compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
    compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
    compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    compile group: 'joda-time', name: 'joda-time'
    compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
    compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}
```

Code snippet:

```java
public <D> void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
    Path path = new Path("s3a://parquetpoc/data" + compressionCodecName + ".parquet");
    Path path1 = new Path("/Downloads/data" + compressionCodecName + ".parquet");
    Class<?> clazz = inputDataToSerialize.get(0).getClass();

    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}
```

Model used:

```java
@Data
public class Employee {
    // private UUID id;
    private String name;
    private int age;
    private Address address;
}

@Data
public class Address {
    private String streetName;
    private String city;
    private Zip zip;
}

@Data
public class Zip {
    private int zip;
    private int ext;
}
```

 

```java
private List<Employee> getInputDataToSerialize() {
    Address address = new Address();
    address.setStreetName("Murry Ridge Dr");
    address.setCity("Murrysville");

    Zip zip = new Zip();
    zip.setZip(15668);
    zip.setExt(1234);

    address.setZip(zip);

    List<Employee> employees = new ArrayList<>();
    IntStream.range(0, 100000).forEach(i -> {
        Employee employee = new Employee();
        // employee.setId(UUID.randomUUID());
        employee.setAge(20);
        employee.setName("Test" + i);
        employee.setAddress(address);
        employees.add(employee);
    });
    return employees;
}
```

Note: I have tried saving the data to the local file system as well as to AWS S3, but both give the same result: the write is very slow.

Reporter: Felix Kizhakkel Jose / @FelixKJose

Note: This issue was originally created as PARQUET-1680. Please see the migration documentation for further details.

asfimport commented 5 years ago

Felix Kizhakkel Jose / @FelixKJose: Could someone please help me on this?

asfimport commented 5 years ago

Felix Kizhakkel Jose / @FelixKJose: Any help would be much appreciated.