apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Parquet Files Getting Emptied After Delete ParquetWriter #3014

Closed tomnoah1 closed 1 month ago

tomnoah1 commented 2 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Version: 1.14.1

I have the following code:

(1) writer = AvroParquetWriter
            .builder<GenericRecord>(ParquetOutputFile(localFile))
            .withSchema(SCHEMA)
            .build()
(2) writer.write(genericRecord)
(3) writer.close()
(4) writer = AvroParquetWriter
            .builder<GenericRecord>(ParquetOutputFile(localFile2))
            .withSchema(SCHEMA)
            .build()

After the third line, I can see the file with the data (genericRecord) and can read it. After the 4th line, however, the file is emptied: it contains no data and weighs 0 bytes. When trying to read it, I get: File(<some_name>) cannot be read as parquet. File matching that expression not found.

Without the 4th line, the file and its contents remain intact.

Component(s)

No response

wgtmac commented 2 months ago

Thanks for reporting the issue! Could you provide the complete code to reproduce it?

tomnoah1 commented 1 month ago

Sadly I can't export it, but it is essentially what I posted. The problem occurs at line 4, where I create a new writer and discard the old one (since it is assigned to the same variable name, the old writer becomes eligible for garbage collection). At that moment, the file is emptied.

FlechazoW commented 1 month ago

My example code runs fine, and I can’t reproduce your issue. Could you take a look at my example code? The code is as follows:

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.io.LocalOutputFile;

public class _02_Parquet_Example {
  public static void main(String[] args) throws IOException {
    // Define the Avro schema
    Schema schema = SchemaBuilder.record("User")
        .fields()
        .name("name").type().stringType().noDefault()
        .name("age").type().intType().noDefault()
        .endRecord();

    // Create a GenericRecord
    GenericRecord genericRecord = new GenericData.Record(schema);
    genericRecord.put("name", "John Doe");
    genericRecord.put("age", 30);

    // Parquet file paths
    String localFilePath = "./output-file1.parquet";
    String localFilePath2 = "./output-file2.parquet";

    LocalOutputFile localOutputFile = new LocalOutputFile(Paths.get(localFilePath));
    LocalOutputFile localOutputFile2 = new LocalOutputFile(Paths.get(localFilePath2));

    // Write to the first Parquet file
    ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(localOutputFile)
        .withSchema(schema)
        .build();
    writer.write(genericRecord);
    writer.close();

    // Write to the second Parquet file with the same record
    writer = AvroParquetWriter
        .<GenericRecord>builder(localOutputFile2)
        .withSchema(schema)
        .build();
    writer.write(genericRecord);
    writer.close();

    System.out.println("Data written to Parquet files successfully.");
  }
}

FlechazoW commented 1 month ago

@tomnoah1

tomnoah1 commented 1 month ago

You are writing to it again: in your example, the second writer also writes a record and is closed. In my scenario the second writer is only created. Try:

// Write to the second Parquet file with the same record
    writer = AvroParquetWriter
        .<GenericRecord>builder(localOutputFile2)
        .withSchema(schema)
        .build();

Instead of:

// Write to the second Parquet file with the same record
    writer = AvroParquetWriter
        .<GenericRecord>builder(localOutputFile2)
        .withSchema(schema)
        .build();
    writer.write(genericRecord);
    writer.close();

tomnoah1 commented 1 month ago

Taking a second look, I see that on the second initialization we actually did:

writer = AvroParquetWriter
        .<GenericRecord>builder(localOutputFile)
        .withSchema(schema)
        .build()

That means we accidentally used localOutputFile again instead of localOutputFile2, and I guess that's what caused the problem: a new writer opened on the same path and filename.
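
For context, this truncation behavior is not Parquet-specific: building a writer asks the OutputFile to open a fresh output stream on the path, and in java.nio, opening an output stream with default options truncates an existing file immediately, before anything is written. A minimal stdlib-only sketch of the same effect (file names are illustrative, and this only models the stream-opening step, not the Parquet writer itself):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TruncateOnOpen {
  public static void main(String[] args) throws IOException {
    Path path = Files.createTempFile("demo", ".parquet");
    // First "writer" has finished: 4 bytes are on disk.
    Files.write(path, new byte[] {1, 2, 3, 4});
    System.out.println("before: " + Files.size(path)); // before: 4

    // Opening a new output stream on the SAME path (as a second
    // writer built on the same OutputFile would) truncates it,
    // even though nothing is ever written through the new stream.
    Files.newOutputStream(path).close();
    System.out.println("after: " + Files.size(path));  // after: 0

    Files.delete(path);
  }
}
```

This is why the file appeared empty right at the point the second writer was built, without any write or close call.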

FlechazoW commented 1 month ago

@tomnoah1 If there are no other issues with this, you can close it. Thanks.

wgtmac commented 1 month ago

Thanks @FlechazoW for the help!