aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

Modifying and writing back results in a corrupted file #441

Closed fasterinnerlooper closed 6 months ago

fasterinnerlooper commented 6 months ago

Issue with modifying a parquet file, and then saving it back to the same file.

I just posted the same question to StackOverflow as well. Here's the link: https://stackoverflow.com/questions/77675637/problem-modifying-a-parquet-file-using-parquet-dotnet

Hi, I'm not sure what I'm doing wrong here but every time I write the file back, it results in a corrupted file. Can someone help me figure out what I'm doing wrong, please?

            using var memorystream = new MemoryStream();
            using var reader = await ParquetReader.CreateAsync(filestream);
            using var writer = await ParquetWriter.CreateAsync(reader.Schema, memorystream);

            Console.WriteLine($"Reading file {filename}");
            Console.WriteLine(string.Empty);
            var tasks = new List<Task>();
            for (int i = 0; i < reader.RowGroupCount; i++)
            {
                Console.SetCursorPosition(0, Console.CursorTop - 1);
                Console.Write(Enumerable.Repeat(' ', Console.BufferWidth).ToArray());
                Console.WriteLine($"\rReading row group {i + 1} of {reader.RowGroupCount}");
                Console.WriteLine(string.Empty);
                var table = await reader.ReadAsTableAsync(rowGroupIndex: i);
                foreach (var row in table)
                {
                    Console.SetCursorPosition(0, Console.CursorTop - 1);
                    Console.Write(Enumerable.Repeat(' ', Console.BufferWidth).ToArray());
                    Console.WriteLine($"\rProcessing row {table.IndexOf(row) + 1} of {table.Count}");
                    var row1 = row.GetString(0);
                    var row2 = row.GetString(1);
                    row[0] = NetCodeParser.RemoveComments(row1);
                    row[1] = NetCodeParser.RemoveComments(row2);
                }
                using (var groupWriter = writer.CreateRowGroup())
                {
                    await groupWriter.WriteAsync(table);
                }
                Console.SetCursorPosition(0, Console.CursorTop - 1);
            }
            var tempFile = Path.GetRandomFileName();
            using var tempStream = new FileStream(tempFile, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.ReadWrite);
            memorystream.WriteTo(tempStream);
            try
            {
                _ = await ParquetReader.CreateAsync(tempFile);
            }
            catch (IOException)
            {
                Console.WriteLine("File modifications failed");
            }
            memorystream.WriteTo(filestream);
            Console.WriteLine("Successfully updated parquet file");
            writer.Dispose();

Here is the error message I get: [image: Error Message from Visual Studio]

aloneguid commented 6 months ago

You're constructing a reader on an empty stream, which is not a valid Parquet data structure. You can do something like:

  1. create mem stream
  2. create writer
  3. write data
  4. rewind mem stream to position 0
  5. write mem stream to file
  6. dispose writer
  7. rewind mem stream to position 0
  8. create reader
  9. read
  10. dispose reader
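
The rewind steps in the list above come down to how .NET streams track their position. Here is a minimal sketch using only `System.IO`, with a placeholder byte array standing in for whatever the Parquet writer produced. The `StreamRewindDemo` class and its `SaveTo` method are hypothetical names made up for illustration; they are not part of Parquet.Net.

```csharp
using System;
using System.IO;

public static class StreamRewindDemo
{
    // Copies a memory stream into a file, replacing any previous contents.
    public static void SaveTo(MemoryStream source, string path)
    {
        // FileMode.Create truncates an existing file, so no stale bytes
        // from a longer previous version are left behind at the end.
        using var file = new FileStream(path, FileMode.Create, FileAccess.Write);

        // CopyTo starts at the current position. After writing, that is the
        // end of the stream, so rewind first or nothing is copied.
        source.Position = 0;
        source.CopyTo(file);
    }

    public static void Main()
    {
        // Placeholder payload; in the thread this would be finished Parquet data.
        byte[] payload = { 1, 2, 3, 4, 5 };
        using var mem = new MemoryStream();
        mem.Write(payload, 0, payload.Length);

        // After writing, the position sits at the end of the stream.
        Console.WriteLine(mem.Position == mem.Length);

        string path = Path.GetTempFileName();
        SaveTo(mem, path);
        Console.WriteLine(new FileInfo(path).Length);

        // A reader created on the same memory stream needs it rewound too.
        mem.Position = 0;
        Console.WriteLine(mem.ReadByte());

        File.Delete(path);
    }
}
```

`MemoryStream.WriteTo`, which the original post uses, writes the whole buffer regardless of `Position`, so the rewind before copying only matters with `CopyTo`. The rewind before creating a reader is always needed. Also, as far as I know, Parquet.Net only finishes the file (writes the footer) when the writer is disposed, so it is probably safest to dispose the writer before copying the memory stream anywhere.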

hope this helps

aloneguid commented 6 months ago

Btw, you can always create the writer on the file stream directly; there's no need to introduce a memory stream.
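
One way to apply this when the goal is to modify a file "in place" is to write to a second file stream and swap the files once both streams are closed. The sketch below uses only `System.IO`; the `InPlaceRewrite` class and `RewriteAsync` method are hypothetical helpers invented for this example, and `File.Move` with `overwrite: true` assumes .NET Core 3.0 or later. The commented usage at the bottom reuses only the Parquet.Net calls already shown in the original post.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class InPlaceRewrite
{
    // Streams `path` through `transform` into a temp file in the same
    // directory, then moves the temp file over the original.
    public static async Task RewriteAsync(string path, Func<Stream, Stream, Task> transform)
    {
        string dir = Path.GetDirectoryName(Path.GetFullPath(path)) ?? ".";
        string tempPath = Path.Combine(dir, Path.GetRandomFileName());
        try
        {
            // Inner scope: both streams are closed before the swap.
            using (var input = File.OpenRead(path))
            using (var output = File.Create(tempPath))
            {
                await transform(input, output);
            }
            File.Move(tempPath, path, overwrite: true);
        }
        catch
        {
            // Leave the original untouched if anything failed.
            if (File.Exists(tempPath)) File.Delete(tempPath);
            throw;
        }
    }

    public static async Task Main()
    {
        // Plain-text stand-in for a real file, just to exercise the helper.
        string path = Path.GetTempFileName();
        await File.WriteAllTextAsync(path, "old");
        await RewriteAsync(path, async (input, output) =>
        {
            await input.CopyToAsync(output);
            output.WriteByte((byte)'!');
        });
        Console.WriteLine(await File.ReadAllTextAsync(path));
        File.Delete(path);
    }
}

// Hypothetical usage with the calls from the original post:
// await InPlaceRewrite.RewriteAsync(filename, async (input, output) =>
// {
//     using var reader = await ParquetReader.CreateAsync(input);
//     using var writer = await ParquetWriter.CreateAsync(reader.Schema, output);
//     // read, modify and write each row group as in the original loop
// }); // reader and writer are disposed here, before the file swap
```

Writing to a separate file avoids reading from and writing to the same stream at once, and the original stays intact if the rewrite throws halfway through.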

fasterinnerlooper commented 6 months ago

Part of the problem was that I was writing the data back but with line breaks. Is there a standard way of encoding line breaks into parquet files?

aloneguid commented 6 months ago

> Part of the problem was that I was writing the data back but with line breaks. Is there a standard way of encoding line breaks into parquet files?

Do you mean in the underlying stream? It should be a binary stream; Parquet data is binary. If you're struggling, show me the full piece of code ;)
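
To illustrate why line breaks should not need any special encoding in a binary format: in Parquet's PLAIN encoding, a byte-array value (which is how strings are stored) is a 4-byte little-endian length followed by the raw bytes, so a line break is just another byte in the payload. The sketch below imitates that layout for a single value using only the .NET base class library. It is not Parquet.Net's actual code, and the `LengthPrefixedString` class is a hypothetical name for illustration.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

public static class LengthPrefixedString
{
    // Encodes a string as a 4-byte little-endian length followed by UTF-8 bytes.
    public static byte[] Encode(string value)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(value);
        byte[] result = new byte[4 + utf8.Length];
        BinaryPrimitives.WriteInt32LittleEndian(result, utf8.Length);
        utf8.CopyTo(result, 4);
        return result;
    }

    // Reads the length prefix, then decodes exactly that many UTF-8 bytes.
    public static string Decode(byte[] data)
    {
        int length = BinaryPrimitives.ReadInt32LittleEndian(data);
        return Encoding.UTF8.GetString(data, 4, length);
    }

    public static void Main()
    {
        // Source text with both Windows and Unix line breaks.
        string code = "int x = 1;\r\nint y = 2;\n";
        string roundTripped = Decode(Encode(code));

        // The line breaks survive unchanged because nothing interprets them.
        Console.WriteLine(roundTripped == code);
    }
}
```

Because the length prefix says where the value ends, no delimiter or escaping is involved. If line breaks seem to corrupt a file, the cause is more likely somewhere the bytes are treated as text than in the string values themselves.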