Open asfimport opened 3 years ago
Gabor Szadovszky / @gszadovszky: I can see that this is a regression in the new parquet release. I remember there were other issues related to empty files/row groups before this change, but I can't recall the details.
Meanwhile, I am curious about the purpose of writing empty Parquet files. What information does Drill actually store in such files? The schema can easily be stored in string form in any file or database. parquet-mr already supports writing and parsing such string-form Parquet schemas.
Vitalii Diravka / @vdiravka: @gszadovszky Thanks for checking this. Drill can store empty tables (schema without data) in formats other than Parquet. One of Drill's features is that it can do CTAS with different formats: a user can choose the CTAS format when they start using Drill, and all tables will then be created in that format. Some formats, such as CSV and JSON, have difficulties with certain CTAS queries, and Parquet wins in most cases, so it is used as Drill's default format. With the new change, CTAS with LIMIT 0 is no longer possible for the Parquet format. Empty tables could still be created in another file format while in Parquet CTAS mode, but that would be a hybrid mode rather than tables made purely of Parquet files. And since Drill is not a regular database, it has no such hybrid mode for creating tables in different formats just so that table creation succeeds.
Therefore, from the Drill perspective it would be great to be able to create empty Parquet files and have them recognized as valid, possibly by passing an explicit flag to the endBlock() method. My subjective point of view: in the real world I think there are many cases where only the schema is present and this information is still valuable. I think Parquet should be able to handle this kind of data.
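To make the explicit-flag idea concrete, here is a minimal, self-contained sketch. It is a toy model and not parquet-mr's actual writer: the class `ToyBlockWriter`, the overload `endBlock(boolean allowEmpty)`, and the warning text are all hypothetical, and it assumes the regression comes from a check that rejects a block with zero records.

```java
// Hypothetical toy model of a row-group writer; NOT parquet-mr's real API.
final class ToyBlockWriter {
    private long currentRecordCount = 0; // rows written to the open block
    private int blocksWritten = 0;       // row groups that would reach the footer

    /** Records one row in the currently open block. */
    void writeRecord() {
        currentRecordCount++;
    }

    /** Strict variant: ending a block with zero records is an error. */
    void endBlock() {
        endBlock(false);
    }

    /**
     * Lenient variant (assumed signature): with allowEmpty == true a zero-row
     * block is skipped with a warning, so a schema-only file can still be closed.
     */
    void endBlock(boolean allowEmpty) {
        if (currentRecordCount == 0) {
            if (!allowEmpty) {
                throw new IllegalStateException("End block with zero record");
            }
            System.err.println("WARN: creating a file with a schema but no rows");
            return; // no row group is recorded for the empty block
        }
        blocksWritten++;
        currentRecordCount = 0; // ready for the next block
    }

    /** Number of non-empty blocks recorded so far. */
    int blocksWritten() {
        return blocksWritten;
    }
}
```

In this sketch a caller that needs schema-only files would opt in with `endBlock(true)`, while existing callers of `endBlock()` keep the strict behaviour.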
Gabor Szadovszky / @gszadovszky: @vdiravka, thanks for the explanation. I still think that an empty table should not require an empty Parquet file to be created. That said, I am not against allowing empty Parquet files, but we have to investigate this carefully. Does the format itself logically allow an empty file? E.g. what should the accepted values be for the data/dictionary page offsets? (These are required fields.) If we conclude that the format allows this, we should write proper unit tests in parquet-mr to ensure we can handle empty files in all scenarios and with all bindings. Even though it is a regression, we could not catch it because we had no unit tests for it. I think the ability to create empty files was more a hidden feature than an intentional one. If we re-introduce this feature, we should do it properly.
Gabor Szadovszky / @gszadovszky: @vdiravka, based on the discussion at the recent Parquet sync meeting, the community is not against allowing empty Parquet files to be created. However, we do not have the bandwidth to invest in this feature. Feel free to contribute; I am happy to help/review.
Since PARQUET-1851, Parquet files that contain a schema (meta information) but 0 rows, aka empty files, can no longer be written. As a result, empty tables cannot be stored in Drill using Parquet files, for example:
So PARQUET-1851 breaks the following test cases:
I suggest logging a warning when an empty Parquet file is created, or adding an alternative endBlock for backward compatibility with other tools:
Reporter: Vitalii Diravka / @vdiravka
Note: This issue was originally created as PARQUET-2026. Please see the migration documentation for further details.