LibertyDSNP / parquetjs

Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features
MIT License

Are bloom filters supported on LIST types? #98

Closed ljwagerfield closed 7 months ago

ljwagerfield commented 8 months ago

Hi there 👋

Firstly, thank you for this amazing library!

I'm curious to know how to add bloom filters to LIST types.

For example, given this schema:

{
  querystring: {
    type: "LIST",
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            fields: {
              key: { type: "UTF8" },
              value: { type: "UTF8" }
            }
          }
        }
      }
    }
  }
}

How do you add a bloom filter for the querystring.list.element.key field?

[
  {
    column: "querystring.list.element.key",
    numDistinct: 100
  }
]

I assume the above won't work? (Sorry in advance if that literally is how you do it!)

Thanks in advance!

wilwade commented 8 months ago

Hi @ljwagerfield !

Been looking into this. It might work, but I think there might also be a bug in the column naming causing issues.

Your guess at how it might work is approximately how I think it should work (or close to it).

So I think this is a bug. The library's bloom filter does not currently handle nested fields at all (although most of the pieces are in place for it to do so).
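
To make "nested fields" concrete, the leaf columns of a schema like the one in the question can be enumerated by walking its `fields` objects. This is only a rough sketch, not library code: `leafPaths` and `FieldDef` are hypothetical names, and the comma separator is an assumption based on the `"querystring,list,element,key"` column string used elsewhere in this thread.

```typescript
// Hypothetical helper, not part of parquetjs: list the leaf column paths
// of a nested schema definition by walking its `fields` objects.
interface FieldDef {
  type?: string;
  repeated?: boolean;
  fields?: Record<string, FieldDef>;
}

function leafPaths(
  fields: Record<string, FieldDef>,
  separator: string = ",", // assumption: matches "querystring,list,element,key"
  prefix: string[] = []
): string[] {
  const paths: string[] = [];
  for (const [name, def] of Object.entries(fields)) {
    const path = [...prefix, name];
    if (def.fields) {
      // Group node: recurse into its children.
      paths.push(...leafPaths(def.fields, separator, path));
    } else {
      // Leaf node: this is an actual column that could carry a bloom filter.
      paths.push(path.join(separator));
    }
  }
  return paths;
}

// The schema definition from the question.
const schemaDef: Record<string, FieldDef> = {
  querystring: {
    type: "LIST",
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            fields: {
              key: { type: "UTF8" },
              value: { type: "UTF8" },
            },
          },
        },
      },
    },
  },
};

// Leaf paths: "querystring,list,element,key" and "querystring,list,element,value"
console.log(leafPaths(schemaDef));
```

Passing `"."` as the separator gives the dotted form from the original question (`querystring.list.element.key`) instead.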

For someone who wants to work on making this possible here are some notes:

Here is a simple setup for what "should" work, but doesn't, due at least in part to the bug noted above.

import * as parquet from "@dsnp/parquetjs"; // assumed import style for this package

const main = async () => {

  const file = "parquet-testing/issue-98.parquet";

  const schema = new parquet.ParquetSchema({
    querystring: {
      type: "LIST",
      fields: {
        list: {
          repeated: true,
          fields: {
            element: {
              fields: {
                key: { type: "UTF8" },
                value: { type: "UTF8" }
              }
            }
          }
        }
      }
    }
  });

  try {
    const writer = await parquet.ParquetWriter.openFile(schema, file, {
      bloomFilters: [
        {
          column: "querystring,list,element,key",
        },
      ],
    });

    await writer.appendRow({
      querystring: {
        list: [
          { element: { key: "foo", value: "bar" } },
          { element: { key: "foo2", value: "bar2" } },
        ],
      },
    });
    await writer.close();
  } catch (error: any) {
    console.log("I'm in the write catch!", error)
  }

  try {
    const reader = await parquet.ParquetReader.openFile(file);
    const cursor = reader.getCursor();
    console.log("row", await cursor.next());
    const metadata = reader.getMetadata();
    console.log("metadata", metadata);
    const bloomFilters = await reader.getBloomFiltersFor(["querystring,list,element,key"]);
    console.log("bloomFilters", bloomFilters);
  } catch (error: any) {
    console.log("I'm in the read catch!", file, error)
  }
}

main()

ljwagerfield commented 8 months ago

Aha, interesting!

So the Parquet specification does support bloom filters on lists (I wasn't even sure of this), and this library is close to supporting an implementation for that.
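
As a rough illustration of why this is workable: a filter on a list's leaf column only needs every element's value inserted into the same filter. The sketch below is a deliberately simplified stand-in (a plain bit-array filter with an FNV-1a-style hash), not the split-block Bloom filter that the Parquet format defines and not parquetjs code. `ToyBloomFilter`, `keysOf`, and the sizing defaults are all hypothetical.

```typescript
// Toy Bloom filter for illustration only. Parquet's real filter is a
// split-block Bloom filter; this simple bit-array variant is a stand-in.
class ToyBloomFilter {
  private bits: Uint8Array;
  private numBits: number;
  private numHashes: number;

  constructor(numBits: number = 1024, numHashes: number = 3) {
    this.numBits = numBits;
    this.numHashes = numHashes;
    this.bits = new Uint8Array(Math.ceil(numBits / 8));
  }

  // FNV-1a-style 32-bit hash, varied by a seed to get several hash functions.
  private hash(value: string, seed: number): number {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h % this.numBits;
  }

  insert(value: string): void {
    for (let s = 0; s < this.numHashes; s++) {
      const bit = this.hash(value, s);
      this.bits[bit >> 3] |= 1 << (bit & 7);
    }
  }

  // false means "definitely absent"; true means "possibly present".
  mightContain(value: string): boolean {
    for (let s = 0; s < this.numHashes; s++) {
      const bit = this.hash(value, s);
      if ((this.bits[bit >> 3] & (1 << (bit & 7))) === 0) return false;
    }
    return true;
  }
}

type Row = {
  querystring: { list: { element: { key: string; value: string } }[] };
};

// Flatten one row's list into the values of its `key` leaf column.
function keysOf(row: Row): string[] {
  return row.querystring.list.map((item) => item.element.key);
}

// Same row shape as the test row in this thread.
const row: Row = {
  querystring: {
    list: [
      { element: { key: "foo", value: "bar" } },
      { element: { key: "foo2", value: "bar2" } },
    ],
  },
};

const filter = new ToyBloomFilter();
for (const key of keysOf(row)) filter.insert(key);

console.log(filter.mightContain("foo"));  // true: inserted values are never missed
console.log(filter.mightContain("foo2")); // true
```

A value that was never inserted will usually return `false`, but false positives are possible by design, so the filter can only rule rows out, never confirm a match.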

That's awesome!

I don't have any spare cycles at present (or the Parquet knowledge!) to contribute, unfortunately, so am happy if you want to close.

Great to know, though, and thanks again! 👍 👍 👍

wilwade commented 8 months ago

I'll leave it open for now. With the quick test I wrote, I might be able to get a fix in if I can get the time. Others might want to pick it up as well.

wilwade commented 7 months ago

@ljwagerfield This release is going out with support!

https://github.com/LibertyDSNP/parquetjs/releases/tag/v1.3.5

ljwagerfield commented 7 months ago

Awesome, thanks so much! 🚀