astronomy-commons / hats-import

HATS import - generate HATS-partitioned catalogs
https://hats-import.readthedocs.io
BSD 3-Clause "New" or "Revised" License

How many row groups per file should we aim for? #400

Open troyraen opened 2 weeks ago

troyraen commented 2 weeks ago

My understanding is that we want a small-ish number of row groups per file because that makes loading the dataset faster. On the flip side, I imagine that putting too many rows in a group would make row lookups slower. So there's probably a sweet spot, but I haven't really tested this out. In my experience, most catalogs end up with about 1-4 row groups per file (given defaults for the writer kwargs), but I've run into a very different case and am wondering what to do.
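
To make that concrete, here is a minimal sketch of how the per-file row-group counts could be tallied with pyarrow; the catalog path is a hypothetical placeholder, not the actual location of the dataset:

```python
import glob

import pyarrow.parquet as pq

# Hypothetical path; substitute the root of the HATS/hipscat catalog.
files = glob.glob("ztf_lc_catalog/**/*.parquet", recursive=True)

counts = []
for path in files:
    # ParquetFile only reads the footer metadata, so this is cheap per file.
    counts.append(pq.ParquetFile(path).metadata.num_row_groups)

print(f"files: {len(counts)}")
print(f"mean row groups per file: {sum(counts) / len(counts):.1f}")
print(f"files with >40 row groups: {sum(c > 40 for c in counts)}")
```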

I found that the ZTF Light Curves hipscat catalog I made has a mean of 22 row groups per file, with almost 2000 files having more than 40 row groups each. That must have happened because this dataset is skinny (<15 columns), so a 500 MB file has tens of millions of rows (hipscat-import doesn't specify a max-rows-per-group kwarg, so it must be using pyarrow's default, which maxes out at ~1 million rows per group). Since this is a 10 TB dataset, the cumulative difference in efficiency between large and small row groups could be significant, and I'd rather not put out a product that's extra hard to work with. I may be re-making these files before we release them anyway, or at least making additional products as ZTF puts out new data releases, so I could adjust this then.
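
If it does come to re-making the files, one possible fix (a sketch only, assuming the files are rewritten directly with pyarrow rather than through hats-import's own writer kwargs; the paths below are hypothetical) would be to pass a larger row_group_size when writing:

```python
import pyarrow.parquet as pq

# Hypothetical source/destination paths inside the catalog directory tree.
src = "Norder=5/Dir=0/Npix=1234.parquet"
dst = "Norder=5/Dir=0/Npix=1234_rewritten.parquet"

table = pq.read_table(src)

# pyarrow caps rows per row group at roughly 1 million by default; an
# explicit row_group_size raises that cap so a 500 MB skinny file ends up
# with a handful of row groups instead of ~22.
pq.write_table(table, dst, row_group_size=5_000_000)

print(pq.ParquetFile(dst).metadata.num_row_groups)
```

The trade-off is roughly as described above: fewer, larger row groups mean less footer metadata and fewer reads for full scans, at the cost of coarser granularity for row-group-level filtering.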

Does anyone have a sense of whether reducing the number of row groups would make this dataset easier to use, and/or at what point it would become counterproductive because there'd be too many rows in each group?

nevencaplar commented 1 week ago

This was discussed during the Friday office hours on October 4, 2024.

The conclusion is that we do not have a firm suggestion at this point. Further benchmarking is needed to answer this question.