
Storing Lot of Sparse Columns #1078

Closed: asheeshgarg closed this issue 7 months ago

asheeshgarg commented 4 years ago

I have around 40K columns in a Spark DataFrame, and a lot of them have null values. Storing the data to an Iceberg table takes a lot of time, even though the final Parquet data is only 20-25 MB. Is persisting a sparse-row DataFrame to Iceberg supported?

rdblue commented 4 years ago

What support are you referring to?

asheeshgarg commented 4 years ago

Just like the sparse and dense vector representations we have in Spark MLlib. Another thought: if we don't store a value when it is null, then while reading the data back it could return a default value instead.
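For context, a minimal sketch of the MLlib distinction I mean (assuming pyspark is available; the values are just placeholders):

```python
from pyspark.ml.linalg import Vectors

# Dense: every slot is materialized, including the zero entries.
dense = Vectors.dense([0.0, 0.0, 3.0, 0.0])

# Sparse: only the non-default positions are stored (size, indices, values).
sparse = Vectors.sparse(4, [2], [3.0])
```

Something analogous at the storage level is what I was asking about.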

rdblue commented 4 years ago

Just like we have in Spark MLLib Sparse and dense representation.

I'm not familiar with what you're referring to, so a link would help give context.

Iceberg uses columnar formats that help, but doesn't automatically convert to a sparse representation if that's what you're referring to.

asheeshgarg commented 4 years ago

Thanks Ryan.

"Iceberg uses columnar formats that help, but doesn't automatically convert to a sparse representation if that's what you're referring to." -> Yes, that is what I'm referring to.

Here is what is currently happening. I have 32K records with 40K columns, and about 80% of the columns have null values. If I persist this Spark DataFrame as CSV, the data comes out to roughly 4 GB and takes 4 minutes to persist to the S3 bucket. If I persist it as Parquet through Iceberg, the data comes out to roughly 19 MB but takes 8 minutes to persist to the S3 bucket. I tried the uncompressed codec for Parquet, but the timing didn't improve. I suspect most of the time is spent in Parquet's internal data structures, such as dictionary encoding and the other optimizations of a columnar store.
What I was referring to: is there a way to drop the 80% of columns that are null during storage, using some sparse-storage technique, which would significantly reduce the write time? Then while reading back we could return default values for those columns. Any suggestion would be really helpful.
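As a rough idea of the kind of workaround I have in mind (a sketch only, assuming Spark 3 with an Iceberg catalog configured; the table name is a placeholder), dropping the all-null columns before the write:

```python
from pyspark.sql import functions as F

# Count non-null values per column in a single pass over the DataFrame.
non_null = df.select([F.count(F.col(c)).alias(c) for c in df.columns]).first().asDict()

# Keep only the columns that actually contain data, then write to Iceberg.
keep = [c for c, n in non_null.items() if n > 0]
df.select(keep).writeTo("catalog.db.sparse_table").append()
```

Readers would then have to fill in defaults for the dropped columns themselves, which is the part that isn't automatic today.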

medb commented 3 years ago

@asheeshgarg Did you resolve this issue? Maybe you just need to increase the number of records written to each Parquet file? (if the slowness comes from persisting sparse data in Parquet producing many small files)
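For example, something along these lines (just a sketch; these are Iceberg table properties as I recall them, so please double-check the names against the table-properties docs, and the byte sizes are arbitrary):

```python
# Raise the target data file size and Parquet row-group size so each
# file holds more records; run against the placeholder table name above.
spark.sql("""
    ALTER TABLE catalog.db.sparse_table SET TBLPROPERTIES (
        'write.target-file-size-bytes' = '536870912',
        'write.parquet.row-group-size-bytes' = '134217728'
    )
""")
```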

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 7 months ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.