NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

dcp_lion parquet creation issues #600

Open fvankrieken opened 9 months ago

fvankrieken commented 9 months ago

Didn't want to hold up the CEQR work to focus on this, but dcp_lion archival fails specifically during parquet generation:

RuntimeError: FileWriter::Close() failed with Only 29 out of 130 columns are initialized
May be caused by: Terminating translation prematurely after failed
translation of layer lion (use -skipfailures to skip errors)
May be caused by: Unable to write feature 65536 from layer lion.
May be caused by: WriteColumnChunk() failed for field LBoro: Writing DictionaryArray with null encoded in dictionary type not yet supported

We can stick with pg_dump for now, but will want to fix this eventually. We could also try skipping failures (-skipfailures) and checking whether the parquet output has the same number of NULLs for LBoro as the pg_dump does.
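A rough sketch of that NULL-count check, assuming the rows from each source have already been loaded into plain dicts (the loading step is not shown, and `count_nulls` and `null_counts_match` are hypothetical helper names, not part of this repo):

```python
from typing import Any, Iterable, Mapping, Tuple


def count_nulls(rows: Iterable[Mapping[str, Any]], column: str) -> int:
    """Count rows where `column` is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def null_counts_match(
    pg_rows: Iterable[Mapping[str, Any]],
    parquet_rows: Iterable[Mapping[str, Any]],
    column: str = "LBoro",  # field named in the error message
) -> Tuple[bool, int, int]:
    """Compare NULL counts for one column between the two outputs.

    Returns (counts_equal, pg_null_count, parquet_null_count).
    """
    pg_nulls = count_nulls(pg_rows, column)
    parquet_nulls = count_nulls(parquet_rows, column)
    return pg_nulls == parquet_nulls, pg_nulls, parquet_nulls
```

If the counts differ after running with -skipfailures, the difference would give a rough idea of how many features were dropped because of NULLs in that column, though it would not by itself show which ones.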

fvankrieken commented 9 months ago

Interestingly, 65536 is the default ROW_GROUP_SIZE for the GDAL Parquet driver. Does not seem like a coincidence.
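To make the arithmetic explicit, here is a small sketch that locates a feature index relative to row-group boundaries. It assumes the number in the error message can be treated as a plain index into the layer (whether GDAL reports it zero-based or as a count is not confirmed here), and `locate_in_row_groups` is a hypothetical helper:

```python
DEFAULT_ROW_GROUP_SIZE = 65536  # default ROW_GROUP_SIZE noted above


def locate_in_row_groups(feature_index, row_group_size=DEFAULT_ROW_GROUP_SIZE):
    """Return (row_group_number, offset_within_group) for a feature index."""
    if row_group_size <= 0:
        raise ValueError("row_group_size must be positive")
    return divmod(feature_index, row_group_size)


# Feature number taken from the error message in this issue.
group, offset = locate_in_row_groups(65536)
print(group, offset)
```

An offset of 0 means the index is an exact multiple of the row group size, which would be consistent with the failure happening when the first full row group is written out rather than on one particular bad feature.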