zaneselvans opened 2 years ago
Hey @katie-lamb I went ahead and assigned you to this one. I think it's a mix of problems stemming from missing pandas datatype metadata, and the way that Intake is interpreting & displaying the data it finds in the Parquet files.
The issues that I think are related to pandas metadata / writing Parquet files are:
- `string` IDs `unitid` and `unit_id_epa` showing up as `object` columns.
- `facility_id` is allowed to be null, so it should be `Int32` and not `int32`, while `plant_id_eia` can't be null, so it should be `int32`.
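The nullable vs. non-nullable distinction above can be sketched in a few lines of pandas. This is just an illustration with made-up values, not the real EPA CEMS data; only the column names come from the issue:

```python
# Sketch: pandas' nullable Int32 dtype can hold NA values, while plain
# numpy-backed int32 cannot. Data here is illustrative only.
import pandas as pd

df = pd.DataFrame(
    {
        "facility_id": pd.array([1, None, 3], dtype="Int32"),   # nullable
        "plant_id_eia": pd.array([10, 20, 30], dtype="int32"),  # non-nullable
        "unitid": pd.array(["1A", "2B", "3C"], dtype="string"), # not object
    }
)

print(df.dtypes)
```

Casting the `facility_id` column to plain `int32` would raise because of the NA, which is why the nullable `Int32` dtype is the right target for it.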
The `source.discover()` method shows some details about the internals of a data source within an Intake catalog. However, some of this information doesn't reflect what's in the Parquet files as well as it could. We should make sure:
- `unitid` and `unit_id_epa` show up as `string`, not `object`.
- The `category` columns `state`, `so2_mass_measurement_code`, `nox_rate_measurement_code`, `nox_mass_measurement_code`, and `co2_mass_measurement_code` show up as `category` instead of `int64` (presumably they're appearing as integers because the integers are keys in a dictionary of categorical values?).
- The `shape` tuple should indicate the number of rows in the dataset rather than `None`, since that information is stored in the Parquet file metadata.

Some of these issues seem to be arising from Intake, and some seem to arise from the metadata that's getting written to the Parquet files in the ETL. Looking at the type information for a sample of the data after it's been read back into a pandas dataframe:
The categorical values show up correctly as categories, but the other type issues (nullability, string vs. object) remain. In my experimentation with different ways of writing out the files, I think I did see strings, nullable types, and category types coming through fine in the past, so I think there's something wrong with the Parquet metadata. Reading in one file and looking at the metadata directly, the types all appear to be correct:
However... `epacems.schema.pandas_metadata` is `None`, so it's relying on the default mapping of PyArrow types to pandas types, which isn't what we want it to do. Why isn't the pandas metadata being embedded in the Parquet file? Is it possible to insert it explicitly? The function that's writing the Parquet files is `pudl.etl._etl_one_year_epacems()`, and it's using `pa.Table.from_pandas()`, so.... wtf?