catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

convert CEMS etl process for data packaging #340

Closed cmgosnell closed 4 years ago

cmgosnell commented 5 years ago

I've been putting off converting the CEMS ETL function to data packaging because it's a bit more complex and I have never taken the time to understand it.

I believe we'll always need to have CEMS and EIA 860 together (based on this). In that case:

karldw commented 5 years ago

When you say "convert the load step to dump CSVs", do you mean the data package output will be the CSVs? Would it be possible to use parquet files instead?

If parquet files work, I should finish up a PR I've been working on with changes to the parquet output schema.

cmgosnell commented 4 years ago

@karldw, if my understanding is correct, we are planning on having the data packages output compressed CSVs instead of parquet files, because that is the file format the specs support. That way the data (or at least a portion of it) can be validated against the table schema. But I know you moved forward with having CEMS output parquet files as a standalone output, which is great.

cmgosnell commented 4 years ago

@roll I have two questions about generating metadata for data packages that are using the "Data in Multiple Files" pattern. Having the path be "path": ['table_name_2017.csv.gz', 'table_name_2016.csv.gz', 'table_name_2015.csv.gz'] makes sense to me, but what about 'bytes' and 'hash'? For bytes, should we just add up the sizes of the individual files?

I'm not really sure where to start for hashing... a list of hashes? Or a dictionary mapping paths (keys) to hashes (values)? On a related note, are hashes checked when running goodtables.validate()?
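To make the question concrete, here's a rough sketch of the kind of resource descriptor I mean (the file names are made up, and summing the bytes is just the approach I'm asking about, not something the spec prescribes):

```python
import os

# Hypothetical partition files for one table, one per year (names made up
# to match the example above).
parts = [
    "table_name_2017.csv.gz",
    "table_name_2016.csv.gz",
    "table_name_2015.csv.gz",
]

# One resource using the "Data in Multiple Files" pattern: "path" is a list,
# and "bytes" is just the sum of the individual part sizes -- the approach
# I'm asking about, not something the spec prescribes.
resource = {
    "name": "table_name",
    "path": parts,
    "format": "csv",
    "bytes": sum(os.path.getsize(p) for p in parts),
    # "hash": ???  <- this is the part I'm unsure about
}
```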

roll commented 4 years ago

@cmgosnell Do you mean that these files are parts of one big file? If yes, the checksum should be calculated from that big file. Otherwise, we need to use a separate resource for each file.
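Roughly like this, i.e. streaming the parts in order as if they were one file (just a sketch; the paths and the choice of algorithm are illustrative, not a recommendation yet):

```python
import hashlib

def multipart_hash(paths, algorithm="md5"):
    """Hash the logical "big file": the parts streamed in order, as if concatenated."""
    digest = hashlib.new(algorithm)
    for path in paths:
        with open(path, "rb") as part:
            for chunk in iter(lambda: part.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()

# e.g. multipart_hash(["table_name_2017.csv.gz", "table_name_2016.csv.gz"])
```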

zaneselvans commented 4 years ago

@roll this is in the case where the tabular data resource has been partitioned into several different files -- how does one include correct sizes and hash values for each of the constituent files in the metadata? And where within the package validation process are the hashes of the files checked? Does goodtables do that? It seems like a package with files whose hashes don't match those provided in the metadata should fail to validate at some point, no?

roll commented 4 years ago

@zaneselvans Sorry for the slow answer; I'm not fully around this week.

This one and https://github.com/catalyst-cooperative/pudl/issues/352 still require some analysis from my side. I've created an issue for it - https://github.com/frictionlessdata/pilot-catalyst/issues/5 (cc @cmgosnell)

Both topics are "on the tech edge" for us (multipart files are the newest addition to the specs, and compression is not yet part of the specs), and I need more time to recommend something.

For now, I would suggest using whatever works (e.g. not checking hashes, which will be implemented in goodtables later in this pilot), which gives me a week or two to get back to you with something more concrete.

cc @lwinfree

cmgosnell commented 4 years ago

Hey @roll! This all sounds good. For now, I've disabled the hashes for multi-part files and stopped compressing the CEMS files (but left the infrastructure in place on our side to switch that back on when it's supported). I don't know if this is expected, but I've been testing just summing the bytes of all of the multi-part files, and that isn't causing any validation errors, so I'll keep that in place for now as well.

zaneselvans commented 4 years ago

Is there anything validating the file sizes either on our end or in the FD tools? I.e. is it just succeeding b/c nobody is checking?

roll commented 4 years ago

It's not checked on the FD side at the moment, but I would suggest that we add these checks during this pilot.
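In the meantime, a check on your side could catch size mismatches; a minimal sketch, assuming a local datapackage.json and resources that declare a (possibly summed) bytes value:

```python
import json
import os

def check_resource_sizes(descriptor_path="datapackage.json"):
    """Compare each resource's declared 'bytes' against the files on disk."""
    with open(descriptor_path) as f:
        package = json.load(f)
    base = os.path.dirname(os.path.abspath(descriptor_path))
    for resource in package.get("resources", []):
        declared = resource.get("bytes")
        if declared is None:
            continue
        paths = resource["path"]
        if isinstance(paths, str):
            paths = [paths]
        actual = sum(os.path.getsize(os.path.join(base, p)) for p in paths)
        if actual != declared:
            print(f"{resource['name']}: declared {declared} bytes, found {actual}")
```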