datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

Why not combine the data resources and the datapackage.json into one JSON file? #92

Closed · SPTKL closed this issue 5 years ago

SPTKL commented 5 years ago

I've been using dataflows to process data and dump it to S3 using a custom dumper I wrote based on the datapackage-pipelines-aws package. Everything works pretty well; however, when it comes to version control I've run into issues. Because the data file (usually a CSV) and the datapackage.json are dumped separately, it's difficult to compare existing versions (using md5 checksums): I might end up creating a new version of the datapackage.json but not of the CSV. With the current structure it's hard to tell whether, when we create a new datapackage.json, we should cache a new CSV too. I was wondering if it would be beneficial to dump the data resources together with the datapackage.json in one big JSON file?
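
For context, a minimal sketch of the current two-file layout (using the stock dump_to_path processor rather than my custom S3 dumper; the data and paths are illustrative):

```python
from dataflows import Flow, dump_to_path

# A tiny flow: dump_to_path writes the descriptor and the data as separate
# files (out/datapackage.json plus a CSV per resource), so their checksums
# can change independently between versions.
data = [{'id': 1, 'value': 'a'}, {'id': 2, 'value': 'b'}]
Flow(data, dump_to_path('out')).process()
```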

akariv commented 5 years ago

Actually, the datapackage standard allows for inline data, so that's definitely possible (and compliant with the spec). In many cases, having a single JSON file is not ideal, as it would be difficult to stream and would usually require loading it entirely into memory, which wouldn't be good for very large datasets. However, for small and medium datasets it could work. Check out the jsondumper class - you can modify it to achieve what you want. If you want to make a PR (e.g. enable this using an inline=True parameter), that would be awesome.
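
A rough sketch of what such a single-file dump could look like, with the rows carried inline under the resource's data property as the spec allows (the structure below is illustrative, not the output of the existing JSON dumper):

```python
import json

# Hypothetical single-file datapackage: the resource holds its rows inline
# under "data" instead of pointing at a separate CSV via "path".
descriptor = {
    'name': 'example',
    'resources': [{
        'name': 'res_1',
        'profile': 'tabular-data-resource',
        'schema': {'fields': [
            {'name': 'id', 'type': 'integer'},
            {'name': 'value', 'type': 'string'},
        ]},
        'data': [
            {'id': 1, 'value': 'a'},
            {'id': 2, 'value': 'b'},
        ],
    }],
}

with open('datapackage.json', 'w') as f:
    json.dump(descriptor, f, indent=2)
```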
