frictionlessdata / datapackage-py

A Python library for working with Data Packages.
https://frictionlessdata.io
MIT License

No way of manipulating data tables in memory #273

Closed as2875 closed 4 years ago

as2875 commented 4 years ago

@lwinfree, @sje30

When converting to Frictionless from another data format (e.g. HDF5), scripts have to

  1. import the package descriptor,
  2. make a package with the descriptor,

     ```python
     import datapackage
     import pkg_resources

     # Locate the descriptor shipped with the calling package.
     datapackage_path = pkg_resources.resource_filename(
         __package__, "datapackage.json")
     # base: the directory that the package's data paths are resolved against.
     package = datapackage.Package(
         base_path=base, descriptor=datapackage_path)
     ```

  3. make some CSV files with the right names, and finally
  4. call `package.save`.

It would be useful to skip the stage of writing CSV files to disk. If this functionality already exists and I am missing something, please let me know; it would be very useful.


Please preserve this line to notify @roll (lead of this repository)

lwinfree commented 4 years ago

Hi @roll, when you are back next week, can you please look at this?

roll commented 4 years ago

Hi @as2875,

Could you please elaborate a little? Do you mean you don't want `package.save` to save the data, only the descriptor?

as2875 commented 4 years ago

Hi @roll. I mean that I don't want to write CSV files to disk, just the final zipped data package. Say I have some tables stored in Python data structures in memory. Rather than writing the tables to CSV files and then calling `package.save`, I would like to create some `Resource` objects, point them at the data structures, and then call `package.save`.
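
A minimal sketch of the kind of usage being asked for. The inline `data` resource feeding straight into `package.save` is hypothetical here; as confirmed below, datapackage-py did not support this at the time.

```python
from datapackage import Package

# In-memory table; no CSV file on disk.
rows = [["id", "value"], [1, 0.5], [2, 0.7]]

package = Package()
# Hypothetical: a resource backed directly by the in-memory rows...
package.add_resource({"name": "table", "data": rows})
# ...which package.save would then serialize straight into the zip.
package.save("datapackage.zip")
```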

The example comes from converting multiple HDF5 files to Frictionless data packages. At the moment, I have to create a package, read in the HDF5 datasets, store them as 2-D lists, write the contents of the lists to CSV files, call `package.save`, and then delete the intermediate CSV files. I want to skip the operations involving CSV files.
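
For concreteness, a sketch of that pipeline; the file name `measurements.h5`, the dataset name `table`, the header, and the `build` directory are all illustrative assumptions.

```python
import csv
import os

import h5py
from datapackage import Package

base = "build"  # working directory for the intermediate CSV (assumed)
os.makedirs(base, exist_ok=True)

# Read an HDF5 dataset into a 2-D list.
with h5py.File("measurements.h5", "r") as f:
    rows = f["table"][:].tolist()

# Write the rows to an intermediate CSV file.
csv_path = os.path.join(base, "table.csv")
with open(csv_path, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "value"])  # illustrative header
    writer.writerows(rows)

# Infer a package over the CSV and save the zipped data package.
package = Package(base_path=base)
package.infer("*.csv")
package.save("datapackage.zip")

# Delete the intermediate CSV -- the step this issue asks to avoid.
os.remove(csv_path)
```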

roll commented 4 years ago

Thanks. I think it's not possible at the moment. I've marked it as a feature request.

as2875 commented 4 years ago

Thanks @roll. This would make data conversion pipelines and parallel processing a lot simpler.

roll commented 4 years ago

@as2875 BTW, there is dataflows (https://github.com/datahq/dataflows); I wonder whether you could achieve this goal using a flow.
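
For example, a minimal flow (names illustrative) that takes in-memory rows and dumps a zipped data package directly, with no hand-managed intermediate CSVs:

```python
from dataflows import Flow, dump_to_zip

# In-memory rows; an iterable of dicts becomes a resource in the flow.
rows = [{"id": 1, "value": 0.5}, {"id": 2, "value": 0.7}]

Flow(
    rows,
    dump_to_zip("datapackage.zip"),  # writes a zipped data package
).process()
```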

as2875 commented 4 years ago

Thanks for the suggestion @roll. dataflows looks promising.

roll commented 4 years ago

MERGED into https://github.com/frictionlessdata/frictionless-py/issues/439

More info about the Frictionless Framework: https://github.com/frictionlessdata/frictionless-py