frictionlessdata / datapackage-py

A Python library for working with Data Packages.
https://frictionlessdata.io
MIT License

datapackage getters and where/who to ask questions about frictionless data? #224

Closed cottrell closed 5 years ago

cottrell commented 5 years ago

I have asked questions a number of times on gitter but got zero responses. I'm not sure if frictionlessdata is still a going concern or if it has been abandoned entirely.

Where should datapackage "getters" go for non-local data? Is this a datapipeline, or is there some feature of datapackages themselves that I have missed? For example, wget plus a schema, delimiter, etc.
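
Concretely, by "wget plus a schema, delimiter, etc." I mean pairing a remote URL with its parsing hints in one descriptor. A rough sketch of the idea (the name, URL, and fields here are made up):

from datapackage import Package

# Illustrative descriptor: a non-local source plus the parsing hints
# (dialect, schema) that a "getter" would need to carry along.
package = Package({
    'name': 'example-dataset',
    'resources': [{
        'name': 'example-resource',
        'path': 'https://example.com/data.csv',
        'dialect': {'delimiter': ';'},
        'schema': {'fields': [
            {'name': 'id', 'type': 'integer'},
            {'name': 'value', 'type': 'number'},
        ]},
    }],
})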

roll commented 5 years ago

@cottrell Sorry! I must have missed it.

The project is very much alive (https://blog.okfn.org/2018/07/12/sloan-foundation-funds-frictionless-data-for-reproducible-research/), but in recent months, truth be told, we've been working with very limited time and resources.

Can you please expand on your question a little bit?

Also cc-ing @akariv as a datapipelines specialist.

cottrell commented 5 years ago

No worries, I'll paste my question from gitter here; let me know if it needs more expanding. I dug through the code of datapackage and datapipeline a bit, but that was a while ago ... I will have to reload it into memory if the discussion becomes more detailed.

What is the right place for URL-based sources (a simple API that produces datapackages)? It feels like this should maybe be a datapipeline with a requests.get, a transform, and a save, but I'm not sure. There doesn't seem to be much in the way of extractors or data-pipeline builders. How are people doing this?

Basically, I want to do a dp-create-from-url and get a base starter template ... not too hard, but after reading for a while, I can't see an obvious place to include this in the datapackage or datapipeline projects. For example, the datapackage infer method is largely centred around local paths. I started to modify it to take a URL and do a cache-and-pull, but then thought pipelines should be the way.

The pattern that emerges with extractors is that you start with singleton .get() calls with no args, but then you have some that take args. And then you have a kind of collection of convenience arg generators that give you things like date ranges, today, etc.

So in summary, my question is: is this pattern a pipeline or a datapackage?
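
To make it concrete, the helper I keep wishing existed is something like the sketch below, built on datapackage-py. package_from_url is a hypothetical name, not an existing function, and I'm assuming infer() can cope with a remote path:

from datapackage import Package, Resource

def package_from_url(url, name='scratch'):
    # Hypothetical dp-create-from-url helper (not part of the library):
    # point a resource at a remote file, infer what metadata we can,
    # and wrap it in a starter package.
    resource = Resource({'path': url})
    resource.infer()  # fills in name, format, schema, ... where inferable
    return Package({'name': name, 'resources': [resource.descriptor]})

# e.g. package_from_url('https://example.com/data.csv', name='example')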

roll commented 5 years ago

@cottrell I tried to start writing something useful a few times but TBH failed. It feels like this sits at a different level than the libs. Probably the best place to discuss this kind of problem is datahub.io's channel - https://gitter.im/datahubio/chat

akariv commented 5 years ago

@cottrell have you tried using dataflows?

e.g.:

from dataflows import Flow, load, update_package, dump_to_path

Flow(
    load('<remote-url-of-file>'),           # URL of a CSV/Excel/... file
    update_package(name='<dataset-name>'),  # set package-level metadata
    dump_to_path('path/to/out/dir'),
).process()

This will create a datapackage containing the CSV/Excel/... file that you point to.

Take a look at: https://github.com/datahq/dataflows
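
And if you want the "getter that takes args" pattern from your question, a flow wraps naturally in a plain function. A rough sketch using the same dataflows calls (the parameter names are just illustrative, not a dataflows API):

from dataflows import Flow, load, update_package, dump_to_path

def get(url, name, out_dir='out'):
    # Each call builds and runs a fresh flow for the given source.
    Flow(
        load(url),
        update_package(name=name),
        dump_to_path(out_dir),
    ).process()

get('https://example.com/data.csv', name='example-dataset')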

roll commented 5 years ago

I agree with @akariv

Closing for now