Closed cottrell closed 5 years ago
@cottrell Sorry! I must have missed it.
The project is very much alive (https://blog.okfn.org/2018/07/12/sloan-foundation-funds-frictionless-data-for-reproducible-research/), but in recent months, it's true, we've had very limited time and resources.
Could you please expand your question a little?
Also cc-ing @akariv as a datapipelines specialist.
No worries, I'll paste my question from gitter here, let me know if it needs more expanding. I dug through the code of datapackage and datapipeline a bit but that was a while ago ... will have to reload it in memory if the discussion becomes more detailed.
What is the right place for URL-based sources (a simple API that produces datapackages)? It feels like this should maybe be a datapipeline with a requests.get, a transform, and a save, but I'm not sure. There doesn't seem to be much in the way of extractors or data-pipeline builders. How are people doing this?
Basically, I want to do a dp-create-from-url and get a base starter template. That's not too hard, but after reading for a while I can't see an obvious place to include this in the datapackage or datapipeline projects. For example, the datapackage infer method is largely centred around local paths. I started modifying it to take a URL and do a cache-and-pull, but then I thought pipelines should be the way. The pattern that emerges with extractors is that you start with singleton .get() calls with no args, but then you have some that take args, and then you have a kind of collection of convenience arg generators that give you things like date ranges, today, etc. So in summary, my question is: is this pattern a pipeline or a datapackage?
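For concreteness, here is roughly what I mean by dp-create-from-url. This is just a stdlib-only sketch; the function names and the naive type inference are mine, not anything from the datapackage library:

```python
import csv
import io
import urllib.request


def infer_fields(rows):
    """Guess a minimal Table Schema field list from a header row plus sample rows."""
    header, samples = rows[0], rows[1:]
    fields = []
    for i, name in enumerate(header):
        types = set()
        for row in samples:
            value = row[i]
            try:
                int(value)
                types.add("integer")
                continue
            except ValueError:
                pass
            try:
                float(value)
                types.add("number")
                continue
            except ValueError:
                pass
            types.add("string")
        # Widen to the most general type seen in the samples.
        if "string" in types or not types:
            ftype = "string"
        elif "number" in types:
            ftype = "number"
        else:
            ftype = "integer"
        fields.append({"name": name, "type": ftype})
    return fields


def dp_descriptor_from_csv_text(name, path, text):
    """Build a starter datapackage.json descriptor from raw CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    return {
        "name": name,
        "resources": [{
            "path": path,
            "profile": "tabular-data-resource",
            "schema": {"fields": infer_fields(rows)},
        }],
    }


def dp_create_from_url(url, name="scratch"):
    """Fetch a remote CSV and return a starter descriptor for it."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    return dp_descriptor_from_csv_text(name, url, text)
```

The question is whether something like this belongs in datapackage itself (an infer that accepts URLs) or should only ever live as a pipeline step.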
@cottrell I tried to start writing something useful a few times but TBH failed. It feels like that is on another level than the libs. Probably the best place to discuss this kind of problem is datahub.io's channel - https://gitter.im/datahubio/chat
@cottrell have you tried using dataflows?
e.g. :
from dataflows import Flow, load, update_package, dump_to_path

Flow(
    load(<remote-url-of-file>),
    update_package(
        name=<dataset-name>,
    ),
    dump_to_path('path/to/out/dir')
).process()
Will create a datapackage containing the CSV/Excel/... that you point to.
Take a look at: https://github.com/datahq/dataflows
I agree with @akariv
Closing for now
I have asked questions a number of times on gitter but got zero responses. I am not sure if frictionlessdata is still a going concern or if it has been abandoned entirely.
Where should datapackage "getters" go for non-local data? Is this a datapipeline, or is there some feature of datapackages themselves I have missed? For example, wget plus a schema, delimiter, etc.