frictionlessdata / datapackage-pipelines

Framework for processing data packages in pipelines of modular components.
https://frictionlessdata.io/
MIT License

Datapackage name is not automatically set to pipeline name #176

Closed cschloer closed 4 years ago

cschloer commented 4 years ago

On Ubuntu with Python 3.7.5 (and all other versions), the resulting datapackage has _ as its name. I'm not sure if this is a consequence of moving over to dataflows, because dataflows uses the add_metadata processor to add a name to the datapackage, but I think it's rather important that this feature exist in DPP without needing an extra processor. @roll @akariv

roll commented 4 years ago

@cschloer Is it for every pipeline? I mean how can I reproduce it?

cschloer commented 4 years ago

Yeah this happens for every pipeline created by dpp as far as I can tell.

akariv commented 4 years ago

From the code it looks like this was never dpp's behaviour, and not a recent change...

  1. I think it's better to start the pipeline with the explicit 'update_package' processor instead of magically setting the dataset name by some heuristic
  2. This might be a breaking change for some users who expect a certain dataset name

cschloer commented 4 years ago

I think setting the datapackage.json name to the name of the pipeline-spec.yaml is not magic at all - it feels in fact like a bug that it is instead set to _ (hence why we created this issue, thinking it was a bug). I see how adding an update package step seems like an easy workaround that can be applied every time, but that will quickly get annoying and feel rather chore-y for something that is supposed to be saving time. I could automatically insert an update_package step into the pipeline based on the pipeline-spec.yaml name but that feels more magic (automatically adding a new step) than it just already being called the pipeline name.
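
For illustration, the "automatically insert an update_package step" workaround mentioned above could look roughly like the following. This is only a sketch under assumptions: the spec is taken to be already loaded into a plain dict keyed by pipeline id, each pipeline holding a list of steps under 'pipeline', and the helper name ensure_package_name is hypothetical and not part of DPP.

```python
import copy


def ensure_package_name(spec: dict) -> dict:
    """Return a copy of a loaded pipeline spec in which every pipeline
    sets a package name via an 'update_package' step.

    Hypothetical helper, not part of DPP. Assumes the spec is a dict of
    {pipeline_id: {"pipeline": [ {"run": ..., "parameters": {...}}, ... ]}}.
    """
    result = copy.deepcopy(spec)  # leave the caller's spec untouched
    for pipeline_id, details in result.items():
        steps = details.setdefault("pipeline", [])
        # Skip pipelines that already set a name explicitly.
        already_named = any(
            step.get("run") == "update_package"
            and "name" in (step.get("parameters") or {})
            for step in steps
        )
        if not already_named:
            # Prepend a step that uses the pipeline id as the package name.
            steps.insert(
                0,
                {"run": "update_package", "parameters": {"name": pipeline_id}},
            )
    return result
```

Note that the pipeline id is used as-is here, so an id containing characters that are invalid in a package name would still need to be cleaned up first.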

roll commented 4 years ago

I think that not having name at all would be the purist option making a clear separation between pipeline/package (will not happen because it's def breaking).

But as it's set to a value (an underscore) by default anyway, I think the pipeline title is a more reasonable default. I didn't see that DPP has this underscore as a part of its API or docs, so I don't think that changing it will be breaking.

akariv commented 4 years ago

Unless I already have title to be something like 'My Lovely Dataset' which is an invalid value as a dataset name, so would cause the pipeline to fail...

roll commented 4 years ago

Good point. We can slugify, of course. But yea I don't know whether this change is worth it

cschloer commented 4 years ago

I've been using a regular expression to remove invalid characters, set everything to lowercase, and replace spaces with underscores. Though I agree that adds a level of "magic" to the equation.
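
A minimal sketch of that kind of cleanup, for readers who want something concrete. The function name slugify_name and the exact character set are assumptions made for illustration (lowercase letters, digits, ".", "_" and "-" are kept); it is not the code actually used in the thread, and it is not something DPP does itself.

```python
import re


def slugify_name(title: str) -> str:
    """Turn a free-form title into a conservative package name.

    Hypothetical helper: lowercases, turns whitespace into underscores,
    and drops every other character outside [a-z0-9._-].
    """
    # Lowercase first so uppercase letters are kept rather than stripped.
    name = title.strip().lower()
    # Replace each run of whitespace with a single underscore.
    name = re.sub(r"\s+", "_", name)
    # Drop anything that is not a lowercase letter, digit, ".", "_" or "-".
    name = re.sub(r"[^a-z0-9._-]", "", name)
    # Fall back to the current default when nothing valid is left.
    return name or "_"


print(slugify_name("My Lovely Dataset"))  # my_lovely_dataset
```

The fallback to "_" only mirrors the default described in this issue; a real implementation might prefer to raise an error instead.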

roll commented 4 years ago

@akariv @cschloer Closing it and #178, for now.

We had a discussion and as a summary: