frictionlessdata / datapackage-js

A JavaScript library for working with Data Package.
http://frictionlessdata.io/
MIT License
42 stars 15 forks source link

Support saving/loading for zip file #93

Closed roll closed 6 years ago

roll commented 6 years ago

Overview

It's part of specs implementation requirements:

http://specs.frictionlessdata.io/implementation/

rufuspollock commented 6 years ago

My 2c: this sort of stuff should not go into this core lib - we'll get really bloated. Instead we have help libraries to do writing / serializing of data package to specific forms.

Also i didn't realize this was part of implementation guide. We don't yet have an agreed spec for compressing or bundling - https://github.com/frictionlessdata/specs/issues/132 + https://github.com/frictionlessdata/specs/issues/290. I think it best to get those complete before people start implementing.

/cc @pwalsh as czar of implementations

rufuspollock commented 6 years ago

@roll what happened re my comment on this? /cc @vitorbaptista

roll commented 6 years ago

@rufuspollock It's a part of the specs implementation requirements - https://frictionlessdata.io/specs/implementation/. At the moment datapackage implementations on 8 languages support saving/loading a data package to zip. So this implementations just follows the general approach.

Also it's a key feature for the project we work now with our partners.

I would say bloating is a problem only for client-side JavaScript. Server-side libraries don't really sensitive for things this like this. So if datapackage-py could load/save zip there is no reason to remove this capability. It's good for users and don't really affect installation speed/size etc.

And for the client-side JavaScript lib we just could do bundle-size optimization/splitting. I've done a lot of profiling and for now I would say the most problematic size-wise is shipping profiles not deps like this.

rufuspollock commented 6 years ago

@roll my point was there was no response to the comment left on 13 sep until you shipped the feature now. In my original comment i pointed out that i did not understand why this was part of the implementation specs when it is not part of the spec itself.

But the real point is about how we factor this library - something i've mentioned a lot. /cc @vitorbaptista

roll commented 6 years ago

@rufuspollock

In my original comment i pointed out that i did not understand why this was part of the implementation specs when it is not part of the spec itself.

That's the question to @pwalsh. As mentioned it's a part of the implementation specs. Also it was a part of every tool fund implementation contract. I could agree on having it (and I do) or disagree but I work on the implementations level. High level requirements like this are something to decide on the yours/@pwalsh specs level.

pwalsh commented 6 years ago

The implementation document was designed to make it easier for new implementors (people not working on the Python or Javascript libraries). In particular, for those implementing for the tool fund.

I personally would not add zip support to the javascript library, unless we already have a good story in place for removing certain code paths in a browser distribution of the lib (I can't imagine reading or writing zips in the browser is very useful). On the other hand, I don't think we should keep talking about bloat without looking at the facts (distribution size of lib, etc.).

Stephen-Gates commented 6 years ago

I’ll be exploring the zip feature for use in an electron app.

pwalsh commented 6 years ago

I’ll be exploring the zip feature for use in an electron app.

Great! Will this be handled via a Node.js process, or the client-side Javascript?

roll commented 6 years ago

@Stephen-Gates I'll be finishing this work around Jan 7 (when I'll be back from the break) and also I'd like to sync with you/your team on some questions like the one mentioned by Paul.

rufuspollock commented 6 years ago

@roll before you complete could we consider whether stuff like this should go out of package as a separate mini library. /cc @pwalsh @vitorbaptista

See the data library analysis and recommendation document here:

https://hackmd.io/CwZgnOCMDs0LQEMAckBMdSsnMAjaIcADKgKYBmArGOSACbRJhA==?view

@pwalsh the bloat i'm talking about is less about package zip and more about conceptual complexity and library structure.

vitorbaptista commented 6 years ago

@rufuspollock If we think just on the library API level, loading/saving a datapackage ZIP adds:

Ignore the method names, as they might be different. The risk I see is if we add support for other packaging methods in the future (tar.bz2, lzma, ...), which would requiring adding new methods (or renaming the ones we have). Other than that, I don't see how this adds much conceptual complexity to the libraries' interfaces (it does add some to their implementation).

On a technical level, it would be great if we could avoid having a datapackage_with_zip_support.js and datapackage_without_zip_support.js. Instead, we could dynamically detect if a ZIP library is available. Something like:

function to_zip(path) {
    if (!has_zip_library()) {
        throw Error('ZIP support needs library X')
    }
    // ...
}

As, considering our experience with datapackage-py, the ZIP support adds just a few lines of code, so file size increase is negligible.

Stephen-Gates commented 6 years ago

@roll we have our kick off for Data Curator v2 on Wed 10 Jan. Perhaps we can catch up after that?

rufuspollock commented 6 years ago

@vitorbaptista the issue is not the size of the code bundle but the overall approach and conceptual complexity. All software is made up of features that are small in themselves but which in aggregate are complex. The point here is not about zip per se but about the general principal of where are "loaders" and "dumpers" located - as per our discussion back in November.

vitorbaptista commented 6 years ago

@rufuspollock I understand, I just don't think this adds much complexity to the library interface. I mean the API of the package won't change much, as conceptually reading from a ZIP is basically the same as reading from the local filesystem or HTTP. It's just a different protocol, but the intent (reading a file) is the same.

It does add some complexity to the implementation of the library, but that's OK IMHO. Also, implementing before having it as part of the spec is a risk. However, as this pattern is already working on datapackage-py for a while, I don't see it becoming much different whenever it enters the spec. We can tweak the code as needed.

On where "loaders" and "dumpers" should be located, which discussion are you referring to? The data.js one? Are you referring to the dump_to_zip(datapackage, path) vs datapackage.to_zip(path) different styles ("thin" vs "fat" objects)?

roll commented 6 years ago

I think https://github.com/frictionlessdata/implementations/issues is better place to discuss high-level architecture.

My 2c on this:

I don't think that fat objects vs thin objects or something like this is a key question. TBH each of this tableschema/datapackage libs are relatively simple.

But there 9(!) implementations of the tableschema + datapackage libraries. And there are almost 3 implementations are exact ports of each others - Python/JavaScript/Ruby. For example zip support here is just copied from datapackage-py.

I think the key question is a maintainability and mass adoption of this libraries. On Frictionless Data project level fat/thin.. I don't see the difference is a key. What really matters in my eyes - how this libraries could be maintained with such a limited resources and how it will be adopted.

So I just want to note that considering any global changes to the libs we better see the whole picture of the implementation stack. For example if we will dramatically change datapackage-js lib how it affects the whole project:

PS. If I start from scratch datapackage lib now for only one environment (e.g. JavaScript) I will probably goes with inversion of control and etc. It's smart and great for many use case (esp. great for JavaScript). But what I always was trying to emphasize - Frictionless Data implementations stack is much more that just 1 lib.