Open roll opened 2 months ago
frictionless-r ignores the resource.compression
property. It does support reading compressed files based on the extension found in path
: https://docs.ropensci.org/frictionless/reference/read_resource.html#file-compression
So I'm neutral regarding promoting this to the specs: frictionless-r will likely continue to ignore it.
Regarding resource.innerPath
, I'd rather not see an additional property for files, since we already have to deal with data
and path
. I would express this in path
, as follows:
"path": "path/to/my/archive.zip/data.csv"
frictionless-r ignores the resource.compression property. So I'm neutral regarding promoting this to the specs
Agreed -- inferring compression from the path seems just fine. Do we get some other value from having a "compression" property that I'm missing?
Regarding resource.innerPath, I'd rather not see an additional property for files, since we already have to deal with data and path. I would express this in path, as follows:
I think this works for local paths, but not as well for remote ones because it's harder to detect where the zip file is (archive.zip may not be an actual zip file, but just part of the url, and it's harder to check compared to local)
That said, I'm not keen on innerPath
either. Regarding compression I would expect two main scenarios / use cases:
1) Individual resources are compressed (as described in the pattern). This is useful for remote data packages being hosted remotely -- you only need to download the data you need (and it is transferred compressed). This doesn't need innerPath
because the compressed files do not contain multiple files.
2) The entire data package (including datapackage.json) is compressed. This is useful for an archival blob of the entire package that can be distributed as a single unit, without dependencies. (Similar to an opendocument spreadsheet with multiple sheets). This doesn't need innerPath
either, because everything is already inside the zip file so paths are already internal to the zip.
innerPath
is only applicable when a data package is a) referencing multiple resources in a single zip and b) the datapackage.json isn't included in that level of compression. I don't think this is something we want to support / encourage -- Or maybe I'm missing a benefit / use case?
I don't think this is something we want to support / encourage.
I agree.
Agreed -- inferring compression from the path seems just fine. Do we get some other value from having a "compression" property that I'm missing?
I will play a devil's advocate role here, but you know, inferring a Table Schema usually works just fine as well :smiley: So in my opinion it is just a question of increasing interoperability documentation quality. I think, currently, the spec doesn't mention compression at all so the behavior does look just undefined generally speaking. I think we at least need to clarify it. On the other hand as there is already resource.format
, resource.compression
feels like the same kind of indicator.
That said, I'm not keen on innerPath either. Regarding compression I would expect two main scenarios / use cases:
So regarding inner path I think it's only applicable if a data publisher has to use some artifact i.e. ZIP file that they cannot control so they map resources from this archive similarly how excel sheets mapped onto resources with Table Dialect
I will play a devil's advocate role here, but you know, inferring a Table Schema usually works just fine as well 😃
Haha, to play counter devil's advocate: It's standard for file names to include compression type in their extension (file1.csv.zip, file2.csv.gz), but there's not a similar standard for field names to include frictionless field type information (column1.integer, column2.boolean, column3.number).
To me, the question is, do we want to allow compressed paths without extensions, or compressed paths with extensions that don't match the compression type? Otherwise, resource.compression
is redundant and will be largely ignored by implementations (as frictionless-r does right now).
I think it's only applicable if a data publisher has to use some artifact i.e. ZIP file that they cannot control so they map resources from this archive
If they don't control the ZIP, then there's all kinds of malformed scenarios we can imagine... The question is where we draw the line.
For example, the ZIP could have other nested ZIPs in it, which in turn hold the table data... That would require nested innerPath
properties, which I don't think we should support either.
similarly how excel sheets mapped onto resources with Table Dialect
Selecting an excel sheet in a workbook or a table in an SQLite db is a lot more well-defined, I think, because unlike a generic multi-file archive they have a lot of guarantees / constraints (e.g. they only hold a specific kind of table data and cannot be nested)
Side note -- Does the sheetName
property in Table Dialect also allow you to select particular tables in an SQL db? (are SQLite DBs considered as "spreadsheet" formats?)
I think, currently, the spec doesn't mention compression at all so the behavior does look just undefined generally speaking. I think we at least need to clarify it.
I agree on this though! I'd suggest something like 1) compression type MAY
be specified via path extension (and here's a supported list of formats) and 2) when paths to archives are used, they MUST
only contain only one file.
It's standard for file names to include compression type in their extension (file1.csv.zip, file2.csv.gz)
I think it's more like a convention rather than a standard
Side note -- Does the sheetName property in Table Dialect also allow you to select particular tables in an SQL db? (are SQLite DBs considered as "spreadsheet" formats?)
No, we had table
property in the draft for SQL but I removed it for now to wait for an actual user request
I agree on this though! I'd suggest something like 1) compression type MAY be specified via path extension (and here's a supported list of formats) and 2) when paths to archives are used, they MUST only contain only one file.
I think, currently, we don't define anything regarding the form of resource.path
(regarding format or compression). We might consider adding compression information to https://datapackage.org/specifications/data-resource/#path-or-data-required. Personally, I don't have preferences -- requiring one file per archive seems a reasonable approach. My main point here is as "an implementor" I need some clear definition like "if it is an archived file, resource.path
MUST ends with .gz or .zip prefix indicating the compression algorithm" (and reading this sentence I still feel that a dedicated property might be kind better than parsing a path :smiley: )
No, we had
table
property in the draft for SQL but I removed it for now to wait for an actual user request
@roll you mean something like this would not be supported? I do use this internally and crafted this gist for a user query in Frictionless Slack.
@fjuniorr
We can totally add dialect.table
if there is a demand for it cc @dafeder
I think it's more like a convention rather than a standard
I agree, I was being sloppy with my language there :)
To be clear, I'm neutral on the resource.compression
property. My only slight preference here is that we choose to use the extension in the path, OR resource.compression
, but not both... that way they are not in competition. But I defer to stronger opinions on this.
My main point here is as "an implementor" I need some clear definition We might consider adding compression information to https://datapackage.org/specifications/data-resource/#path-or-data-required.
Agreed! I think that's a perfect place to put this info.
I do think enforcing one file per archive is a good idea (no innerPath
property). This way it's stays natural to specify multi-part resources with compression, and we don't need an exception for archives (e.g. "path" = ["file1.csv.gz", "file2.csv.gz"]
). We can always relax the one file per archive rule and add an innerPath
if there's demand for it later.
I also think it'd be nice to mention somewhere the practice of compressing entire data packages. (frictionless-py already supports this as well).
We can totally add dialect.table if there is a demand
I would also very much support this... I was surprised to see spreadsheet support & mention of sql databases, but no clear way to select an sql table :)
Overview
There is quite a simple recipe - https://datapackage.org/recipes/compression-of-resources/ - adding a new
resource.compression
property to the Data Resource spec. It's supported byfrictionless-py
.Note
It might make sense to consider frictionless-py's
resource.innerPath
as well for providing a path inside an archive.