@paulgirard great question. First some relevant existing issues:
If I understand it correctly, you have what sounds like a single "logical" resource that is split into multiple physical files.
The options are then:
Thank you! I thought chunks implied not repeating the headers, which was a no-go for me. If it's possible with headers, it's a nicer solution because it makes a much lighter datapackage. Note: it's not a bad idea to use group to specify source information. In my case it's OK because my file names are built from the source name.
I'll read over the issues to learn more but I bet chunks might do it.
PS: I've tried to specify the remote schema only for the first resource and it went fine. So that's a safe workaround, I guess.
Dear @roll @rufuspollock, I am trying to use a multipart set-up. I don't understand which spec element I should use to declare that the chunks all have headers. By default the Python lib considers that chunks have no headers. I couldn't find how to deal with headered chunks in the specification.
I can read about the concept, and even a warning about headers, here: https://frictionlessdata.io/specs/data-resource/ But no more information about multipart and chunks?
I tried to add dialect.header = true but that's not it.
Here is my multipart resource:
```json
{
  "name": "flows",
  "sources": [
    {
      "title": "Trade statistics from various countries archives and research papers, see data/sources.csv for details"
    }
  ],
  "encoding": "utf-8",
  "profile": "tabular-data-resource",
  "title": "Trade flows transcribed from various international trade statistics volumes",
  "format": "csv",
  "schema": "flows_schema.json",
  "dialect": {
    "header": true
  },
  "path": [
    "data/flows/AnnalesDuCommerceExterieurFaitsCommerciaux3eSerieAvisDivers_IndesOrientalesNeerlandaises_Fc2.csv",
    "data/flows/ForeignCommerceAndNavigationOfTheUnitedStates_191112.csv",
    "data/flows/AnnalesDuCommerceExterieurFaitsCommerciaux3eSerieAvisDivers_OceanieEtAustralie_Fc7.csv",
    "data/flows/ForeignCommerceAndNavigationOfTheUnitedStates_1938.csv"
  ]
}
```
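For context, this is roughly how I read the resource today (a minimal sketch, assuming datapackage-py 1.x and that datapackage.json sits next to the data/ folder): with plain byte-level concatenation, any header row in the second and later chunks ends up treated as data, which is exactly my problem.

```python
# Minimal sketch (assumes datapackage-py 1.x). With the current behaviour the
# parts are concatenated as-is, so a header line at the top of chunks 2..n
# either leaks into the data or breaks type casting against the schema.
from datapackage import Package

package = Package("datapackage.json")    # descriptor containing the resource above
flows = package.get_resource("flows")    # the multipart resource

rows = flows.read(keyed=True)            # rows keyed by the schema field names
print(len(rows), rows[0])
```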
@paulgirard great question 👍 I think this is under-specified in the spec 😉 This was actually discussed at some length in the original speccing here: https://github.com/frictionlessdata/specs/issues/228
The recommendation at that time was to do something explicit but this did not make it into the spec - I think we probably want something. My sense is that:
`noChunkHeaders`
...
Here's the thread summarized:
https://github.com/frictionlessdata/specs/issues/228#issuecomment-177621157
What do we do about repeated headers, e.g. in CSV files? Not so sure. I feel this is properly a separate option, e.g. stripChunkHeaders, which means you should take the first line of all but the first chunk and throw it away (the first line being everything up to \n or \r\n). Thoughts welcome.
https://github.com/frictionlessdata/specs/issues/228#issuecomment-177700695
It's easy to determine whether a header is repeated or not. I don't think we need an explicit property. It can be done automatically.
https://github.com/frictionlessdata/specs/issues/228#issuecomment-179098800
My feeling is to default to stripping the first line in subsequent files the way csvstack does. After all, what is a CSV file with no header :)? In that case, of course, we would need a csvsplit utility.
Rufus' final summary
Stripping headers: I think we want an explicit boolean property so that it is not something left to guesswork by external tools (though such tools are welcome to check / guess)
...
The one remaining question is what to do with headers:
- https://github.com/onyxfish/csvkit/issues/188 - csvkit supports --no-header now
- The Postgres COPY command as used in Redshift has "ignoreheader" as a number, which is the number of rows to skip: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-ignoreheader

My sense here is that in Data Package the behaviour is naive: simple concatenation. In Tabular Data Package there could be a concept of `skipInitialRows` or `headerRow` etc - see #326
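For illustration, the csvsplit utility mentioned above could be as small as this (a rough sketch, not an existing tool in the frictionless toolchain):

```python
# Hypothetical "csvsplit"-style helper: split one large CSV into fixed-size
# chunks, repeating the header line in every chunk.
import csv

def split_csv(path, rows_per_chunk=100_000, prefix="chunk"):
    with open(path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        out, writer, chunk_index = None, None, 0
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if out:
                    out.close()
                chunk_index += 1
                out = open(f"{prefix}-{chunk_index}.csv", "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)   # every chunk carries the header
            writer.writerow(row)
        if out:
            out.close()
```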
I propose to prepare a PR about this, starting with the Python libs.
But let's first agree on a solution. An important point for this choice is that current implementations consider there are no headers in chunks. Changing the default to headers-on would break this behaviour!
Based on the existing discussion summed up by @rufuspollock, I think we have three solutions:
Solution 1: chunks are considered to have a header even when they are part of a multipart resource. Reading chunks should then check header compliance before concatenating. We might want to introduce a "no-header" option for multipart resources which would change this. Actually, the dialect can already set the header option to false:
"dialect": {
"header": true
},
Should we just use this to define the chunk concatenation behaviour? If we do so, I feel the very first chunk should follow the same rule as the others; if not, I prefer solution 2. If we take this path, IMHO we have no choice but to reverse the default to with-header, since we would stick to the CSV spec.
Solution 2: in case the no-header behaviour in multipart situations should stay as in the current spec (i.e. only the first chunk has a header), I would prefer to add a multipart-specific option to the resource, like `noHeadersInChunks: boolean`. If we take this path, we still have to decide what the default is. Since the point of this option is that multipart chunks are a specific case, I think keeping the default at `noHeadersInChunks: true` is quite OK, so as not to disturb existing packages.
Solution 3: since we have the specification of what a header line must be, we could test whether the first line of every chunk is a header or not (rough sketch below). I generally don't like this magical behaviour but:
I need this headered-chunks behaviour and I can take the time to implement it, at least in the Python lib. @rufuspollock @roll please let me know what decision you take about the specification.
IMHO having chunks where only the first one has a header doesn't feel right. I understand the dependency on the path issue though. Solution 1 with the default set to true feels like the best one to me because it's not magic, it's coherent CSV-wise, and people who would have to migrate could do so just by removing the header line from the first chunk and adding dialect.header = false to the spec. Solution 3 is appealing but looks too close to the dangerous don't-try-to-be-too-clever zone. Solution 2 enforces that multipart chunks are a specific situation with specific behaviour.
Let me know. It would be a pleasure to PR any of the solutions mentioned here.
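For what it's worth, the "magic" check in solution 3 could be as small as comparing a chunk's first row to the schema field names (a rough sketch only, not an existing datapackage-py API):

```python
# Illustrative sketch of solution 3's detection: decide whether a chunk's
# first line is a header by comparing it to the schema field names.
import csv
import io

def first_line_is_header(chunk_bytes, field_names, encoding="utf-8", delimiter=","):
    text = chunk_bytes.decode(encoding)
    first_row = next(csv.reader(io.StringIO(text), delimiter=delimiter), [])
    return [cell.strip() for cell in first_row] == list(field_names)

# e.g. first_line_is_header(open("chunk-2.csv", "rb").read(),
#                           ["header1", "header2", "header3"])
```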
I think before any actions are taken it should be clarified at the specs level.
At the moment the multipart resource is part of the Resource spec, which means it's applicable to all possible file types.
For example 1:

```
path:
  - picture100kb-1.jpeg
  - picture100kb-2.jpeg
```

An implementation should concatenate it on a binary level to get picture.jpeg.
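In code, that binary-level concatenation is just something like this (a rough sketch, nothing datapackage-specific):

```python
# Rough sketch of binary-level concatenation: the parts are appended byte by
# byte, with no knowledge of what the file format actually is.
def concatenate_binary(part_paths, target_path):
    with open(target_path, "wb") as target:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                target.write(part.read())

# usage: concatenate_binary(["picture100kb-1.jpeg", "picture100kb-2.jpeg"], "picture.jpeg")
```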
Now let's say we have example 2:

csv-chunk-1:

```
header1, header2, header3
value1,value2,value3
...
value1, val
```

csv-chunk-2:

```
ue2,value3
...
value1,value2,value3
```

Our implementation is going to do the same as with the picture (binary-level concatenation) and we will get a proper table.csv. The second part of the file is not even a tabular file on its own; it's not properly parsable (it's a chunk of a big tabular file).
Let's have example 3:

csv-chunk-1:

```
header1, header2, header3
...
```

csv-chunk-2:

```
header1, header2, header3
value1,value2,value3
...
```

Now we can't concatenate it as binary chunks because there is a repeated header row. So we need to parse both chunks as tables. I don't see how an implementation should understand which to do (binary concatenation vs logical concatenation).
What about adapting the concatenation implementation when the resource has a tabular-data-resource profile?
@paulgirard Do you mean making it a special case?
Yes: tabular-data-resource => logical concatenation (i.e. handling the header row), leaving other cases at the binary level. That would not shock me, architecturally speaking.
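Very roughly, I imagine something like this (an illustrative sketch of the idea only, not actual datapackage-py code; the function name and structure are made up):

```python
# Illustrative sketch of the proposal: dispatch on the resource profile.
# Tabular multipart resources get logical concatenation (drop the header line
# of every chunk after the first), everything else stays byte-level.
def concatenate_parts(part_paths, profile, header=True):
    if profile == "tabular-data-resource" and header:
        lines = []
        for index, path in enumerate(part_paths):
            with open(path, "r", encoding="utf-8") as part:
                part_lines = part.read().splitlines(keepends=True)
            # keep the header only from the first chunk
            lines.extend(part_lines if index == 0 else part_lines[1:])
        return "".join(lines).encode("utf-8")
    # default: naive binary concatenation, as the spec currently describes
    chunks = []
    for path in part_paths:
        with open(path, "rb") as part:
            chunks.append(part.read())
    return b"".join(chunks)
```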
@paulgirard Probably. TBH I'm not fully sure whether there are any gotchas in having different meanings of the path property for different resource types. I think the specs need to clarify it first anyway.
@roll @rufuspollock Well, let me know when the specs are clarified then. I can help with the implementation once the plan is clear. Meanwhile I will use groups of resources instead, as I was doing before.
Note: I have to say that I am not very happy with this situation since I already switched to multipart in some of my projects. My bad, I should have checked before. I would recommend updating the header/multipart note on this page: https://frictionlessdata.io/specs/data-resource/
" NOTE: all files in the array MUST be similar in terms of structure, format etc. Implementors MUST be able to concatenate together the files in the simplest way and treat the result as one large file. For tabular data there is the issue of header rows. See the Tabular Data Package spec for more on this. "
I didn't find any info in the Tabular Data Package spec about the header issue. It would be good to clearly state that, for now, multipart implies no headers for tabular resources.
@roll there is no big issue here I think. The existing spec already says that by default you do simple concatenation (see https://frictionlessdata.io/specs/data-resource/). What's missing is the special case for tabular data.
@paulgirard I think we go for option 1 for now, and if there is real demand we add an option to specify that files other than the first one are different (e.g. don't have headers). In short, expanding on what I said above (and you expanded in option 1):
- The default behaviour should be to assume [for tabular data] that all files in parts conform to the specified (or implicit, i.e. default values) CSV dialect spec.
- The default is that there is a header row, and therefore that every file in parts has a header row.
PS: I wrote this before seeing your latest comment (3m ago!). I think you are good to go on implementing this - and I don't even think it is a change in the specs (though I will add a clarification, as it was simply a bug that this was not specified in Tabular Data Resource!)
OK, I'll work on option 1 and try to find a proper way to distinguish tabular resources. Profile looks like the right field even though it's not mandatory... @roll I understand this issue... If you have more suggestions/ideas/recommendations/warnings please share.
I'll let you know in this thread before issuing a PR.
@paulgirard Consider creating an issue for datapackage-py when you can share your findings regarding the implementation. I think there are many hidden things to take into account, e.g. package.save / resource.raw_read / etc. for multipart resources. Currently, it relies on the concept of a byte stream, which makes it work automatically (chunks are encapsulated). Having logical chunks will require changes across the stack.
A good thing is that multipart resources are only implemented in Python as far as I know.
Also, cc @pwalsh as one of the leads of the multipart resources feature, and @akariv. I think this thread is important for all core-team members to review.
@roll just FYI I was the lead of the multipart feature (I was pretty much the lead for all of it :wink:)
I agree this could be a complex change so let's see how it goes ...
@paulgirard how did you get on here?
Sorry for having disappeared. I started my own company two weeks before the French COVID lockdown... such a mess. I still plan to finish the feature by adding the magic + warning features we discussed with @roll in the PR comments: https://github.com/frictionlessdata/datapackage-py/pull/257#pullrequestreview-375804017
I will need a few more weeks before I can finish that up. Feel free to assign someone from your team if you want to go faster, no problem for me!
I'll let you know.
@rufuspollock @roll I finally took the time to finish up this work on multipart. I merged master and added a commit to better handle deprecated header chunks as requested. Let me know if you can reopen the PR or if I should create a new one. Reopening would have the benefit of keeping the previous discussion in the history.
@paulgirard I've reopened :+1: Thanks!
Dear all,
Resources can be grouped to allow splitting data into multiple files that share the same structure. I need this to cut my dataset by the archival sources used: first because it's easier to version, then because it's easier to edit when corrections are needed (no need to open a HUGE CSV), also because validation errors are located in more precise files, and finally because it makes much more sense for a dataset that was created by transcribing many different archival books.
About resource groups, I think it's kind of accepted that for a set of resources gathered in the same group, the schema can be set only on the first resource of the group. Even if it's not set in the specs (there are no specs for groups as far as I know), I've seen code which loads packages that way.
My current issue is that with a package which has a group of resources, if a schema is indicated for all the resources of the group (which are numerous in my case, > 1000), then the lib will load the same schema as many times as there are resources. Which is not ideal... above all if the base path is actually remote... Thus I see two ways out:
The affected datapackage : https://github.com/medialab/ricardo_data/blob/master/datapackage.json
I am about to try the first path by removing the schema from all but the first grouped resource. Changing datapackage-js to add special behaviour for groups should not be that hard.
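For illustration, one way to avoid the repeated remote loads from the package side is to fetch the shared schema once and inline it into each resource descriptor before handing the package to the lib (a rough Python sketch of the idea; the same could be done in datapackage-js):

```python
# Rough sketch of a workaround: fetch the shared schema once and inline it
# into every resource descriptor of the group, so the lib does not re-download
# it once per resource (>1000 times in my case).
import json
import urllib.request

with open("datapackage.json", encoding="utf-8") as f:
    descriptor = json.load(f)

schema_cache = {}
for resource in descriptor["resources"]:
    schema = resource.get("schema")
    if isinstance(schema, str):                      # schema given as a path/URL
        if schema not in schema_cache:
            if schema.startswith("http"):
                with urllib.request.urlopen(schema) as response:
                    schema_cache[schema] = json.load(response)
            else:
                with open(schema, encoding="utf-8") as sf:
                    schema_cache[schema] = json.load(sf)
        resource["schema"] = schema_cache[schema]    # inline the parsed schema

# The patched descriptor can then be passed to the datapackage lib directly.
```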