Splitting a single logical resource across multiple files (was: Group of resources + remote schema)

paulgirard commented 4 years ago

Dear all,

Resources can be grouped. To leverage splitting data into multiple files having the same structure. I need this to cut my dataset by archival sources used. First because it's easier to version, then it's easier when editions are needed to correct (not have to open a HUGE CSV), last it makes validations errors located to more precise files and finally it makes much more sense for a dataset which has been created by transcribing many different archival books.

About resources group, I think it's kind of accepted that for a set of resources gathered in one same group, the schema can be set only for the first resource of the group. If not set in specs (there are not specs for group as far as I know), I've seen code which load package that way.

My current issue is that with a package which has a group of resources, if a schema is indicated for all the resources of the group (which are numerous in my case > 1000) then the lib will load the same schema as many times as the number or resources. Which is not ideal... above all if the basepath is actually remote... Thus I see two ways out :

a change my package by removing schema in my grouped resources but the first one
I update the datapackage-js to load only the first schema is resources are in a group

The affected datapackage : https://github.com/medialab/ricardo_data/blob/master/datapackage.json

I am about to try the first path by removing the schema in all but first grouped resource. Changing datapackage-js to add a special behaviour for group should not be that hard.

rufuspollock commented 4 years ago

@paulgirard great question. First some relevant existing issues:

Pattern for organizing (large) datasets into data packages https://github.com/frictionlessdata/specs/issues/620
Chunks / Parts https://github.com/frictionlessdata/specs/issues/228 - this got implemented into the spec see https://frictionlessdata.io/specs/data-resource/#data-in-multiple-files

If i understand it you have what sounds like a single "Logical" resource that is split into multiple physical files.

The options are then:

Use the Parts / Chunks approach
Cache the remote schema to avoid network calls
Move the schema "in data package" but (not in each of the 1000 resources)

paulgirard commented 4 years ago

Thank you ! I thought chunks implied not repeating teh headers which was a no-go for me. If it's possible with keaders, it's nicer solution cause it makes a much lighter datapackage. Note: it's not a bad idea to use group to specify source information. In my case it's ok because my file names are built from the source name.

I'll read over the issues to learn more but I bet chunks might do it.

ps: i've tried to specify the remote schema only for the first resource and it went fine. So that's a safe workaround I guess.

paulgirard commented 4 years ago

Dear @roll @rufuspollock I am trying to use a multipart set-up. I don't understand which spec element I should use to declare that the chunks all have headers. By default the python lib consider that every chunks don't have headers. I couldn't find how to deal with headered chunks in the specification.

I can read the concept and even a warning about headers here : https://frictionlessdata.io/specs/data-resource/ But no more information about multipart and chunks ?

I tried to add a dialect.header = true but that's not it.

Here is my multipart resource:

  {
      "name": "flows",
      "sources": [
        {
          "title": "Trade statistics from various countries archives and research papers, see data/sources.csv for details"
        }
      ],
      "encoding": "utf-8",
      "profile": "tabular-data-resource",
      "title": "Trade flows transcribed from various international trade statistics volumes",
      "format": "csv",
      "schema": "flows_schema.json",
      "dialect": {
        "header": true
      },
      "path": [
        "data/flows/AnnalesDuCommerceExterieurFaitsCommerciaux3eSerieAvisDivers_IndesOrientalesNeerlandaises_Fc2.csv",
        "data/flows/ForeignCommerceAndNavigationOfTheUnitedStates_191112.csv",
        "data/flows/AnnalesDuCommerceExterieurFaitsCommerciaux3eSerieAvisDivers_OceanieEtAustralie_Fc7.csv",
        "data/flows/ForeignCommerceAndNavigationOfTheUnitedStates_1938.csv",
]}

rufuspollock commented 4 years ago

@paulgirard great question 👍 I think this is under-specified in the spec 😉This was actually discussed at some length in original speccing here https://github.com/frictionlessdata/specs/issues/228

The recommendation at that time was to do something explicit but this did not make it into the spec - i think we probably want something. My sense is that:

default behaviour should be to assume that all files have a single header row (or rather that all files conform to csv dialect spec which includes option of 0 or 1 header rows)
- Aside: i think it would be nice to add arbitrary header rows amount to csv dialect but that's something else
add an explicit option like noChunkHeaders ...

Here's the thread summarized:

https://github.com/frictionlessdata/specs/issues/228#issuecomment-177621157

What do we about repeated headers e.g. in CSV files? Not so sure. Feel this is properly a separate option e.g. stripChunkHeaders which mean you should take first line of all but first chunk and throw it away (first line being first thing up to \n or \r\n. Thoughts welcome

https://github.com/frictionlessdata/specs/issues/228#issuecomment-177700695

It's easy to determine whether a header is repeated or not. I don't think we need an explicit property. It can be done automatically.

https://github.com/frictionlessdata/specs/issues/228#issuecomment-179098800

My feeling is to default to stripping the first line in subsequent files the way csvstack does. After all, what is a CSV file with no header :)? In that case, of course, we would need a csvsplit utility.

Rufus' final summary

Stripping headers: I think we want an explicit boolean property so that is not something left to guessing by external tools (though such tools are welcome to check / guess)

...

The one remaining question is what to do with headers:

https://github.com/onyxfish/csvkit/issues/188 - csvkit supports --no-header now

Postgres COPY command as used in Redshift has "ignoreheader" as a number which is number of rows to skip: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-ignoreheader

My sense here is that in Data Package behaviour is naive: simple concatenation. In Tabular Data Package where there could be a concept of skipInitialRows or headerRow etc - see #326

paulgirard commented 4 years ago

I propose to prepare a PR about this starting with python libs.

But let's first agree on a solution. Important point about this choice is that current implementations consider there are no headers in chunks. Changing the default to headers on would break this behaviour!

Based on existing discussion summed up by @rufuspollock stated, I think we have three solutions:

solution 1 - chunks are CSV after all

Multipart's chunks are considered to have a header even if it's a chunk of a multipart resource. Reading chunks should then check header compliance before concatenating. We might want to introduce a "no-header" option to multipart resource which would change this. Actually the dialect can already specify no to the header option.

"dialect": {
        "header": true
},

Should we just use this to define chunks concatenation behaviour? If we do so I feel like the very first chunk should follow the same rule as others. If not I prefer solution 2. If we take this path we IMHO don't have choice but to reverse default to with-header since we stick to CSV spec.

solution 2 - specific header option for multipart

In case the no-header in multipart situations should stay at the current spec (ie. only the first chunk has header), I would prefer to add a multipart specific option in resource like noHeadersInChunks:boolean. If we take this path, still have to decide what's the default. Since the feeling of this option is what we called multipart and chunks are specific, I think keeping default to noHeadersInChunks:true is quite ok not to disturb existing packages.

solution 3 - magic default

Since we have the specification of what a header line must be, we could test first line of every chunk is it's a header one or not. I generally don't like this magical behaviour but :

we have a spec so this magic is very close to an identity test (test already implemented)
it would make life easier by not breaking current behaviour and work seamlessly for people with headered-chunks.

I need this headered-chunks behaviour and I can take time to implement it at least in Python lib. @rufuspollock @roll please let me know what decision you take about the specification.

IMHO having chunks where only the first one have header doesn't feel right. I understand dependency to the path issue though. Solution one with default to true feels like the best one to me because it's not magic, it's coherent CSV-wise and people who would have to migrate could do it by only removing the header line in first chunk and adding dialect.header=false in spec. Solution 3 is appealing but looks to close to the dangerous dont-try-to-be-too-clever zone. Solution 2 is enforcing that multipart and chunk are specific situation with specific behaviour.

Let me know. Would be a pleasure to PR any of the solution mentioned here.

roll commented 4 years ago

I think before any actions are taken it should be clarified on the specs level.

At the moment multipart resource is a part of the Resource spec it means that it's applicable to all possible file types.

For example 1:

path:
- picture100kb-1.jpeg
- picture100kb-2.jpeg

An implementation should concatenate it on a binary level to get picture.jpeg.

Not let's say we have example 2:

csv-chunk-1

header1, header2, header3
value1,value2,value3
...
value1, val

csv-chunk-2

ue2,value3
...
value1,value2,value3

Our implementation is going to do the same as with the picture (binary level concatenation) and we will get proper table.csv. The second part of the file is not even a tabular file, it's not properly parsable (it's a chunk of a big tabular file).

Let's have example 3:

csv-chunk-1

header1, header2, header3
...

csv-chunk-2

header1, header2, header3
value1,value2,value3
...

Not we can't concatenate it as binary chunks because there is the headers row. So we need to parse as a table both chunks. I don't see how an implementation should understand it (binary concatenation vs logical concatenation)

paulgirard commented 4 years ago

What about adapting the concatenation implementation when the resource has a tabular-data-resource profile ?

roll commented 4 years ago

@paulgirard Do you mean making it a special case?

paulgirard commented 4 years ago

Yes. tabular-data-resource => logical concatenation (i.e. handling header row) leaving other cases on binary level. That would not shock me architecturally speaking.

roll commented 4 years ago

@paulgirard Probably. TBH not fully sure wherever there are any gotchas for having the different meanings of the path property for different resource types. I think the specs need to clarify it first anyway

paulgirard commented 4 years ago

@roll @rufuspollock Well, let me know when the specs are clarified then. I can help for the implementation once the plan is clear. Meanwhile I will use group of resources instead as I was doing before.

Note: I have to say that I am not very happy with this situation since I already switched to multipart in part of my projects. My bad I should have checked before. I would recommend Updating the header/multipart issue in this page : https://frictionlessdata.io/specs/data-resource/

" NOTE: all files in the array MUST be similar in terms of structure, format etc. Implementors MUST be able to concatenate together the files in the simplest way and treat the result as one large file. For tabular data there is the issue of header rows. See the Tabular Data Package spec for more on this. "

I didn't find any info in the tabular data package spec about the header issue. Would be good to clearly state that multipart implies no headers for tabular resource for now.

rufuspollock commented 4 years ago

@roll there is no big issue here i think. The existing spec already says by default you do simple concatenation (see https://frictionlessdata.io/specs/data-resource/). What's missing is the special case for tabular data.

@paulgirard I think we go for option 1 for now and if there is real demand we add option to specify that files other than first one are different (e.g. don't have headers). In short, expanding on what I said above (and you expanded in option 1):

default behaviour should be to assume [for tabular data] that all files in parts conform to specified (or implict i.e. default values) csv dialect spec.

Default is that there is a header row and therefore that every file in parts has a header row.

PS: I wrote this before seeing your latest comment (3m ago!). I think you are good to go on implementing this - and i don't even think it is a change in the specs (though i will add clarification as it was simply a bug that that this was not specified in tabular data resource!)

paulgirard commented 4 years ago

Ok I'll work on option 1 and try to find a proper way to distinguish tabular resource. Profile looks like the good one even though it's not a mandatory field... @roll I understand this issue... If you have more suggestion/ideas/recommendations/warnings please share.

I'll let you know in this thread before issuing a PR.

roll commented 4 years ago

@paulgirard Consider creating an issue for datapackage-py when you can share your findings regarding the implementation. I think there are many hidden things to take into account e.g. package.save/resource.raw_read/etc for multipart-resource etc. Currently, it relies on the concept of a byte stream making it work automatically (chunks are encapsulated). And having logical chunks will require the following changes to the stack.

A good thing is that multipart resources are only implemented in Python as far as I know.

Also, cc @pwalsh as one of the leads of the multipart resources feature and @akariv. I think this thread is important to review by all core-team members.

rufuspollock commented 4 years ago

@roll just FYI i was the lead of the multipart feature (i was pretty much the lead for all of it :wink:)

I agree this could be a complex change so let's see how it goes ...

rufuspollock commented 4 years ago

@paulgirard how did you get on here?

paulgirard commented 4 years ago

Sorry for having disappeared. I started my own company two weeks before french COVID lockdown.. such a mess. I still plan to finish up the feature by adding the magic + warning features we discussed with @roll on the PR comments : https://github.com/frictionlessdata/datapackage-py/pull/257#pullrequestreview-375804017

I will need a few more weeks before I can finish that up. Feel free to assign some of your team if you want to go faster no prb for me !

I'll let you know.

paulgirard commented 4 years ago

@rufuspollock @roll I finally took time to finish up this work on multipart. I merged master and added a commit to better handled deprecated header chunks as requested. Let me know if you can reopen the PR or If I should create a new one. Reopening will have the benefit to keep previous discussion in history.

roll commented 4 years ago

@paulgirard I've reopened :+1: Thanks!

frictionlessdata / frictionlessdata.io