CSV does not implicitly copy values from above rows to empty cells

dcmi / dctap

DC Tabular Application Profile

https://dcmi.github.io/dctap/


CSV does not implicitly copy values from above rows to empty cells #26

Closed bencomp closed 3 years ago

bencomp commented 3 years ago

When the specification notes:

Because there is often more than one property for a shape, and because there must be a template row for each property, repeating the shape identifier and label in the profile is optional. It is assumed that all property rows following a row that includes a shape identifier are properties within that shape.

… how would a processor understand to repeat the shape ID and label from the rows above it? This behaviour is not native to CSV. (In OpenRefine, you could use the Fill down transformation to get the intended result.)

kcoyle commented 3 years ago

I'll let Tom, Nishad and Phil, who are writing code to do this, give their answers. I do have in mind some tests for this. I'll try to mock those up shortly.

philbarker commented 3 years ago

Hi @bencomp, the short answer is that the logic has to be programmed into the processor. The code we have written so far parses CSVs sequentially by row from the top (using Python's built-in csv module), so this is just a matter of keeping track of the latest value for shapeID. I expect there will be situations where processing would not be sequential, and so this might be more problematic. I have wondered about a macro to process a TAP directly in Google Sheets or Excel. That's not a type of scripting that I have much experience with, but maybe then it would be harder, requiring some pre-processing like the fill-down transformation in OpenRefine.
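As a rough illustration of "keeping track of the latest value for shapeID", here is a minimal sketch using Python's csv module. It is not the project's actual code: the function name `fill_down_shapes`, the sample data, and the choice to carry `shapeLabel` along with `shapeID` are assumptions made for the example.

```python
import csv
import io

# Hypothetical sample profile: the second row leaves the shape cells blank.
SAMPLE = """shapeID,shapeLabel,propertyID,propertyLabel
:book,Book,dct:title,Title
,,dct:creator,Author
:person,Person,foaf:name,Name
"""


def fill_down_shapes(csv_text, carry=("shapeID", "shapeLabel")):
    """Read rows top to bottom, copying shape cells from the nearest shape row above."""
    latest = {}  # values from the most recent row that named a shape
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (row.get("shapeID") or "").strip():
            # This row starts a new shape: remember its shape-level values.
            latest = {col: (row.get(col) or "").strip() for col in carry}
        else:
            # Blank shapeID: assume the row belongs to the shape seen last.
            for col in carry:
                row[col] = latest.get(col, "")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for r in fill_down_shapes(SAMPLE):
        print(r["shapeID"], r["propertyID"])
```

The sketch only works because rows are visited in file order from the top; a processor that reorders or splits rows would need the cells filled in beforehand.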

Did you have any specific processing scenario in mind?

tombaker commented 3 years ago

@philbarker

situations where processing would not be sequential

In all of the use cases we have considered, processing would be sequential, so it would be great if we could say that this is the default.

I cannot think of cases where it would not be sequential, though that might just be my lack of imagination. If there were such cases, they would have to be pretty common, in my opinion, to justify changing the more "obvious" default.

philbarker commented 3 years ago

@tombaker OK, fair point. It would have been better if I had been more specific and said "sequential starting at the top", because it is the assumption that processing starts at the top that is probably more questionable once you go beyond imperative programming. But I think we can say that top-down preprocessing might be a necessary step before the statements in a profile can be considered as independent constraints.

tombaker commented 3 years ago

@philbarker

top-down preprocessing might be a necessary step

Is that any different from saying that "top-down processing might be a necessary first step"? The "pre" makes me unsure.

philbarker commented 3 years ago

I think sometimes it would be different. Using something like the OpenRefine fill down transformation might be a separate manual step that needs doing before any macros or other code that assume each row is self-contained will work.

bencomp commented 3 years ago

Thanks for the comments, all. In Python I tend to use Pandas to load CSVs, and Pandas has optimised ways to process rows in parallel. This of course requires that, like in relational databases, rows are independent of each other. I wasn't planning on processing any TAP using Pandas. Maybe it was just my experience teaching Data organisation in spreadsheets kicking in.
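For anyone who does load a TAP with Pandas, the fill-down step mentioned earlier in the thread could be sketched with a forward fill, after which each row is self-contained. This is only an illustration under assumptions: the function name `fill_down` and the sample data are made up, and a plain forward fill is only safe if every row that names a shape also fills in its label.

```python
import io

import pandas as pd

# Hypothetical sample profile: the second row leaves the shape cells blank.
SAMPLE = """shapeID,shapeLabel,propertyID,propertyLabel
:book,Book,dct:title,Title
,,dct:creator,Author
:person,Person,foaf:name,Name
"""


def fill_down(csv_text, cols=("shapeID", "shapeLabel")):
    """Load a profile and forward-fill the shape columns so rows stand alone."""
    # Empty cells are read as NaN, which is what ffill() fills.
    df = pd.read_csv(io.StringIO(csv_text), dtype=str)
    cols = list(cols)
    # Caveat: ffill would also copy a previous shape's label into a new
    # shape row whose label was left blank on purpose.
    df[cols] = df[cols].ffill()
    return df


if __name__ == "__main__":
    print(fill_down(SAMPLE))
```

Once the shape columns are filled, rows no longer depend on their position in the file, so row-wise or parallel operations can treat them independently.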

kcoyle commented 3 years ago

I think we've come to a comfort level with the answers here. @bencomp are you ok if we close this? Thanks.

kcoyle commented 3 years ago

Agreeing to close, although we should provide examples with all cells filled in.