dcmi / dctap

DC Tabular Application Profile
https://dcmi.github.io/dctap/

Transform3: TAP shape/statement structure #30

Closed kcoyle closed 3 years ago

kcoyle commented 3 years ago
tombaker commented 3 years ago

@kcoyle

If no shape, ok, but CSV must have at least one propertyID (OK) test

Done

For each shape, shape must have at least one propertyID

Hm - we do not want to tolerate empty shapes?

A shape must be on contiguous rows - same shapeID cannot be used with intervening shapeIDs (ERROR) test

I disagree with this. I think we should anticipate scenarios in which CSV rows are streaming in in an unordered way.

error message - continue? or drop shapes that are repeated erroneously, with their properties? or end?

Or simply aggregate the statement constraints under the shapes, even if they are ingested out of sequence.

ShapeID cell blank after ShapeID filled in (OK) test

Done

Mix of blank cells and filled in cells for shape (OK) test

On the basis of yesterday's discussion, I suggest we not try to consider different shape labels, in the absence of shape identifiers, as stand-ins for different shapes. In the absence of any shape identifier (eg, no 'shapeID' column), the inferred shape should be assigned the (configurable) default identifier. If subsequent shape labels are encountered, they could either be ignored, or they could "clobber" (replace) the shape label previously assigned. Ultimately, I think that how we handle this matters less than providing the DCTAP author with enough feedback in the form of good warnings for them to spot the problem and fix it.
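The aggregation behaviour suggested in these replies could be sketched roughly as follows. This is an illustration only, not the dctap-python implementation: the function name `rows_to_shapes`, the `default_shape_id` parameter, and its value `"default"` are assumptions made for the example. Rows are grouped under their shapeID even when they arrive out of sequence, a blank shapeID cell inherits the most recent shapeID, and a later shape label replaces ("clobbers") an earlier one with a warning.

```python
import csv
import io


def rows_to_shapes(csv_text, default_shape_id="default"):
    """Group statement constraints by shapeID, tolerating out-of-order rows.

    Returns (shapes, warnings). `shapes` maps each shapeID to a dict with
    a "shapeLabel" string and a list of "statements".
    """
    shapes = {}
    warnings = []
    current_id = None
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        shape_id = (row.get("shapeID") or "").strip()
        label = (row.get("shapeLabel") or "").strip()
        property_id = (row.get("propertyID") or "").strip()

        if shape_id:
            current_id = shape_id          # explicit shapeID starts or resumes a shape
        elif current_id is None:
            current_id = default_shape_id  # no shapeID seen yet: use the default

        shape = shapes.setdefault(current_id, {"shapeLabel": "", "statements": []})

        if label:
            if shape["shapeLabel"] and shape["shapeLabel"] != label:
                warnings.append(
                    f"Row {row_number}: label {label!r} replaces "
                    f"{shape['shapeLabel']!r} on shape {current_id!r}."
                )
            shape["shapeLabel"] = label    # later labels clobber earlier ones

        if property_id:
            shape["statements"].append({"propertyID": property_id})

    if not any(shape["statements"] for shape in shapes.values()):
        warnings.append("CSV contains no propertyID.")
    return shapes, warnings
```

With this approach a repeated shapeID is not an error: its statements are simply appended to the shape that already exists.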

philbarker commented 3 years ago

For each shape, shape must have at least one propertyID

Hm - we do not want to tolerate empty shapes?

I think this depends on an issue that we have ducked (as far as I recall) so far: whether shapes are open or closed. Does an empty shape mean the instance data may have no properties, or that there are no constraints on what properties it may have (i.e. anything goes)?

tombaker commented 3 years ago

That's an interesting question but I don't think we should even try to answer it because the semantics of DCTAP are informal.

Any more rigor needs to be supplied by ShEx or SHACL or the like. To me that's not a bug but a feature. Going much further would lead us off into the weeds.

That said, I'd be in favor of adding an optional column for closed or not, as long as we do not try to nail that down too precisely.

kcoyle commented 3 years ago

A shape must be on contiguous rows

I can see your point, @tombaker. I do worry, though, that because we allow blank shapeID cells to default to the nearest shapeID above, mixing up shape statements in the table can be dangerous. Perhaps what we should do is recommend, in the documentation, that shapes be on contiguous rows to avoid ambiguity or mistakenly assigning a statement to a shape. Kind of a "best practice", easier to read and comprehend.
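A "best practice" check of this kind could be a warning rather than an error. The sketch below is hypothetical (the function name `noncontiguous_shape_warnings` and the message wording are invented for illustration, and it is not taken from dctap-python): it flags any shapeID that reappears after another shape has started, since blank shapeID cells beneath it would then attach to the resumed shape.

```python
def noncontiguous_shape_warnings(shape_ids):
    """Return warnings for shapeIDs that resume after another shape began.

    `shape_ids` is the shapeID column in row order; blank cells are "".
    """
    warnings = []
    finished = set()   # shapes that were followed by a different shapeID
    current = None
    for row_number, shape_id in enumerate(shape_ids, start=2):  # row 1 is the header
        shape_id = shape_id.strip()
        if not shape_id or shape_id == current:
            continue   # blank cell or same shape: still contiguous
        if shape_id in finished:
            warnings.append(
                f"Row {row_number}: shape {shape_id!r} resumes after other "
                "shapes; blank shapeID cells below will attach to it."
            )
        if current is not None:
            finished.add(current)
        current = shape_id
    return warnings
```

For example, the column `[":a", "", ":b", ":a", ""]` produces one warning for row 5, while `[":a", "", ":b"]` produces none.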

kcoyle commented 3 years ago

In the absence of any shape identifier (eg, no 'shapeID' column), the inferred shape should be assigned the (configurable) default identifier. If subsequent shape labels are encountered, they could either be ignored, or they could "clobber" (replace) the shape label previously assigned.

@tombaker I included this situation based on my experience with library data where there has been no distinction between display strings and identifiers. Said another way, the display strings have been considered sufficient as identifiers. This means that there are folks who have been trained to see labels as identifiers, and who do not have readily at hand a related ID. I think this goes also for propertyIDs and propertyLabels.

While it may not be desirable to forge IDs out of labels, could there at least be a warning message that labels were found but there was no related identifier?
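Such a warning might look like the following sketch. It assumes rows are available as dicts keyed by column header (for example from `csv.DictReader`); the function name `label_without_id_warnings` and the message text are made up for this example and are not part of dctap-python.

```python
def label_without_id_warnings(rows):
    """Warn when a row supplies a label but no matching identifier.

    `rows` is an iterable of dicts keyed by column header.
    """
    pairs = (("shapeLabel", "shapeID"), ("propertyLabel", "propertyID"))
    warnings = []
    for row_number, row in enumerate(rows, start=2):  # row 1 is the header
        for label_key, id_key in pairs:
            label = (row.get(label_key) or "").strip()
            identifier = (row.get(id_key) or "").strip()
            if label and not identifier:
                warnings.append(
                    f"Row {row_number}: {label_key} {label!r} has no {id_key}."
                )
    return warnings
```

No identifier is forged from the label; the author is only told where a label stands alone.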

That said, I don't think that this is necessary for our ld4 deadline.

philbarker commented 3 years ago

That said, I'd be in favor of adding an optional column for closed or not, as long as we do not try to nail that down too precisely.

(apologies for the number of side issues I am introducing here, but) I'm increasingly of the opinion that we should have separate tables for property constraints and shape descriptions. Most of what is currently in a DCTAP is about property constraints (including which shape the property constraint is associated with). There's a lot that can be said about Shapes; currently we only have space for a Label, but just from working with SHACL I've seen the need for description, whether it is open/closed (in SHACL terms) and association with entity type/property value. Tom mentioned elsewhere that the rows in the DCTAP are getting uncomfortably long even with just the column headings, so I think it would be infeasible to add more columns; besides one table about property constraints and another about shape descriptions seems a clean approach. When working with spreadsheet software it has the advantage that the shapeIDs in the Shape Description sheet can provide data validation for the values used in the Property Constraints sheet. For an example see the "tap" and "shapes" tabs in the Credential Engine AP that I did in Google Sheets.
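The cross-sheet validation described here can also be done outside spreadsheet software. The sketch below assumes both tables have been read into lists of dicts sharing a `shapeID` column; the function name `undeclared_shape_ids` and the two-table layout are hypothetical and not defined by DCTAP.

```python
def undeclared_shape_ids(shape_rows, constraint_rows):
    """List shapeIDs used in the property constraints table but missing
    from the shape descriptions table."""
    declared = {(row.get("shapeID") or "").strip() for row in shape_rows} - {""}
    used = {(row.get("shapeID") or "").strip() for row in constraint_rows} - {""}
    return sorted(used - declared)
```

Given a shapes table that declares only `:book` and a constraints table that uses `:book` and `:author`, the function returns `[":author"]`.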

kcoyle commented 3 years ago

@philbarker I think this is interesting, so I'm opening a separate issue for it. I also have ideas for separate tables ...

tombaker commented 3 years ago

+1

On Thu, Jun 10, 2021, 18:58 Karen Coyle wrote:

A shape must be on contiguous rows

I can see your point, @tombaker. I do worry, though, that because we allow blank shapeID cells to default to the nearest shapeID above, mixing up shape statements in the table can be dangerous. Perhaps what we should do is recommend, in the documentation, that shapes be on contiguous rows to avoid ambiguity or mistakenly assigning a statement to a shape. Kind of a "best practice", easier to read and comprehend.


tombaker commented 3 years ago

@philbarker If we embrace the idea that a DCTAP can be expressed in a tabbed spreadsheet (and not just a single two-dimensional CSV), then I agree that shape descriptions and statement constraint (i.e., property constraint) descriptions could be put onto separate tabs.

I think it would be infeasible to add more columns

I would not however want us to leave the "single CSV" model behind. It may not be a good idea to pack too much information into horizontal rows, but doing so is not inherently infeasible. One could define a handful of shape elements (eg, "closed", "start"...) and still fit them into a row.

one table about property constraints and another about shape descriptions seems a clean approach.

I agree that addressing the support of tabbed spreadsheets is a good idea. People will do this anyway, so we might as well say so and provide some examples. A single tabbed spreadsheet could hold all of the information we have said should be in a manifest or configuration file, such as prefixes, defaults, extended sets of supported value constraint types, and separate descriptions for shapes and for statement constraints.

I guess this would break interoperability between DCTAP instances, but I have never really believed that interoperability of CSVs on the basis of DCTAP was all that realistic, or even desirable as a goal.

kcoyle commented 3 years ago

See dctap-python documentation for final resolution.