edi3 / edi3-json-ld-ndr


representation and linking of Code lists #11

Open onthebreeze opened 4 years ago

onthebreeze commented 4 years ago

The supply chain reference data model is full of properties that have enumerated value domains. Country codes (e.g. "AU" = "Australia") and units of measure (e.g. "KG" = "Kilogram") are examples that we can all relate to, but there are many other critical codes such as Incoterms and LOCODEs. Taken together, these are as important semantically as the reference data model itself.

So - two questions:

  1. How best to publish these codes? One obvious example is https://edi3.org/specs/edi3-codelists/develop/specification/, with an example at https://codelists.api.edi3.org/recommendation-20/unitOfMeasure. This is a JSON Schema / API representation, which is perfectly acceptable.
  2. How best to link these value domains (aka code lists) to the vocabulary property terms that reference them? E.g. in the snippet below from https://edi3.org/vocab/bsp.jsonld, there is rdfs:range: "xsd:token". Could that, or should that, instead reference a URI of the country code list? If so, what would it look like?
{
  "@id": "edi3:BirthAddressCountryCode",
  "@type": "rdfs:Property",
  "rdfs:range": "xsd:token",
  "rdfs:comment": "The identifier of a country for this birth address.",
  "rdfs:domain": "edi3:BirthAddress",
  "rdfs:label": "Birth_ Address. Country. Identifier",
  "edi3:cefactID": "UN01003172"
},
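As a rough illustration of what question 2 is asking for, the sketch below swaps the rdfs:range of the property for a code-list URI and validates a value against that list. The URI https://example.org/codelists/country, the CODE_LISTS table and the helper is_valid_value are all hypothetical; nothing here is taken from the edi3 vocabulary.

```python
# Hypothetical code list registry keyed by URI (illustrative members only).
CODE_LISTS = {
    "https://example.org/codelists/country": {"AU": "Australia", "NZ": "New Zealand"},
}

# The property from the snippet above, with rdfs:range pointing at the
# assumed code-list URI instead of "xsd:token".
PROPERTY = {
    "@id": "edi3:BirthAddressCountryCode",
    "rdfs:range": "https://example.org/codelists/country",
    "rdfs:domain": "edi3:BirthAddress",
}


def is_valid_value(prop: dict, value: str) -> bool:
    """Return True if value belongs to the code list named by rdfs:range."""
    members = CODE_LISTS.get(prop["rdfs:range"])
    if members is None:
        # The range is not a known code list (e.g. a plain datatype): accept.
        return True
    return value in members


print(is_valid_value(PROPERTY, "AU"))
print(is_valid_value(PROPERTY, "XX"))
```

With a range like this, a consumer can discover the value domain from the vocabulary alone, which a bare "xsd:token" does not allow.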
Fak3 commented 4 years ago

The best approach will vary. Taking the ISO country code as an example, it is a flat set with a small number of members, each a 2-letter code that is relatively easy for a human to recognize. Country codes are rarely added or removed; when that happens it is the only major challenge for digital data exchange, and it is not directly related to how the codes are represented in a JSON-LD payload. So I can't immediately see any significant disadvantage in a simple 2-letter string representation.

As a different example, the UNECE Rec 21 codes for types of cargo seem to conflate multiple cargo properties such as package shape, size, material and fragility into one flat list of completely arbitrary 2-letter codes, which is hard for a human to understand and hard to process in application business logic. I believe it would be much easier to use if it were divided into a few code lists of distinct cargo package attributes, so that cargo data could be represented with JSON(-LD) like this:

{
 "edi3:package": {
   "@type": [ "rec21:BasePackage", "rec21:Flexibag" ],
   "edi3:material": [ "rec21:steel", "rec21:plastic" ],
   "edi3:fragilityClass": "rec21:FG0"
 }
}

The properties I used in the example above may not be the most appropriate; it is only meant to demonstrate that data structured like this is a lot easier to consume, comprehend and implement in application business logic.
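To make the "easier to implement in business logic" point concrete, here is a minimal Python sketch that consumes the package object above. The helper names as_list and has_material are made up for illustration, and the property names are the tentative ones from the example.

```python
# The inner package object from the example above.
package = {
    "@type": ["rec21:BasePackage", "rec21:Flexibag"],
    "edi3:material": ["rec21:steel", "rec21:plastic"],
    "edi3:fragilityClass": "rec21:FG0",
}


def as_list(value):
    """JSON-LD allows a single value or an array; normalise to a list."""
    return value if isinstance(value, list) else [value]


def has_material(pkg: dict, material: str) -> bool:
    """Check one package attribute without decoding an opaque 2-letter code."""
    return material in as_list(pkg.get("edi3:material", []))


print(has_material(package, "rec21:steel"))
print("rec21:Flexibag" in as_list(package["@type"]))
```

Each attribute can be tested on its own, whereas a single flat Rec 21 code would first have to be looked up and decomposed.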

Realizing that such an overhaul of the UNECE code lists will not happen soon, the existing ones could probably be helped by adding an HTTP URL for each code list member, e.g. http://unece.org/codelists/rec21#FE, or maybe even a more human-readable one like http://unece.org/codelists/rec21#Case_with_pallet_base_cardboard. Dereferencing this URL in a web browser should result in an HTML page describing this code list member. Dereferencing this URL with the HTTP header Accept: application/ld+json should result in a machine-readable representation of this code list in flattened-graph JSON-LD form:

{
 "@context": {
   "edi3": "https://edi3.org/vocab#",
   "rec21": "https://unece.org/codelists/rec21#"
  },
 "@graph": [
   {
     "@id": "rec21:Case_with_pallet_base_cardboard",
     "@type": "edi3:UNECERec21Code",
     "rdfs:comment": "Case, with pallet base, cardboard",
     "rdf:value": "EF"
   },
   ...
 ]
}
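A client for the scheme described above might look roughly like the following sketch. It only prepares the request (nothing is sent, since the unece.org URL is a proposal, not a live endpoint) and indexes a sample flattened graph by @id. The function names and the SAMPLE document are assumptions for illustration. Note that the fragment after # is not transmitted to the server, so such a request would fetch the whole code list document.

```python
import json
import urllib.request


def build_request(member_url: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request asking for JSON-LD."""
    return urllib.request.Request(
        member_url, headers={"Accept": "application/ld+json"}
    )


def index_graph(document: str) -> dict:
    """Index the nodes of a flattened JSON-LD document by their @id."""
    return {node["@id"]: node for node in json.loads(document)["@graph"]}


# Sample response body, shaped like the example above (one member only).
SAMPLE = """
{
  "@context": {"rec21": "https://unece.org/codelists/rec21#"},
  "@graph": [
    {
      "@id": "rec21:Case_with_pallet_base_cardboard",
      "rdfs:comment": "Case, with pallet base, cardboard",
      "rdf:value": "EF"
    }
  ]
}
"""

request = build_request("http://unece.org/codelists/rec21#FE")
print(request.get_header("Accept"))

nodes = index_graph(SAMPLE)
print(nodes["rec21:Case_with_pallet_base_cardboard"]["rdf:value"])
```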
onthebreeze commented 4 years ago

Yes, it's true that some of the UN code lists are a confused mish-mash of codes that describe different properties (package codes and status codes are good examples). Fixing that is a job for a governance cycle in a later phase. Some are OK, like the units of measure codes. For now let's focus on how to represent a code list, whether it is semantically good or bad.

Some further comments & questions:

Units of Measure

CommonCode: "28",
ConversionFactor: "kg/m²",
Description: "",
Level/Category: "1",
Name: "kilogram per square metre",
Status: "",
Symbol: "kg/m²"

Can we see some examples of how to handle arbitrary properties like "symbol" in the UOM codes, and also hierarchies like those in the WTO tariff codes?

Fak3 commented 4 years ago

Using the description as the @id, as per "@id": "rec21:Case_with_pallet_base_cardboard", could be a bit problematic because it implies some governance control at the description level instead of the code level. Is the reason you don't like "@id": "rec21:EF" that it is meaningless without reference to the description?

I would prefer to see an identifier in the data that I can make some sense of without having to consult any additional resource. I'm not sure about the governance of Rec 21 descriptions, but if there is already a policy to keep descriptions short (which there seems to be), then each one only needs to be made unique, which could be achieved by appending the actual unique 2-character code to the end of the identifier: rec21:Case_with_pallet_base_cardboard_EF.
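The naming rule suggested here (readable description plus the unique code as a suffix) can be sketched in a few lines of Python. The functions member_id and code_of are hypothetical helpers, and the slug rule (keep only letters and digits, join with underscores) is an assumption rather than anything Rec 21 defines.

```python
import re


def member_id(prefix: str, description: str, code: str) -> str:
    """Build a readable identifier and append the code to keep it unique."""
    words = re.findall(r"[A-Za-z0-9]+", description)
    return f"{prefix}:{'_'.join(words)}_{code}"


def code_of(identifier: str) -> str:
    """Recover the original code from the identifier suffix."""
    return identifier.rsplit("_", 1)[1]


ident = member_id("rec21", "Case, with pallet base, cardboard", "EF")
print(ident)           # rec21:Case_with_pallet_base_cardboard_EF
print(code_of(ident))  # EF
```

Because the code stays recoverable from the suffix, legacy systems that only know "EF" can still be mapped without a lookup table.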

How do you think we should handle codes that have multiple properties, often specific to the code list?

{
  "@id": "rec20:kilogram_per_square_meter",
  "@type": "edi3:NormativeUnit",
  "rdfs:label": "kilogram per square metre",
  "rdfs:comment": "Unit of surface density, areic mass",
  "edi3:uneceRec20Code": "28",
  "edi3:conversionFactor": "kg/m²",
  "edi3:unitSymbol": "kg/m²"
}

Level/Category, which actually indicates normative status, should probably go into the @type: 1 == NormativeUnit, 2 == NormativeEquivalentUnit, 3 == InformativeUnit, all of which are subclasses of edi3:MeasurementUnit.
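The mapping from a Rec 20 row to a JSON-LD node, including the Level/Category to @type rule, could be sketched as below. The function to_node and the way the @id is derived from the Name field are assumptions made for this example; the term names follow the node shown above.

```python
# Level/Category value -> class, as proposed in the comment above.
LEVEL_TO_TYPE = {
    "1": "edi3:NormativeUnit",
    "2": "edi3:NormativeEquivalentUnit",
    "3": "edi3:InformativeUnit",
}

# Optional source columns and the (tentative) terms they map to.
OPTIONAL_TERMS = (
    ("ConversionFactor", "edi3:conversionFactor"),
    ("Symbol", "edi3:unitSymbol"),
)


def to_node(row: dict) -> dict:
    """Convert one code-list row into a JSON-LD node."""
    node = {
        # Assumed rule: derive the identifier from the unit name.
        "@id": "rec20:" + row["Name"].replace(" ", "_"),
        "@type": LEVEL_TO_TYPE[row["Level/Category"]],
        "rdfs:label": row["Name"],
        "edi3:uneceRec20Code": row["CommonCode"],
    }
    for column, term in OPTIONAL_TERMS:
        if row.get(column):  # skip empty or missing columns
            node[term] = row[column]
    return node


row = {
    "CommonCode": "28",
    "ConversionFactor": "kg/m²",
    "Description": "",
    "Level/Category": "1",
    "Name": "kilogram per square metre",
    "Status": "",
    "Symbol": "kg/m²",
}
node = to_node(row)
print(node["@type"])
```

Code-list-specific properties such as the symbol simply become extra terms on the node, so no single fixed schema is needed across code lists.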

Some code lists are hierarchical, for example the WCO Harmonized System (tariff codes), which is very important in international trade and used in many places. See http://tariffdata.wto.org/ReportersAndProducts.aspx. Note that "10" is "Cereals", "10.06" is "rice" and "10.06.20" is "brown rice". In one sense the list is flat because each code is unique, but there is a logical hierarchy encoded in the codes. How best to represent this kind of thing?

The Harmonized System is broken just like UNECE Rec 21, but at a larger scale. It mixes materials, practical applications, size, shape, environmental threats and lots of other attributes into a tree, which has such domain-mixing misconceptions at all its levels. It seems pointless to model it directly with a hierarchy of rdfs classes. So I believe we can treat it as a flat list, just like rec20 in the example above.
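Treating the list as flat does not lose the hierarchy, because it can be recomputed from the code itself. The sketch below assumes the dotted notation used in the question ("10", "10.06", "10.06.20"); the HS table holds only the three entries quoted there, and the helper names are hypothetical.

```python
# Flat code list: every code is a unique key, no nesting.
HS = {
    "10": "Cereals",
    "10.06": "Rice",
    "10.06.20": "Brown rice",
}


def parent_code(code: str):
    """Return the enclosing code, or None for a top-level chapter."""
    head, sep, _tail = code.rpartition(".")
    return head if sep else None


def ancestors(code: str) -> list:
    """Walk up the implied hierarchy, nearest parent first."""
    chain = []
    parent = parent_code(code)
    while parent is not None:
        chain.append(parent)
        parent = parent_code(parent)
    return chain


print(ancestors("10.06.20"))                   # ['10.06', '10']
print([HS[c] for c in ancestors("10.06.20")])  # ['Rice', 'Cereals']
```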

AP-G commented 3 years ago

I'd like to add one more dimension to this discussion, as it is very relevant in some projects I am dealing with currently: The CEFACT RDM specifies the code list type to include several attributes like

And for IDs e.g.

The current NDR omits all of them. I understood your argument that they are no longer needed, as a URN is used for identification. From my point of view this makes perfect sense. But it leads to some consequences:

But then the documentation must define this, and the UN code lists have to be aligned. So the rdf:value would be reduced to a "historical documentation purpose" and a mapping aid for legacy systems, and the @id would be the relevant part to use.

The tricky part starts then with the use of non-CEFACT code lists and globally specified IDs.

To give you two examples:

  1. There exists a property defining the colour of a product. Many different code lists are used to define colours, like RAL, Pantone, ... The choice is often dependent on the industry. As a consequence, transformation rules need to be specified for how to create the correct URN for a RAL code list. As the first part of the URN should be the authority, the best option would be to let the issuing authorities define the transformation. But this is not realistic, I think. Alternatively, a rule could be created that includes the attributes from the RDM in the URN, e.g. unclgeneric:{listID}:{agencyID}:{version}:{Content}

  2. There exists a huge number of globally standardized identifiers. A list of issuers is defined in the ICD list (ISO 6523), with a bit more than 200 entries. For instance, in Europe the use of this list is mandatory in electronic invoices from businesses to public authorities. Just to name two industries: consumer goods uses global trade item numbers and global location numbers, which are identified with two different schemeIDs. The automotive industry, on the other hand, uses DUNS numbers, which are identified with a separate schemeID as well. Suppliers to the automotive industry often provide both IDs. And the PEPPOL network needs its own standardised IDs in addition, to route the information to the correct recipient. Again, the consequence has to be that someone needs to define the transformation from all those hundreds of globally standardised code lists and/or identifiers to a URN. This is possible, but will very likely make implementation of the vocabulary much more complicated.
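The generic rule from the first example, unclgeneric:{listID}:{agencyID}:{version}:{Content}, could be implemented along these lines. This is only a sketch: the function name, the percent-encoding of each part and all sample values ("RAL", "ExampleAgency", "2020", "3020") are assumptions for illustration, not real registry entries.

```python
from urllib.parse import quote


def generic_code_urn(list_id: str, agency_id: str, version: str, content: str) -> str:
    """Fill the unclgeneric template, escaping ':' inside any single part."""
    parts = (list_id, agency_id, version, content)
    return "unclgeneric:" + ":".join(quote(part, safe="") for part in parts)


# Hypothetical values for a colour code from an external code list.
print(generic_code_urn("RAL", "ExampleAgency", "2020", "3020"))
```

A mechanical rule like this avoids waiting for each issuing authority to define its own transformation, at the cost of every implementer needing the same listID, agencyID and version values.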

Any comments or ideas to solve this?