Closed: VladimirAlexiev closed this issue 7 years ago
The E54 Dimension instance you suggest attaching the display string to does not exist and should not be instantiated. The string relates to the object as a whole and should be attached directly to the instance of E22 Man-Made Object with P3 and P3.1. This can be instantiated using the CRMpc extension, which gives a robust RDF deployment method for the .1 properties.

I would prefer that the instances of E54 Dimension created for the width and height of the Mat had labels of the form "Object 23 width of Mat".

I am unclear whether the metric dimensions are genuine measurements or just mathematical conversions (the number of decimal places suggests conversion). If they are mathematical conversions then they should probably be dropped. There is definitely more than a single instance of E16 Measurement; there should be one for each dimension actually present. So in this case there are 6 or 12 instances of E16 Measurement, depending on whether the metric values are measurements or simply conversions. If they are conversions and you wish to have them represented in the data, rather than just creating them on the fly in the user interface, then you would have 2 instances of E54 Dimension connected by 2 instances of P40 to the same instance of E16 Measurement, and add a P2 has type [conversion] to the converted value (in this case the metric value). I would probably amend the label to "Object 23 width of Mat (converted from Imperial)" as well.
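A minimal Turtle sketch of that last pattern: two E54 Dimension instances (the real Imperial measurement plus its metric conversion) observed by a single E16 Measurement. All identifiers and the aac:conversion type are illustrative, not from any actual dataset; prefixes for crm:, aat: and aac: are assumed.

```turtle
:obj23_width_in a crm:E54_Dimension ;
    rdfs:label "Object 23 width of Mat" ;
    crm:P90_has_value 28.0 ;
    crm:P91_has_unit aat:inches .

:obj23_width_cm a crm:E54_Dimension ;
    rdfs:label "Object 23 width of Mat (converted from Imperial)" ;
    crm:P2_has_type aac:conversion ;
    crm:P90_has_value 71.12 ;
    crm:P91_has_unit aat:centimeters .

# One measurement event observed both dimensions.
:obj23_width_measurement a crm:E16_Measurement ;
    crm:P40_observed_dimension :obj23_width_in, :obj23_width_cm .
```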
I am very curious about this CRMpc extension—is there any documentation on it that I can read? Google is failing me, as are both the search box at the new CIDOC site and the search at http://www.ics.forth.gr.
Try http://new.cidoc-crm.org/technical_papers and then "Modelling properties of properties". There is a presentation and the RDF there. HTH
Would it make sense to model it as a linguistic object, not just as a note?
```turtle
:thing P129i_is_subject_of :dimension_string .
:dimension_string a E33_Linguistic_Object ;
    P3_has_note [TEXT] ;
    P2_has_type aac:dimension_string .
```
It would also allow us to attach aboutness to the object.
I would also recommend looking at the http://qudt.org ontology for our units—the AAT definitions are good as terms, but they don't relate to anything that you'd need if you need to use the dimensions in any mathematical way, like unit conversion.
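To illustrate why machine-usable units matter: QUDT expresses each unit with a conversion multiplier to a base unit, which turns conversion into a one-line calculation. The dictionary below is just a sketch of that idea, not QUDT's actual vocabulary or API; the factors themselves are standard.

```python
# Minimal sketch of QUDT-style unit conversion: each unit carries a
# multiplier to a common base unit (here, meters for length).
TO_METERS = {
    "inch": 0.0254,
    "centimeter": 0.01,
    "foot": 0.3048,
}

def convert(value, from_unit, to_unit):
    """Convert between length units via the base unit."""
    return value * TO_METERS[from_unit] / TO_METERS[to_unit]

# 28 inches is exactly 71.12 cm, matching the values in this thread.
print(convert(28, "inch", "centimeter"))
```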
Yes, linear dimensions are in centimeters. Yes, the one weight (3.63) is in kilograms.
If you use P3 has note it captures the required sense. You would only instantiate an instance of E33 Linguistic Object if the text is documented in its own right as a subject in the domain of interest. So no, I am afraid, it does not make sense.
So the suggestion for the textual representation of dimension is:
```turtle
:thing crmpc:P01i_is_domain_of :note_property .
:note_property a crmpc:PC3_has_note ;
    crmpc:P03_has_range_literal "Image: 44.1 x 36.2cm..." ;
    crmpc:P3.1_has_type aac:dimension_string .
```
Does that look right to people?
Will we use this mapping for all notes or only specific notes, and how will we determine when to use this pattern versus the straight P3_has_note mapping?
At least in the Provenance Index, and I anticipate in the Museum, we're going to simplify to:
```turtle
_:Object a E22_Man_Made_Object ;
    schema:height [
        a E54_Dimension ;
        P90_has_value 71.12 ;
        P91_has_unit qudt:cm ] ;
    schema:width [
        a E54_Dimension ;
        P90_has_value 55.88 ;
        P91_has_unit qudt:cm ] .
```
For different parts of the object, I like David's proposal to model them as distinct parts :) Much simpler, just as expressive, and it provides better hooks for future work.
Rob>model it as different parts
As you can see in the pivot, not all Elements are Parts. Some express a qualifier or mode.
schema:height
This modeling doesn't say what was measured. As I wrote "it's crucially important that DimItemElemXrefID groups measurements of the same (object,element). Emitting the dimensions without this grouping would be useless (same as in JPGM)". See https://share.getty.edu/display/JPGLODM/JPGM+Dimensions for how the data looks in TMS, and why it is necessary to group the dimensions.
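The grouping requirement can be sketched in a few lines of Python. The field names loosely follow the TMS columns mentioned above, and the row data is made up; the point is only that DimItemElemXrefID is the key that pairs a height with its matching width.

```python
from collections import defaultdict

# Hypothetical TMS-like rows: each row is one dimension value.
# DimItemElemXrefID ("xref") ties together values measured on the same
# (object, element); without it, heights and widths cannot be paired.
rows = [
    {"object": 23, "element": "Image", "xref": 1, "type": "height", "cm": 44.1},
    {"object": 23, "element": "Image", "xref": 1, "type": "width",  "cm": 36.2},
    {"object": 23, "element": "Sheet", "xref": 2, "type": "height", "cm": 55.9},
    {"object": 23, "element": "Sheet", "xref": 2, "type": "width",  "cm": 47.0},
]

groups = defaultdict(list)
for row in rows:
    groups[(row["object"], row["element"], row["xref"])].append(row)

for (obj, element, _), dims in sorted(groups.items()):
    print(obj, element, [(d["type"], d["cm"]) for d in dims])
```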
And does schema have props for all dimensions required across museums?
Steve>unclear if the metric dimensions are genuine measurements or just mathematical conversions
The metric values are the only values we got in the database.
David> crmpc:P01i_is_domain_of crmpc:PC3_has_note crmp:P03_has_range_literal
Did you make up these terms? Or is there an RDF definition of "CRMpc"?
Also see this comment in #20: "it is one of the most expensive ways, since it doubles the number of classes and triples the number of property types. There are better ways to attach type to a relation."
not all Elements are Parts
To strengthen this: @azaroth42, can you please model "Without Base" as an object part ;-)
See previous comment for link to RDF of CRMpc. The metric values are pretty obviously conversions and not real measurements (71.12014224!). The real measurements are then probably the Imperial measurements embedded in the text. How about parsing those?
Just got the updated RDF for CRMpc 1.1 https://www.dropbox.com/s/o8w8juaoci3lzo9/CRMpc_v1.1.rdfs?dl=0
@VladimirAlexiev: I'm not advocating for this technique specifically—just that we mutually agree on a technique for modeling the Pn.1 properties.
@steads: It makes me nervous to be modeling this using a vocabulary that is almost completely undocumented and still under development.
I'm also curious if there is a best-practices document, @steads, that describes how the CRM should be used. Your comment about Linguistic Objects makes sense, but is more restrictive than what is in the documentation for the CRM:
You would only instantiate an instance of E33 Linguistic Object if the text is documented in its own right as a subject in the domain of interest.
versus the documentation description:
This class comprises identifiable expressions in natural language or languages.
Is this your opinion as to what best practice should be, or is this a restriction on the use of the CRM that is formally documented somewhere?
Clearly "Other" and "Unspecified" can't be modeled as parts, but are just as meaningless in any other structure too. The state of case open/case closed would be hard to do as parts, I agree, but is an outlier. The rest of them seem to be parts (but happy to be corrected if they're not)
Without X, means there is a part that is X and a part that is the rest of the object without X. So I would model that as:
```turtle
_:Object has_part _:X, _:WithoutX .
_:WithoutX width _:WidthForObject ;
    height _:HeightForObject .
```
The question that I don't think can be answered from the very useful pivot table is how many objects have more than one set of dimensions. If that number is low, then the majority of objects can simply have dimensions associated with them directly.
@steads: Suggestion to rename "has_domain" to "subject", "has_range" to "object", and "has_range_literal" to "value". Reason: domain and range are the types of the subject and object in a triple, not those resources themselves.
@workergnome> curious if there is a best-practices document that describes how the CRM should be used
What Steve said is just common sense.
@azaroth42> "Other" and "Unspecified"
Yes, these values should just be skipped. But if you have two DimItemElemXrefID, say Image and Other, you still need to emit them as two Measurements.
The rest of them seem to be parts
Please read more carefully what I wrote. How about "Sight"? "Image/Sight"? How about "combination parts" like "Image/Sheet/Mount"? (we don't even know whether that is AND or OR)
is an outlier
Everything that doesn't conform to a theory is an outlier ;-)
Please read about CONA dimensions: https://share.getty.edu/pages/viewpage.action?spaceKey=ITSLODV&title=CONA+Dimensions:
I think that before modeling, you need to study the data more carefully. So I'm telling you from experience, people: these are NOT parts.
Image, Sheet and Mount are all parts (right?), so I would model that as:
```turtle
_:Object has_part _:ISM .
_:ISM has_part _:Image, _:Sheet, _:Mount ;
    height _:HeightForISM .
```
Or if the Image is part of the Sheet, the appropriate nesting of those two.
Can you point me to a definition of "Sight"? Is it that the measurements were estimated by sight, rather than made with a tool? Then yes, that requires a Measurement to express how the measurement was done, rather than what it is a measurement of.
Re has_domain ... why not just use RDF reification? That seems to be what has been reinvented.
We don't even know whether "Image/Sheet/Mount" means these together (AND) or some of them (OR).
Yes, Sight means "by eye".
Re has_domain: yes, in BM & CONA we used reification, but the "CRM reification" kind, which is E13_Attribute_Assignment.
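For comparison, the E13 pattern mentioned here looks roughly like this in Turtle (identifiers are illustrative, not from the BM or CONA data):

```turtle
# "CRM reification": an E13 Attribute Assignment node records who
# assigned which dimension to which object.
:assignment1 a crm:E13_Attribute_Assignment ;
    crm:P140_assigned_attribute_to :object ;
    crm:P141_assigned :dimension ;
    crm:P14_carried_out_by :registrar .
```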
If you can't tell what it means, you can't model it. Unless you intend to ask Patricia to add aat:ImageAndOrSheetAndOrMount and just move the problem to someone else. We should find out what it means from the people who can answer the question.
How the information was obtained always requires reification of the relationship, so for the qualifiers I agree we need another node. That shouldn't complicate the general case however, otherwise we end up reifying everything for every object.
I would prefer a single method, even if the complications are rarely used, rather than a general case and a special case. Mostly because when using the data I will either have to look for both options, or I'll end up ignoring all the special cases.
Maybe that's unavoidable, but it certainly makes the data harder to use.
I would also prefer a single method! The end result is that you end up reifying everything into millions of E13_Attribute_Assignments (or preferably, just, rdf:Statements) so you can record who said it, when, and why. That makes the data unusably complicated and no one looks for anything, ignoring all the cases not just the special ones :(
The million dollar (or hopefully not quite) question is: how special is the special case? Thankfully we have data: (294+11=) 395 / 32959, or 1.2%. So you make everyone's life more complicated in 99% of the cases for that final 1%. The cost of complicating the 99% to stay consistent with the 1% outweighs the cost of handling that 1% differently, in my opinion.
In terms of usage, I like the notion of "Ask forgiveness, not permission". In other words, try the 99% way and only if that fails only try the 1% way(s). That gives you scalability for simple applications doing something and then later adding the special cases, rather than having only very sophisticated applications that can do anything at all.
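The "ask forgiveness, not permission" approach could look like this in application code. This is a sketch over a toy in-memory graph stored as dicts, not a real triplestore API; the object identifiers and property names are made up.

```python
# Toy data: most objects carry the dimension directly (the 99% case);
# a few carry it only on a reified statement (the 1% case).
simple = {"obj1": {"height_cm": 71.12}}
reified = {
    "obj2": [{"property": "height_cm", "value": 55.88,
              "qualifier": "measured by sight"}],
}

def get_height(obj):
    # Try the 99% way first: the dimension hangs directly off the object.
    direct = simple.get(obj, {}).get("height_cm")
    if direct is not None:
        return direct
    # Only on failure, fall back to the 1% way: reified statements.
    for stmt in reified.get(obj, []):
        if stmt["property"] == "height_cm":
            return stmt["value"]
    return None

print(get_height("obj1"))
print(get_height("obj2"))
```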
I agree entirely with all of that. What I think I'm trying to figure out is at what point in the pipeline we apply that simplification, necessarily throwing out information. I see the pipeline we're talking about as:
Raw information -> Data Model -> API -> Application -> End User
or, more concretely:
Institutional Data Dump, which is transformed by Karma into CIDOC-CRM RDF files, which are loaded into a triplestore, then SPARQLed into JSON-LD entity documents, which are read by the Browse application code and transformed into the AAC Browse Website, which is read by All Y'all.
If the simplification happens at the Information -> Data Model step, the following steps can be much simpler, but it means the whole pipeline serves that one process. I'm OK with that, but I think the goals of the AAC are bigger than just the use case of the browse application.
If the simplification is happening at the API -> application level, it's a pain in the butt for developers to work with the complexity of the model, and nothing ever gets built.
If we add a level of indirection between the data model and the application and simplify the data at the Data Model -> API step, we can provide a nice, concise access point to the information and still preserve the ability to extend the API as needs arise. It's overkill for any one project, but it's probably a "good practice" for the project as a whole.
We're vastly off topic now but ...
My preference is that the difference between internal data (e.g. in the TripleStore) and the published data (via the API) are as close as possible. Preferably the API is "all the information the client needs to use this resource in JSON-LD". If so, then exactly how the system maintains the information is irrelevant if the API is just a particular graph boundary.
Otherwise we need profile based content negotiation and to pick a default representation -- in other words, the client needs to say whether it wants the data post or pre transformation in to the API structure. I would anticipate that the default would be the API structure, as by definition it's more useful. And then I would anticipate no one really ever using the non API data ... so having the API be LOD would be good... and hence having the two be as close as possible.
@azaroth42 Your "1.2%" calculation is flawed: you ignored some rows of my pivot, haven't seen the other AAC museums' data, and ignored the examples I gave from CONA and CCO. In cultural data, the exception is usually the rule.
I propose to model the TMS "Element" as an extension property crmx:P2_extent, because from my experience with museum data it cannot be modeled cleanly as "part". "Extent" is defined in CONA and LIDO, and is also used as a target for material/technique/implement, contribution, and subject.
How would you model this real example from CONA: "the St. Peter's basilica has height of dome above street level = 138m"? Ignore "dome" (drop data), or model it as a part (wrong)? The dome is a part, but the measurement is of drum+dome, not the dome alone.
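A sketch of how the crmx:P2_extent proposal could handle the dome example (identifiers are illustrative, and whether the extent value should be a literal or a controlled term is an open question):

```turtle
:st_peters crm:P43_has_dimension :dome_height .
:dome_height a crm:E54_Dimension ;
    crmx:P2_extent "height of dome above street level" ;
    crm:P90_has_value 138 ;
    crm:P91_has_unit aat:meters .
```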
What will you say about "pattern repeat", "laid lines" or "center back"? These are CRM Features, not Parts; the first two are repeating features (CRM has no such concept), and there's no way to tell Features apart from other Elements unless you include tons of specific coding in the mapping.
Schema does not have properties for all the weird dimensions you'll find in museums, e.g. "die axis" and "o'clock" for coins, circumference vs diameter, etc. That's why you need CRM's E54 Dimension.
AAT <size/dimensions by unit> includes about 25 units.
QUDT includes about 800 units, including conversion rules. But it's focused on science/engineering and doesn't include all AAT units. I introduced QUDT to the BM and it's used there; but again, museums record weird and wonderful things, and you won't find them all in QUDT.
Examples (some of these are not in AAT either, but we can add them through Patricia):
Of course, we can tie these extra units into the QUDT framework (e.g. to state that Pixels is dimensionless). But so far I haven't seen a use case for calculations with dimensions. And QUDT is not the end-all of scientific dimensions; e.g. see http://ci.emse.fr/multidimensional-quantity/
Some analysis of NPGDimsParsedUpdate2May.xlsx:
![image](https://cloud.githubusercontent.com/assets/536250/17510318/228f910e-5e27-11e6-9b1b-1ef40043725a.png)
Questions to @si-npg:
Observations
We use the following for JPGM (who also have TMS):
![image](https://cloud.githubusercontent.com/assets/536250/17510662/8f89e5ba-5e28-11e6-9863-f374115b9f17.png)
Consider the dimension data about one object:
I propose to map it to this RDF Turtle (first two rows are shown):
Notes:
crmx:P2_extent indicates what is being measured. We could easily replace it with the standard crm:P2_has_type, since there is only one "type" in this case. But in CONA/JPGM the same idea of "extent" is also used for Subject and Agent Contribution, which is why we thought it a good idea to make a sub-property.

crmx:sort_order would be necessary to reconstruct the display dimension. Since we're emitting each dimension as a separate node, we can skip it.

@kateblanch @edgartdata @steads @azaroth42 What do you think?