american-art / npg

National Portrait Gallery
Creative Commons Zero v1.0 Universal

NPGDims #54

Closed VladimirAlexiev closed 7 years ago

VladimirAlexiev commented 7 years ago

Some analysis of NPGDimsParsedUpdate2May.xlsx: image

Questions to @si-npg:

Observations

We use the following for JPGM (who also have TMS): image

Consider the dimension data about one object:

ObjectID  DimensionID  DimItemElemXrefID  Element  DimensionType  ElementRank  DimRank  Dimension
23        349187       168874             Mat      Height         3            1        71.12014224
23        349188       168874             Mat      Width          3            2        55.88011176
23        213081       105668             Image    Width          1            2        36.2
23        213082       105668             Image    Height         1            1        44.1
23        213079       105667             Sheet    Width          2            2        37.7
23        213080       105667             Sheet    Height         2            1        50.2

I propose to map it to this RDF Turtle (first two rows are shown):

@base <http://americanartcollaborative.org/>.
@prefix crm:  <http://www.cidoc-crm.org/cidoc-crm/>.
@prefix crmx: <http://americanartcollaborative.org/crm-ext/>.
@prefix aat:  <http://vocab.getty.edu/aat/>.

<npg/object/23>
  crm:P43_has_dimension <npg/object/23/dimension>;
  crm:P39i_was_measured_by
    <npg/object/23/measurement/168874>, <npg/object/23/measurement/105668>, <npg/object/23/measurement/105667>.

<npg/object/23/dimension> a crm:E54_Dimension;
  crm:P3_has_note
    """Image: 44.1 x 36.2cm (17 3/8 x 14 1/4")  Sheet: 50.2 x 37.7cm (19 3/4 x 14 13/16")  Mat: 71.1 x 55.9cm (28 x 22")""".

<npg/object/23/measurement/168874> a crm:E16_Measurement;
  crmx:P2_extent aat:300236006; # mat (framing and mounting equipment)
  crmx:sort_order 3; # ElementRank
  crm:P40_observed_dimension <npg/object/23/dimension/349187>, <npg/object/23/dimension/349188>.

<npg/object/23/dimension/349187> a crm:E54_Dimension;
  crm:P2_has_type aat:300055644; # height
  crm:P91_has_unit aat:300379098; # centimeters
  crm:P90_has_value 71.12014224;
  crmx:sort_order 1. # DimRank

<npg/object/23/dimension/349188> a crm:E54_Dimension;
  crm:P2_has_type aat:300055647; # width
  crm:P91_has_unit aat:300379098; # centimeters
  crm:P90_has_value 55.88011176;
  crmx:sort_order 2. # DimRank
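
As a rough illustration of the grouping this mapping relies on, here is a minimal Python sketch that groups the six sample rows by (ObjectID, DimItemElemXrefID) and prints Turtle in the shape proposed above. The function names `group_rows` and `to_turtle` are made up for this example, the AAT lookups contain only the ids that appear in this thread, and all values are assumed to be centimeters. It is not the Karma mapping itself.

```python
from collections import defaultdict

# Rows copied from the table above. Field order: ObjectID, DimensionID,
# DimItemElemXrefID, Element, DimensionType, ElementRank, DimRank, Dimension.
# Values are kept as strings so they are emitted exactly as stored.
ROWS = [
    (23, 349187, 168874, "Mat", "Height", 3, 1, "71.12014224"),
    (23, 349188, 168874, "Mat", "Width", 3, 2, "55.88011176"),
    (23, 213081, 105668, "Image", "Width", 1, 2, "36.2"),
    (23, 213082, 105668, "Image", "Height", 1, 1, "44.1"),
    (23, 213079, 105667, "Sheet", "Width", 2, 2, "37.7"),
    (23, 213080, 105667, "Sheet", "Height", 2, 1, "50.2"),
]

# AAT identifiers as used in the Turtle above. Only ids that appear in this
# thread are listed; a real mapping would need a complete lookup (assumption).
DIMENSION_TYPE = {"Height": "aat:300055644", "Width": "aat:300055647"}
ELEMENT_EXTENT = {"Mat": "aat:300236006"}
CENTIMETERS = "aat:300379098"


def group_rows(rows):
    """Group rows by (ObjectID, DimItemElemXrefID): one group per measurement."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[0], row[2])].append(row)
    # Order each group by DimRank so that height precedes width.
    return {key: sorted(members, key=lambda r: r[6]) for key, members in groups.items()}


def to_turtle(rows):
    """Emit one E16_Measurement per group and one E54_Dimension per row."""
    blocks = []
    for (obj, xref), members in sorted(group_rows(rows).items()):
        element, element_rank = members[0][3], members[0][5]
        lines = [f"<npg/object/{obj}/measurement/{xref}> a crm:E16_Measurement;"]
        if element in ELEMENT_EXTENT:  # skipped when no AAT id is known
            lines.append(f"  crmx:P2_extent {ELEMENT_EXTENT[element]}; # {element}")
        lines.append(f"  crmx:sort_order {element_rank}; # ElementRank")
        dims = ", ".join(f"<npg/object/{obj}/dimension/{r[1]}>" for r in members)
        lines.append(f"  crm:P40_observed_dimension {dims}.")
        blocks.append("\n".join(lines))
        for _, dim_id, _, _, dim_type, _, dim_rank, value in members:
            blocks.append(
                f"<npg/object/{obj}/dimension/{dim_id}> a crm:E54_Dimension;\n"
                f"  crm:P2_has_type {DIMENSION_TYPE[dim_type]}; # {dim_type.lower()}\n"
                f"  crm:P91_has_unit {CENTIMETERS}; # centimeters\n"
                f"  crm:P90_has_value {value};\n"
                f"  crmx:sort_order {dim_rank}. # DimRank"
            )
    return "\n\n".join(blocks)


print(to_turtle(ROWS))
```

Grouping before emitting is the point: each DimItemElemXrefID becomes one measurement node, so the height and width of the same element stay together.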

Notes:

@kateblanch @edgartdata @steads @azaroth42 What do you think?

steads commented 7 years ago

The E54 Dimension instance you suggest for attaching the display string to does not exist and should not be instantiated. The string is related to the object as a whole and should be attached directly to the instance of E22 Man-Made Object with P3 and P3.1. This can be instantiated using the CRMpc extension, which gives a robust RDF deployment method for the .1 properties.

I would prefer that the instances of E54 Dimension that are created for the width and height of the Mat had labels of the form "Object 23 width of Mat".

I am unclear whether the metric dimensions are genuine measurements or just mathematical conversions (the number of decimal places suggests conversion). If they are mathematical conversions then they should probably be dropped.

There is definitely more than a single instance of E16 Measurement; there should be one for each dimension actually present. So in this case there are 6 or 12 instances of E16 Measurement, depending on whether the metric measurements are measurements or simply conversions. If they are conversions and you wish to have them represented in the data rather than just creating them on the fly in the user interface, then you would have 2 instances of E54 Dimension connected by 2 instances of P40 to the same instance of E16 Measurement, and add a P2 has type [conversion] to the converted value (in this case the metric value). I would probably amend the label to "Object 23 width of Mat (converted from Imperial)" as well.
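
One rough way to test the "number of decimal places suggests conversion" hunch is to check whether a centimeter value lands almost exactly on a whole fraction of an inch. The Python sketch below is a heuristic only. The function names, the choice of sixteenths and the tolerance are all assumptions made for illustration, and it uses the exact factor 2.54 cm per inch.

```python
def nearest_inch_fraction(cm, denominator=16, cm_per_inch=2.54):
    """Return the inch value, rounded to the nearest 1/denominator, closest to cm."""
    inches = cm / cm_per_inch
    return round(inches * denominator) / denominator


def looks_converted(cm, denominator=16, tolerance=5e-4, cm_per_inch=2.54):
    """True if cm is within `tolerance` inches of a clean inch fraction.

    Heuristic only: a genuine metric measurement can land on a clean
    inch fraction by coincidence.
    """
    inches = cm / cm_per_inch
    return abs(inches - nearest_inch_fraction(cm, denominator, cm_per_inch)) < tolerance


# Values taken from the sample rows for object 23.
for label, cm in [
    ("Mat height", 71.12014224),
    ("Mat width", 55.88011176),
    ("Image height", 44.1),
    ("Image width", 36.2),
]:
    print(label, cm, looks_converted(cm))
```

On the sample rows this flags only the two Mat values, which sit within a rounding error of 28 and 22 inches; the Image values do not land on a sixteenth of an inch at this tolerance.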

workergnome commented 7 years ago

I am very curious about this CRMpc extension—is there any documentation on it that I can read? Google is failing me, as are both the search box at the new CIDOC site and the search at http://www.ics.forth.gr.

steads commented 7 years ago

Try http://new.cidoc-crm.org/technical_papers and then "modelling properties of properties". There is a presentation and the RDF. HTH

workergnome commented 7 years ago

Would it make sense to model it as a linguistic object, not just as a note?

:thing crm:P129i_is_subject_of :dimension_string.
:dimension_string a crm:E33_Linguistic_Object;
        crm:P3_has_note "[TEXT]";
        crm:P2_has_type aac:dimension_string.

It would also allow us to attach aboutness to the object.

I would also recommend looking at the http://qudt.org ontology for our units—the AAT definitions are good as terms, but they don't relate to anything that you'd need if you need to use the dimensions in any mathematical way, like unit conversion.
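
To make the unit conversion point concrete: a term-only vocabulary gives each unit a name, while conversion needs a numeric factor per unit. The Python sketch below uses a small hand-written table of multipliers to SI base units as a stand-in for what a units ontology with conversion rules would supply. The table and the `convert` function are illustrative assumptions and are not taken from QUDT or AAT data.

```python
# Multiplier from each unit to the SI base unit of its kind (metre, kilogram).
# Hand-written for this example; a units ontology would provide these factors.
TO_SI = {
    "cm": ("length", 0.01),
    "in": ("length", 0.0254),
    "kg": ("mass", 1.0),
    "lb": ("mass", 0.45359237),
}


def convert(value, from_unit, to_unit):
    """Convert value between two units of the same kind via the SI base unit."""
    from_kind, from_factor = TO_SI[from_unit]
    to_kind, to_factor = TO_SI[to_unit]
    if from_kind != to_kind:
        raise ValueError(f"cannot convert {from_kind} to {to_kind}")
    return value * from_factor / to_factor


print(convert(71.12, "cm", "in"))  # mat height in inches
print(convert(3.63, "kg", "lb"))   # the one weight value mentioned in this thread
```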

si-npg commented 7 years ago

Yes, linear dimensions are in Centimeters. Yes, the one Weight=3.63 is Kilograms.

steads commented 7 years ago

If you use P3 has note it captures the required sense. You would only instantiate an instance of E33 Linguistic Object if the text is documented in its own right as a subject in the domain of interest. So no, I am afraid, it does not make sense.

workergnome commented 7 years ago

So the suggestion for the textual representation of dimension is:

:thing crmpc:P01i_is_domain_of :note_property.
:note_property a crmpc:PC3_has_note;
    crmpc:P03_has_range_literal "Image: 44.1 x 36.2cm...";
    crmpc:P3.1_has_type aac:dimension_string.

That look right to people?

Will we use this mapping for all notes, or only specific notes, and how will we determine when to use this pattern or when to use the straight P3_has_note mapping?

azaroth42 commented 7 years ago

At least in the Provenance Index, and I anticipate in the Museum, we're going to simplify to:

_:Object a E22_Man_Made_Object ;
  schema:height [
    a E54_Dimension ;
    p90_value 71.12 ;
    p91_has_unit qudt:cm ] ;
  schema:width [
    a E54_Dimension ;
    p90_value 55.88 ;
    p91_has_unit qudt:cm ] .

For different parts of the object, I like the proposal that David made to model it as different parts of the object :) Much simpler, just as expressive, provides better hooks for future work.

VladimirAlexiev commented 7 years ago

Rob>model it as different parts

As you see in the pivot, not all Elements are Parts. Some express qualifier or mode.

schema:height

This modeling doesn't say what was measured. As I wrote "it's crucially important that DimItemElemXrefID groups measurements of the same (object,element). Emitting the dimensions without this grouping would be useless (same as in JPGM)". See https://share.getty.edu/display/JPGLODM/JPGM+Dimensions for how the data looks in TMS, and why it is necessary to group the dimensions.

And does schema have props for all dimensions required across museums?

Steve>unclear if the metric dimensions are genuine measurements or just mathematical conversions

The metric values are the only values we got in the database.

David> crmpc:P01i_is_domain_of crmpc:PC3_has_note crmpc:P03_has_range_literal

Did you make up these terms? Or is there an RDF definition of "CRMpc"?

Also see this comment in #20: "it is one of the most expensive ways, since it doubles the number of classes and triples the number of property types. There are better ways to attach type to a relation."

VladimirAlexiev commented 7 years ago

not all Elements are Parts

To strengthen this: @azaroth42, can you please model "Without Base" as an object part ;-)

steads commented 7 years ago

See previous comment for link to RDF of CRMpc. The metric values are pretty obviously conversions and not real measurements (71.12014224!). The real measurements are then probably the Imperial measurements embedded in the text. How about parsing those?
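
A minimal sketch of what parsing the Imperial values out of the display string could look like, in Python. It assumes the display string always follows the "Element: H x Wcm (H x W")" pattern of the one example in this thread, with height first (inferred from the Image row, where Height is 44.1). Real TMS display strings will vary, so the pattern and the names `PATTERN`, `parse_inches` and `parse_display` are illustrative only.

```python
import re
from fractions import Fraction

# Display string for object 23, copied from the proposed Turtle above.
DISPLAY = (
    'Image: 44.1 x 36.2cm (17 3/8 x 14 1/4")  '
    'Sheet: 50.2 x 37.7cm (19 3/4 x 14 13/16")  '
    'Mat: 71.1 x 55.9cm (28 x 22")'
)

# One match per element: name, metric height/width, Imperial height/width.
PATTERN = re.compile(
    r'(?P<element>\w+):\s*'
    r'(?P<h_cm>[\d.]+) x (?P<w_cm>[\d.]+)cm\s*'
    r'\((?P<h_in>[\d /]+) x (?P<w_in>[\d /]+)"\)'
)


def parse_inches(text):
    """Parse '17 3/8' or '28' into an exact Fraction of inches."""
    return sum((Fraction(part) for part in text.split()), Fraction(0))


def parse_display(display):
    """Return one dict per element found in the display string."""
    return [
        {
            "element": m["element"],
            "height_in": parse_inches(m["h_in"]),
            "width_in": parse_inches(m["w_in"]),
            "height_cm": float(m["h_cm"]),
            "width_cm": float(m["w_cm"]),
        }
        for m in PATTERN.finditer(display)
    ]


for row in parse_display(DISPLAY):
    print(row["element"], row["height_in"], "x", row["width_in"], "in")
```

Keeping the inch values as fractions avoids introducing new rounding on top of whatever rounding the display string already contains.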

steads commented 7 years ago

Just got the updated RDF for CRMpc 1.1 https://www.dropbox.com/s/o8w8juaoci3lzo9/CRMpc_v1.1.rdfs?dl=0

workergnome commented 7 years ago

@VladimirAlexiev: I'm not advocating for this technique specifically—just that we mutually agree on a technique for modeling the Pn.1 properties.

@steads: It makes me nervous to be modeling this using a vocabulary that is almost completely undocumented and still under development.

I'm also curious if there is a best-practices document, @steads, that describes how the CRM should be used. Your comment about Linguistic Objects makes sense, but is more restrictive than what is in the documentation for the CRM:

You would only instantiate an instance of E33 Linguistic Object if the text is documented in its own right as a subject in the domain of interest.

versus the documentation description:

This class comprises identifiable expressions in natural language or languages.

Is this your opinion as to what best practices should be, or is this a restriction on the use of the CRM that is formally documented somewhere?

azaroth42 commented 7 years ago

Clearly "Other" and "Unspecified" can't be modeled as parts, but are just as meaningless in any other structure too. The state of case open/case closed would be hard to do as parts, I agree, but is an outlier. The rest of them seem to be parts (but happy to be corrected if they're not).

Without X, means there is a part that is X and a part that is the rest of the object without X. So I would model that as:

_:Object has_part _:X, _:WithoutX .
_:WithoutX width _:WidthForObject ; height _:HeightForObject .

The question that I don't think can be answered from the very useful pivot table is how many objects have more than one set of dimensions. If that number is low, then the majority of objects can simply have dimensions associated with them directly.

VladimirAlexiev commented 7 years ago

@steads: Suggestion to rename "has_domain" to "subject", has_range to "object", "has_range_literal" to "value". Reason: Domain and Range are the types of the subject and object in a triple, not these resources themselves.

@workergnome> curious if there is a best-practices document that describes how the CRM should be used

What Steve said is just common sense.

What are "Elements"

@azaroth42> "Other" and "Unspecified"

Yes, these values should just be skipped. But if you have two DimItemElemXrefID, say Image and Other, you still need to emit them as two Measurements.

The rest of them seem to be parts

Please read more carefully what I wrote. How about "Sight"? "Image/Sight"? How about "combination parts" like "Image/Sheet/Mount"? (we don't even know whether that is AND or OR)

is an outlier

Everything that doesn't conform to a theory is an outlier ;-)

Please read about CONA dimensions: https://share.getty.edu/pages/viewpage.action?spaceKey=ITSLODV&title=CONA+Dimensions:

I think that before modeling, you need to study the data more carefully. So I'm telling you from experience, people: these are NOT parts.

azaroth42 commented 7 years ago

Image, Sheet and Mount are all parts (right?), so I would model that as:

_:Object hasPart _:ISM .
_:ISM hasPart _:Image, _:Sheet, _:Mount ;
          height _:HeightForISM .

Or if the Image is part of the Sheet, the appropriate nesting of those two.

Can you point me to a definition of "Sight"? Is it that the measurements were estimated by sight, rather than with a tool? Then yes, that requires a Measurement to express how the measurement was done, rather than what the measurement is of.

Re has_domain ... why not just use RDF reification? That seems to be what has been reinvented.

VladimirAlexiev commented 7 years ago

We don't even know whether "Image/Sheet/Mount" means these together (AND) or some of them (OR).

Yes, Sight means "by eye".

Re has_domain: yes, in BM & CONA we used reification, but the "CRM reification" kind, which is E13_Attribute_Assignment.

azaroth42 commented 7 years ago

If you can't tell what it means, you can't model it. Unless you intend to ask Patricia to add aat:ImageAndOrSheetAndOrMount and just move the problem to someone else. We should find out what it means from the people who can answer the question.

How the information was obtained always requires reification of the relationship, so for the qualifiers I agree we need another node. That shouldn't complicate the general case however, otherwise we end up reifying everything for every object.

workergnome commented 7 years ago

I would prefer a single method, even if the complications are rarely used, rather than a general case and a special case. Mostly because when using the data I will either have to look for both options, or I'll end up ignoring all the special cases.

Maybe that's unavoidable, but it certainly makes the data harder to use.

azaroth42 commented 7 years ago

I would also prefer a single method! The end result is that you end up reifying everything into millions of E13_Attribute_Assignments (or preferably, just, rdf:Statements) so you can record who said it, when, and why. That makes the data unusably complicated and no one looks for anything, ignoring all the cases not just the special ones :(

The million dollar (or hopefully not quite) question is how special is the special case? Thankfully we have data: (294+11=) 395 / 32959, or 1.2%. So you make the lives of everyone more complicated in 99% of the cases, for that final 1%. The cost of doing the 1% differently is outweighed by the cost of doing the 99% consistently with it, in my opinion.
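
The share quoted above can be checked with a few lines of Python. Note that the sum as written (294 + 11) comes to 305, while 395 is the numerator that gives 1.2%. The thread does not say which figure was intended, so this sketch simply computes both. The helper name `share` is made up for the example.

```python
def share(special, total):
    """Fraction of rows that fall into the special case."""
    return special / total


TOTAL = 32959  # denominator quoted in the comment above

for label, special in [
    ("numerator as quoted (395)", 395),
    ("sum as written (294 + 11)", 294 + 11),
]:
    print(f"{label}: {share(special, TOTAL):.2%}")
```

Either way the special case is on the order of 1% of the rows counted in that calculation.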

In terms of usage, I like the notion of "Ask forgiveness, not permission". In other words, try the 99% way and only if that fails try the 1% way(s). That gives you scalability for simple applications doing something and then later adding the special cases, rather than having only very sophisticated applications that can do anything at all.
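
In client code, "try the 99% way first" might look like the Python sketch below. The record shapes are hypothetical (a flat `height`/`width` pair for the common case, a `measurements` list for the special case) and are not an agreed AAC format. The function name `dimensions_for` is likewise made up.

```python
def dimensions_for(record):
    """Return {element: {"height": ..., "width": ...}} for a record.

    Tries the simple shape first and falls back to grouped measurements
    only when the simple keys are missing.
    """
    try:
        # Common case: dimensions attached directly to the object.
        return {"Object": {"height": record["height"], "width": record["width"]}}
    except KeyError:
        # Special case: several measurements, one per element.
        return {
            m["element"]: {"height": m["height"], "width": m["width"]}
            for m in record["measurements"]
        }


# Hypothetical records, for illustration only.
simple = {"id": "example/simple", "height": 71.12, "width": 55.88}
grouped = {
    "id": "example/grouped",
    "measurements": [
        {"element": "Image", "height": 44.1, "width": 36.2},
        {"element": "Sheet", "height": 50.2, "width": 37.7},
    ],
}

print(dimensions_for(simple))
print(dimensions_for(grouped))
```

A simple application can stop at the first branch and add the fallback later, which is the scalability argument made above. The catch raised elsewhere in this thread still applies: a consumer that never adds the fallback silently ignores the special cases.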

workergnome commented 7 years ago

I agree entirely with all of that. I think what I'm trying to figure out is at what point in the pipeline we apply that simplification, necessarily throwing out information. I see the pipeline we're talking about as:

Raw information -> Data Model -> API -> Application -> End User

or, more concretely:


Institutional Data Dump which is transformed by Karma into CIDOC-CRM RDF Files which are loaded into a Triplestore, then SPARQLed into JSON-LD Entity documents which are read by the Browse application code, and transformed into AAC Browse Website which is read by All Y'all.


If the simplification happens at the information -> data model step, the following steps can be much simpler, but it means that the whole pipeline is used for that one process. I'm OK with that, but I think that the goals of the AAC are bigger than just the use case of the browse application.

If the simplification is happening at the API -> application level, it's a pain in the butt for developers to work with the complexity of the model, and nothing ever gets built.

If we add a level of indirection between the data model and the application and simplify the data at the Data Model -> API, we can provide a nice, concise access point to information, and still preserve the ability to extend the API as needs occur. It's overkill for any one project, but it's probably a "good practice" for the project as a whole.
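
A minimal Python sketch of simplifying at the Data Model -> API step: the full grouped measurements stay in the data model, and the API document exposes a concise summary plus the complete list. The document shape, the `to_api_document` name and the rule of treating the lowest-ranked element as the headline dimensions are assumptions made for illustration, not something decided in this thread.

```python
def to_api_document(object_id, measurements, unit="cm"):
    """Build a concise API document from grouped measurements.

    measurements: list of dicts with "element", "rank" and "dimensions"
    (a mapping of dimension type to value).
    """
    ordered = sorted(measurements, key=lambda m: m["rank"])
    primary = ordered[0]  # assumption: lowest rank is the headline element
    return {
        "id": object_id,
        "height": primary["dimensions"].get("Height"),
        "width": primary["dimensions"].get("Width"),
        "unit": unit,
        # Nothing is thrown away: every measurement is still listed.
        "all_measurements": [
            {"element": m["element"], **m["dimensions"]} for m in ordered
        ],
    }


# Object 23 from the sample rows, with the Mat values as stored.
MEASUREMENTS = [
    {"element": "Mat", "rank": 3, "dimensions": {"Height": 71.12014224, "Width": 55.88011176}},
    {"element": "Image", "rank": 1, "dimensions": {"Height": 44.1, "Width": 36.2}},
    {"element": "Sheet", "rank": 2, "dimensions": {"Height": 50.2, "Width": 37.7}},
]

print(to_api_document("npg/object/23", MEASUREMENTS))
```

Because the summary is derived rather than stored, the API can grow new fields later without changing the underlying model.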

azaroth42 commented 7 years ago

We're vastly off topic now but ...

My preference is that the internal data (e.g. in the TripleStore) and the published data (via the API) are as close as possible. Preferably the API is "all the information the client needs to use this resource in JSON-LD". If so, then exactly how the system maintains the information is irrelevant if the API is just a particular graph boundary.

Otherwise we need profile-based content negotiation and to pick a default representation -- in other words, the client needs to say whether it wants the data post or pre transformation into the API structure. I would anticipate that the default would be the API structure, as by definition it's more useful. And then I would anticipate no one really ever using the non-API data ... so having the API be LOD would be good... and hence having the two be as close as possible.

VladimirAlexiev commented 7 years ago

Elements

@azaroth42 Your calculation "1.2%" is flawed, since you ignored some rows of my pivot, haven't seen other AAC museum data, and ignored the examples I gave from CONA and CCO. In cultural data, the exception is usually the rule.

I propose to model TMS "Element" as an extension prop crmx:P2_extent because from my experience with museum data, that cannot be modeled cleanly as "part". "Extent" is defined in CONA and LIDO, and is also used as a target for: material/technique/implement, contribution, subject.

How would you model this real example from CONA: "St. Peter's Basilica has height of dome above street level = 138m"? Ignore "dome" (drop data) or model it as a part (wrong)? The dome is a part, but the measurement is of drum+dome, not dome alone.

What will you say about "pattern repeat", "laid lines" or "center back"? These are CRM Features not Parts, the first two are repeating features (CRM has no such concept), and there's no way to tell Features apart from other Elements... unless you include tons of specific coding in the mapping.

Dimension Types

Schema does not have all the weird dimensions you'll find in museums, e.g. "die axis" and "o'clock" for coins, circumference vs diameter, etc. That's why you need CRM's Dimension.

Dimension Units

AAT <size/dimensions by unit> includes about 25 units.

QUDT includes about 800 units, including conversion rules. But it's focused on science/engineering and doesn't include all AAT units. I introduced QUDT to BM and it's used there: but again, museums record weird and wonderful things, and you won't find them all in QUDT.

Examples (some of these are not in AAT either, but we can add them through Patricia):

Of course, we can tie up these extra units into the QUDT framework (eg to state that Pixels is dimensionless). But so far I haven't seen a use case for calculations with dimensions. Of course, QUDT is not the end-all of scientific dimensions. Eg see http://ci.emse.fr/multidimensional-quantity/