Ontologizing CIF data - Githubissues

emmo-repo / CIF-ontology

Basic crystallography domain ontology based on EMMO and the CIF core dictionary.

Creative Commons Attribution 4.0 International

Ontologizing CIF data #16

Closed CasperWA closed 3 years ago

CasperWA commented 3 years ago

In a discussion between myself, @jesper-friis and @emanueleghedini, we tried to tackle the hurdles for getting development on this ontology started in a practical way. In other words, we tried setting up a basic taxonomy and parthood graph for CIF data. By CIF data we mean the semantics of the actual data (the values) not the semantics of the associated keywords. However, since the values are represented by their keywords/data names, these have been used in the mock up.

The resulting graph is shown below. [Figure: CIF EMMO]

The graph has essentially two important parts. One pertains to the hierarchy of the data, the other relates to the semantics of the data types. For the hierarchy, we see that CIF_DATA has a part loop_. This is not to be taken as the syntactical loop_, but rather the concept of the CIF data expressed as loops. A loop has a part ROW, which is our attempt to define a collection of a single row of data within a loop. Note that we do not care here whether the CIF file syntactically defines a ROW using key+value lines or as part of a syntactic loop_. ROW encompasses both as the same concept semantically. Now we come to how one may practically extend the ontology. Here we have added the concept of _space_group_symop_[], which isA ROW and hasParts _space_group_symop_id, _space_group_symop_operation_xyz, and _space_group_symop_sg_id. There is a restriction of how many times _space_group_symop_[] can have each of these parts (max 1). Now, all of these are SPACE_GROUP_SYMOP, i.e., they are of the CIF category SPACE_GROUP_SYMOP.

For the data types, you can see that we have given each of the CIF data names types according to the type definitions of the CIF dictionary. In CIFv1 (the only version we are currently concerned with) there are only three types: null, char, and numb (REF). Each of the three data types has been defined and also related to the general xsd types via the types defined in EMMO. This creates a data type relationship for all CIF data to that of EMMO.

Now if one wants to extend this, you would simply add another null type/category_overview CIF data as a sub-class of both ROW, cif:null, and its associated category, and afterwards add all its containing data keys/names as parts of it, sub-classing both the category (again) and the related type.
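
A rough sketch of this extension recipe in Python, printing Turtle for one ROW sub-class and its parts. The prefix IRIs, the label-style emmo:hasSpatialDirectPart name, the row class name space_group_symop_ROW (the brackets in _space_group_symop_[] cannot be used in a Turtle local name) and the types assigned to each data name are all assumptions made for illustration, not taken from the cif-data branch.

```python
# Placeholder prefixes; the real ontology IRIs will differ.
PREFIXES = """\
@prefix cif:  <http://example.org/cif-data#> .
@prefix emmo: <http://example.org/emmo#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
"""


def row_class_turtle(row, category, parts):
    """Return Turtle for a ROW sub-class; parts maps data name -> CIF type."""
    blocks = []
    # Each data name sub-classes both its category and its CIF type.
    for name, cif_type in parts.items():
        blocks.append(
            f"cif:{name} a owl:Class ;\n"
            f"    rdfs:subClassOf cif:{category}, cif:{cif_type} .\n"
        )
    # One "max 1" qualified cardinality restriction per part.
    restrictions = " ,\n".join(
        "    [ a owl:Restriction ;\n"
        "      owl:onProperty emmo:hasSpatialDirectPart ;\n"
        '      owl:maxQualifiedCardinality "1"^^xsd:nonNegativeInteger ;\n'
        f"      owl:onClass cif:{name} ]"
        for name in parts
    )
    blocks.append(
        f"cif:{row} a owl:Class ;\n"
        f"    rdfs:subClassOf cif:ROW, cif:null, cif:{category} ,\n"
        f"{restrictions} .\n"
    )
    return PREFIXES + "\n" + "\n".join(blocks)


print(row_class_turtle(
    "space_group_symop_ROW",
    "SPACE_GROUP_SYMOP",
    {
        "_space_group_symop_id": "numb",             # illustrative type
        "_space_group_symop_operation_xyz": "char",  # illustrative type
        "_space_group_symop_sg_id": "numb",          # illustrative type
    },
))
```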

Finally, this can be automated by going through the actual .dic file, which defines all the relevant metadata for each data key/name (link to coreCIF .dic-file).

This is not meant to be the absolute way of ontologizing CIF data, but rather, it is our currently suggested way of doing it. This issue is meant to be a discussion of its validity and one can suggest or ask questions freely.

As an added bonus, I have created a branch where one can see the implementation of the graph above into a Turtle file in the current repository (cif-data). If you checkout this branch and open the Turtle file cif-data.ttl in Protégé, you should see the suggested implementation, which could act as a template for adding more CIF data keys/names (the added concepts are marked in bold).

CasperWA commented 3 years ago

@jesper-friis and @emanueleghedini I am not completely sure I implemented the hierarchy correctly in the example branch, please update it as you see fit. Also, I tried implementing the max 1 requirement, but to get the reasoner to work, I had to use hasSpatialDirectPart, as it's asymmetric. But whenever I saved it in the Turtle format and reloaded Protégé, the parthoods were undone and were instead Errors?

jesper-friis commented 3 years ago

I have a few suggestions:

change language of prefLabel from Danish to English : @da -> @en
use skos namespace for prefLabel : http://www.w3.org/2004/02/skos/core#prefLabel -> skos:prefLabel
don't import crystallography.ttl from cif-data.ttl. emmo must still be imported so you have hasSpatialDirectPart available. Let CIF be a subclass of emmo:Language
import cif-data.ttl from crystallography.ttl and make CIF a subclass of Crystallographical

jesper-friis commented 3 years ago

By the way, Protégé behaved very strangely for me when I was moving around within the CIF branch. Not sure why. Could be the max 1 restrictions.

jamesrhester commented 3 years ago

Here are some questions and comments, bearing in mind that I'm not that familiar with the terminology here:

What is the sense of 'row isA category' i.e. _space_group_symop_[] isA SPACE_GROUP_SYMOP? Wouldn't it make more sense to say that the category consists of many rows?
Is it true that a cif:char is always an identifier? Couldn't it just be a plain string? I would have thought that an identifier isA cif:char
It appears that the uniqueness of the category key (given by DDL1 tag list_uniqueness) is not captured.
Numb values in a CIF data file can have a standard uncertainty (su) appended. This is an annoying technical point which we have recently cleared up by requiring that a separate data name is defined for the su of any data name that can have an su. So you probably want to make clear at some point that a cif:numb can have two parts, one of which is an su.
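
To illustrate the two parts of such a value, here is a minimal Python sketch that splits a numb written with an appended su, e.g. 5.4321(12), into the value and the su. It assumes the usual convention that the digits in parentheses apply to the last quoted digits of the value, and it does not handle exponent notation; split_numb is a hypothetical helper name.

```python
import re
from decimal import Decimal

# A plain decimal number, optionally followed by su digits in parentheses.
_NUMB_SU = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+))(?:\((\d+)\))?$")


def split_numb(text):
    """Split a CIF numb such as '5.4321(12)' into (value, su)."""
    match = _NUMB_SU.match(text)
    if match is None:
        raise ValueError(f"not a simple numb value: {text!r}")
    value = Decimal(match.group(1))
    if match.group(2) is None:
        return value, None  # no su was appended
    # Scale the su digits to the last decimal place of the value.
    exponent = value.as_tuple().exponent
    su = Decimal(int(match.group(2))).scaleb(exponent)
    return value, su


print(split_numb("5.4321(12)"))
print(split_numb("90"))
```

Decimal is used instead of float so that the number of quoted digits, which determines the scale of the su, is not lost.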

As a comment: when describing CIF data you are describing a relational model. Some options for relational models include: a row-based description as here; a column-based description, in which case the values attached to the data names are always equal-length arrays of values; and a functional description.

The functional description is particularly interesting as it maps simply to mathematical category theory. In a functional description, each of the data names in a CIF category is the name of a function mapping from the key data names (which are not identified explicitly as such in DDL1, but the list_uniqueness DDL1 attribute tries to capture that meaning) to the domain of the CIF type. A particular data file is then an "instance" of these mappings in the same way as a database is an instance of the database schema. I do not know if emmo contains the vocabulary for describing things in this way, but it seems to me to be a compact way to describe what is happening. I touch on this in an open-access paper: https://datascience.codata.org/articles/10.5334/dsj-2016-012/ (where 'ologs' are really thinly-disguised mathematical categories). Some of the papers of Spivak cited in the above paper go into the database-category theory link more explicitly.
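
The three descriptions can be compared on a toy table. The sketch below is only an illustration in plain Python: the two rows and their values are made up, and to_columns and to_functions are hypothetical helper names.

```python
# Row-based view: one mapping of data name -> value per row (values made up).
rows = [
    {"id": 1, "operation_xyz": "x,y,z"},
    {"id": 2, "operation_xyz": "-x,-y,-z"},
]


def to_columns(rows):
    """Column-based view: each data name maps to an equal-length list."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def to_functions(rows, key_names):
    """Functional view: each non-key data name maps key tuple -> value."""
    functions = {}
    for row in rows:
        key = tuple(row[name] for name in key_names)
        for name, value in row.items():
            if name not in key_names:
                functions.setdefault(name, {})[key] = value
    return functions


print(to_columns(rows))
print(to_functions(rows, ["id"]))
```

In the functional view, operation_xyz behaves like a function of the key id, and a data file supplies one concrete set of argument/value pairs for it.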

jesper-friis commented 3 years ago

It is clear that we have a lot to learn about CIF. I was not aware of the column and functional descriptions. I think it should be possible to describe the data schema with EMMO, but we need your help to come up with a suggestion. Regarding your points:

So what you are suggesting is to say that the category is a loop_ (or table). This is consistent with your comment in PR #14. The taxonomy would then be purely according to type. I think that makes sense.
You are right, not all cif:char are identifiers. It should simply be a subclass of emmo:String. In the next step, we should think about how to further restrict their range, e.g. by more fine-grained subclassing.
This is beyond my understanding of cif. Can list_uniqueness be used explicitly in a cif file?
Good point. In EMMO we have emmo:MeasurementResult with parts emmo:MeasuredQuantitativeProperty and emmo:MeasuredUncertainty. However, these are quantities (with a unit) while cif:numb is just two numbers. Does CIF schema provide a way to specify units? Or are they only specified in the comments?

jesper-friis commented 3 years ago

Here is an updated figure including the points above, except for capturing list_uniqueness

jamesrhester commented 3 years ago

This is beyond my understanding of cif. Can list_uniqueness be used explicitly in a cif file?

_list_uniqueness is specified in the dictionary to identify items that must have unique values for each row of a loop (so exactly the same as keys in relational databases). It does not appear in a cif file, and is mostly useful when validating.
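
A minimal sketch of such a validation step, assuming the loop has already been read into a list of row dictionaries. The function name duplicate_keys and the sample rows are hypothetical, and a missing key data name simply raises KeyError here.

```python
from collections import Counter


def duplicate_keys(rows, key_names):
    """Return the key tuples that occur in more than one row."""
    counts = Counter(tuple(row[name] for name in key_names) for row in rows)
    return [key for key, count in counts.items() if count > 1]


# Made-up loop contents; the second and third rows share the same key value.
rows = [
    {"_atom_site_label": "C1", "_atom_site_fract_x": "0.1"},
    {"_atom_site_label": "O1", "_atom_site_fract_x": "0.2"},
    {"_atom_site_label": "O1", "_atom_site_fract_x": "0.3"},
]
print(duplicate_keys(rows, ["_atom_site_label"]))
```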

Good point. In EMMO we have emmo:MeasurementResult with parts emmo:MeasuredQuantitativeProperty and emmo:MeasuredUncertainty. However, these are quantities (with a unit) while cif:numb is just two numbers. Does CIF schema provide a way to specify units? Or are they only specified in the comments?

Units are specified using the _units attribute in a definition.

One query about the updated schema, which looks cleaner: why is there any need for loop_ to be present and to have a type? What is lost by dropping it completely?

jesper-friis commented 3 years ago

Thank you for the clarification about _list_uniqueness.

I don't see any reference to _unit in the description of e.g. the _cell_length_a tag. Is it supposed to be somewhere else?

Regarding loop_. There is a difference between a loop or table, like SPACE_GROUP_SYMOP, and categories (rows), like _space_group_symop_[]. I therefore think that it makes sense to have a layer between cif:null and these in the taxonomy. Would it make sense to introduce CATEGORY as a subclass of cif:null which has all tags of the form _*_[] as subclasses? loop_ and CATEGORY would then be siblings in the taxonomy.

jamesrhester commented 3 years ago

One thing I think started to come out in the meeting was the distinction between syntax and underlying semantics. So I think the original EMMO goal might have been to capture the contents of a CIF syntax data file, in which case loop_ and rows and data names and data values are all syntactical objects. The data file contents are useless however unless you know what each data name refers to (you can't even reliably tell the difference between a numerical value and a string value from the syntax) and for that you need the CIF dictionaries.

The dictionaries aim to be syntax-agnostic, instead modelling data as relational tables. As I said, all that you need from a format for this to be possible is that a vector of values can be associated with some object or abstract location in data files, and an association between that location and a CIF dataname. Collection of these vectors into tables is specified by the CIF dictionaries, so in this sense CIF loops are redundant.

But back to the question at hand, a 'category' is a CIF semantic concept and a 'loop' is a CIF syntactic concept, so it seems strange to mix them in a single scheme. Maybe what we could do is clearly delineate syntax from semantics, perhaps by providing two ontologies: this description of CIF syntax in the 'language' sub-area of EMMO, and a separate branch of the emmo ontology that transcribes the contents of CIF dictionaries. So for the syntax, you'd say that a CIF data file contains data blocks, which contain key-value pairs and loops. Loops contain rows of values each associated with a data name. At this point all values are just strings, and data names are generic.

Maybe it would then be possible (don't know how EMMO works yet) to state that a data name in the syntax is a data name found in the dictionary ontology, thus making the link between syntax and semantics.

On the other hand, it looks like it might be possible to express the two separate ontologies (syntax and semantics) in a single ontology, but then I think we should distinguish carefully between syntactical types and semantic types. So a data name isa "data name appearing in a loop in a file" and also isa "thing defined in a dictionary" belonging to a "dictionary category" and its value has syntactic type "string" and semantic type "whatever the dictionary says".
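
As a way of picturing the split, the following Python sketch keeps a purely syntactic model (blocks, key-value pairs, loops, all values as strings) apart from a dictionary lookup that supplies the semantic type. The class names, the interpret helper and the tiny DICTIONARY extract are assumptions made for illustration; numb values with an appended su are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    """Syntactic loop: data names plus rows of raw string values."""
    names: list
    rows: list


@dataclass
class DataBlock:
    """Syntactic data block: key-value pairs and loops, all values strings."""
    name: str
    pairs: dict = field(default_factory=dict)
    loops: list = field(default_factory=list)


# Hypothetical extract of what a dictionary says about two data names.
DICTIONARY = {
    "_cell_length_a": {"category": "cell", "type": "numb"},
    "_space_group_symop_operation_xyz": {
        "category": "space_group_symop",
        "type": "char",
    },
}


def interpret(data_name, raw, dictionary=DICTIONARY):
    """Attach the semantic category and type to a raw string value."""
    definition = dictionary[data_name]
    value = float(raw) if definition["type"] == "numb" else raw
    return definition["category"], value


block = DataBlock(
    name="example",
    pairs={"_cell_length_a": "5.43"},
    loops=[Loop(["_space_group_symop_operation_xyz"], [["x,y,z"]])],
)
for data_name, raw in block.pairs.items():
    print(data_name, interpret(data_name, raw))
```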

jamesrhester commented 3 years ago

Some thoughts ahead of tonight's meeting - I think the simplest way of incorporating CIF dictionary information is to choose which of the attributes attached to a data name are relevant, and then they each become a box with an arrow pointing to them from the data name. In the simplest version, you care only about the type of the value, which is about what we currently have. Likewise any relevant attributes of the category can also be attached to the category name. I think this should solve at least the issue of how to map the dictionary + syntax.

jesper-friis commented 3 years ago

Here is yet another iteration, trying to go for simplicity as you suggest. Not sure whether we need the name, saw it in the ddl.

I removed the leading underscore because it seems that Protégé doesn't like them. We can discuss how to deal with them.

jamesrhester commented 3 years ago

So I think v3 looks neater and more flexible in terms of adding additional attributes to data names when/if they become important. Some comments:

A category is like a classification for a given table. I'm not sure the 'spatialDirectPart' is the correct way of describing this
Definitely no need to use _name, it is just giving a name to the category which is already explicit in the diagram
The old DDL1 dictionary files use a pretty primitive ontology language. The new DDLm dictionary files are more rigorous. It would be better to use terminology from them
Categories shouldn't have _units, only data names
Some 'char' data names have values that are restricted to a particular list. That seems to me to be relatively important information that is worth capturing.

jesper-friis commented 3 years ago

Here is v4 after a discussion with Emanuele and looking back on the comments by James: The focus is on the semantics and how CIF data is structured rather than on the syntax.

I have also created a new github branch cif-data-v4 with the following:

cif_top.ttl: a small ontology implementing everything above the dotted line in the figure
cif_example.ttl: an example ontology implementing the classes below the dotted line
generate_cif.py: a Python script that downloads cif_core.dic and generates cif_core.ttl from it

Some notes:

_space_group_symop_id, etc are now subclasses of DATA_ITEM. In this version there is no distinction between DATA_ITEM and _type. Would DATA_NAME be a better term for this?
Category is now a classification using isA. The _space_group_symop_TABLE hasSpatialPart only SPACE_GROUP_SYMOP restriction ensures that _space_group_symop table can only contain data items belonging to the SPACE_GROUP_SYMOP category. In the cif_example.ttl SPACE_GROUP_SYMOP is implemented as a disjoint union of its data items, hence the isA relations between _space_group_symop.id, etc... and the SPACE_GROUP_SYMOP are inferred by the reasoner.
We reintroduced the ROW that fell out in v3. As noted by James, table/loop_ constructs may not only be a list of rows, so I made ROW a subclass of DATA_ITEMS - please suggest a better naming.
_name and _units are now annotation properties of DATA_ITEM
Looking at DDLm - this is a great simplification. We have to dig into it and try to refine and subclass the DDL_CONCEPTS. Especially the types need to be updated. At the moment generate_cif.py does not assign types, since the types referred to in cif_core.dic are very different from what is shown here.
generate_cif.py utilises some features implemented in EMMO-python PR #144. Before it is merged into master and a new release is created you have to check out this branch.

jesper-friis commented 3 years ago

@jamesrhester, it seems that you might know PyCIFRW pretty well. How do we access information about type and unit for data names like cell.length_a when reading cif_core.dic? When trying the following:

>>> from CifFile import ReadCif
>>> cf = ReadCif('cif_core.dic')
>>> cf.get_children('core_dic')['cell.length_a'].items()
[('_definition.id', '_cell.length_a'),
 ('_alias.definition_id', ['_cell_length_a']),
 ('_import.get', [{'save': 'cell_length', 'file': 'templ_attr.cif'}]),
 ('_name.category_id', 'cell'),
 ('_name.object_id', 'length_a')]

it seems that we somehow are supposed to import templ_attr.cif and obtain the missing information from there. Is that something PyCIFRW can do for us?

jamesrhester commented 3 years ago

Indeed I do know it pretty well! See the original paper. The CifDic object provides easy access to dictionaries and will automatically import the auxiliary files like "templ_attr.cif" if they are in the same directory as the dictionary. So:

In [1]: from CifFile import CifDic

In [2]: p = CifDic("/home/jrh/COMCIFS/cif_core/cif_core.dic",do_dREL=False)
# lots of output edited out...
In [3]: p["_space_group_symop.id"]["_type.contents"]                             
Out[3]: 'Integer'

Attributes of CIF categories can also be found in the same way:

In [4]: p["space_group_symop"]["_category_key.name"]
Out[4]: ['_space_group_symop.id']

Be sure to use the do_dREL=False option as otherwise PyCIFRW will do a lot of unnecessary extra work. I think I should change the default to False.

Note that if an attribute appears in a loop (even if it only has one row) the return value will be an array as in the last example.

Most (all?) current small molecule CIF data files use the old-style data names that don't have a period character . in the data name. These will be available in the _alias.definition_id loop:

In [5]: p["_space_group_symop.operation_xyz"]["_alias.definition_id"]
Out[5]: 
['_space_group_symop_operation_xyz',
 '_symmetry_equiv.pos_as_xyz',
 '_symmetry_equiv_pos_as_xyz']
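
Going the other way, from an old-style name found in a data file to the current definition, can be sketched with a reverse index. This is plain Python over alias lists like the one above, not a PyCIFRW feature; build_alias_index is a hypothetical helper, and names are lower-cased because CIF data names are case-insensitive.

```python
def build_alias_index(aliases_by_name):
    """Map every alias (and the name itself) to the canonical data name."""
    index = {}
    for canonical, aliases in aliases_by_name.items():
        index[canonical.lower()] = canonical
        for alias in aliases:
            index[alias.lower()] = canonical
    return index


# Alias list copied from the CifDic output shown above.
aliases_by_name = {
    "_space_group_symop.operation_xyz": [
        "_space_group_symop_operation_xyz",
        "_symmetry_equiv.pos_as_xyz",
        "_symmetry_equiv_pos_as_xyz",
    ],
}
index = build_alias_index(aliases_by_name)
print(index["_symmetry_equiv_pos_as_xyz"])
```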

jesper-friis commented 3 years ago

Thank you James. CifDic seems to be exactly what I was looking for. Wonderful to work with the main developer!

jesper-friis commented 3 years ago

Do you have any comments about v4 above: https://github.com/emmo-repo/domain-crystallography/issues/16#issuecomment-803586344?

jamesrhester commented 3 years ago

v4 looks pretty good. Some thoughts:

In what sense is a FUNCTION a Data_Items? (And what is FUNCTION here?)
I agree that the relation ROW hasSpatialDirectPart DataItem is sensible. Likewise a COLUMN can be split into individual data name - data value pairs so the COLUMN hasSpatialDirectPart DataItem is sensible too.
For interpretation of any given data item it is important to know the values of the "key" data names in the row that it came from. For example, an atomic position by itself is meaningless without the name of the atom that appears in the same row. I'm not sure if this should be captured.
SPACE_GROUP_SYMOP here is what we discussed at the last meeting. I thought this could be an "abstract data name belonging to the SPACE_GROUP_SYMOP category", and you could imagine that the information contained in the CIF category definition is just listing the generic properties that are true for all data names in that category, most importantly the key data names (so analogous to a superclass in object-oriented programming). However, I don't see how the relationship space_group_symop_id isA SPACE_GROUP_SYMOP isA CATEGORY can possibly make sense as it implies that space_group_symop_id isA CATEGORY. I do see the issue with the relation SPACE_GROUP_SYMOP isA DATA_ITEM, because SPACE_GROUP_SYMOP doesn't have units or a value, but I don't have a solution at the moment. It seems the missing information is that a data item "belongs to" a single category or perhaps a category hasSpatialPart "data item"??

jesper-friis commented 3 years ago

The subclassing into FUNCTION, ROW and COLUMN is based on one of your comments above:

As a comment: when describing CIF data you are describing a relational model. Some options for relational models include: a row-based description as here; a column-based description, in which case the values attached to the data names are always equal-length arrays of values; and a functional description.

I think this is one of the things we should discuss further on Friday.
Great
It is easy to require that all atom_site_ROW must have exactly one atom_site.label as spatial direct part. It is also possible to make more complex logical constructs like requiring that a row must have a label if it also has a position. The question is whether this kind of information is expressed in the cif_core dictionary? If not, the simplest would be to create a new main turtle file that imports the generated file and add such conditions. Let's take that in the next step.
We can definitely discuss this further, but if we want that the properties specified for a category should apply to all data items of that category, then subclassing is the obvious way to go. It is also the standard way in ontologies to express categorisation. What we will end up with is that a _space_group_symop.id is both a DATA_ITEM and a CATEGORY (that is the intersection of them).

We can emphasise that _space_group_symop.id is a DATA_ITEM by explicitly asserting that in the ontology. By defining SPACE_GROUP_SYMOP as the disjoint union of _space_group_symop.id, _space_group_symop.R, ... it is only after reasoning that we will see that _space_group_symop.id, ... are also subclasses of CATEGORY.

Regarding your comment about aliases. They are already included as skos:altLabel in the generated cif_core.ttl file.

jesper-friis commented 3 years ago

By the way, here is the generated cif_core turtle file if you want to explore it in Protege.

cif_core.zip

jamesrhester commented 3 years ago

Regarding FUNCTION: that comment was theoretical and not CIF-specific. I thought it might provoke some ideas, but it is probably not relevant now. Just for interest, what I had in mind was that instead of the relational model it is possible to conceptualise tables as follows: say you have a table with column headings x,y,f,g,h , where x,y are the "key columns", that is, the values in these columns can be used to uniquely identify ("key into") a row. In this example, f,g,h can be considered functions of x and y, and the values in each column are just the values of those functions for the x,y in each row. x,y,f,g,h are all data names in the CIF world i.e. a data name isA function name.
Good
The information about key data names is in the category definition under _category_key.name. The logic is that all rows must include values for these data names, and the combination of their values must be unique. Categories with only one row do not normally have these specified.
Sounds fine. As long as the logical reasoning works out properly!

jesper-friis commented 3 years ago

Thank you @jamesrhester. I made yet another update where I introduced type contents and containers from DDLm.

I have also updated the script for generating the cif_core ontology. Francesca has released version 1.0.0 of EMMO Python, so generating the ontology can now be done with the following steps:

$ pip install PyCifRW EMMO
$ git fetch origin cif-data-v4
$ git checkout cif-data-v4
$ python generate_cif.py

Some notes and questions:

The type contents are subclasses of Single, while the List, Array and Matrix containers have Single as spatial parts.
Does it make sense to generate more specific types as needed, like the Shape3x3RealMatrix and Shape3RealMatrix in the figure? The alternative would simply be to say that space_group_symop.R is a Matrix. In that case we should include _type as an annotation attribute so that the fact that it is a real matrix is not lost (I already added _dimension as an annotation attribute).
With the containers, I have a feeling that we are almost redoing the TABLE, but I am not sure since the CIF files I have been working with only use data names that are of type Single. How are containers like Array, List, Matrix and Table serialised in CIF?
Would it make sense to change the

ROW hasSpatialDirectPart some DATA_VALUE
COLUMN hasSpatialDirectPart some DATA_VALUE

to

ROW hasSpatialDirectPart some Single
COLUMN hasSpatialDirectPart some Single

and replace 'CIF_DATA_BLOCK hasSpatialDirectPart some TABLE' with 'CIF_DATA_BLOCK hasSpatialDirectPart some TABLE or DATA_VALUE'?
There are both a container type and a content type named "Implied". I am not sure how to use it, so for now I omitted the Implied container. Is that ok?
I think that the TABLE, DATAITEMS, ROW, COLUMN part is now the weakest part of our ontologisation. Firstly, these classes were invented by me based on my unsteady understanding of the loop construct in CIF. I think they should be more closely named after DDLm. Secondly, DATA_ITEMS is a very bad name (best practice for naming classes in an ontology is that they should be singular nouns). DATA_GROUP would be better, but I guess that the CIF community already has a name for the underlying concept that we should use. Since "Table" is already a name for a container type, I also think it is a good idea to find another name for that. Thoughts?
Where is COLUMN used? Currently it is completely unused by the script. The script uses whether a data item defines _definition.class to determine whether it is a category or data value. Is that the right way to do it? Can we use the value of the _definition.class for something?
Regarding point 3 above, we can enforce that rows must include the data values listed under _category_key.name by changing the max 1 cardinality restriction to exactly 1. Should we implement that? This does not cover the requirement that their values must be unique.

jamesrhester commented 3 years ago

The latest ontology looks good and I don't have any specific comments beyond what is below.

The type contents are subclasses of Single, while the List, Array and Matrix containers have Single as spatial parts.

That makes sense

Does it make sense to generate more specific types as needed, like the Shape3x3RealMatrix and Shape3RealMatrix in the figure? The alternative would simply be to say that space_group_symop.R is a Matrix. In that case we should include _type as an annotation attribute so that the fact that it is a real matrix is not lost (I already added _dimension as an annotation attribute).

In practice there are not that many distinct matrix types in the core dictionary so you could generate specific types. Composite and incommensurate structures, which eventually you'll want to capture, can have a bit more variety in dimensions. I don't think there is a right answer here. I do like the way the ontology looks with separate types for each type of matrix.
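
Generating the specific type names could be as simple as the sketch below. It assumes the dimension is available as a string such as "[3,3]" and the content type as a word such as "Real"; matrix_class_name is a hypothetical helper, and only the two class names shown in the figure are known targets.

```python
import re


def matrix_class_name(dimension, contents):
    """Build a class name such as Shape3x3RealMatrix from '[3,3]' and 'Real'."""
    sizes = re.findall(r"\d+", dimension)
    if not sizes:
        raise ValueError(f"no sizes found in dimension {dimension!r}")
    return f"Shape{'x'.join(sizes)}{contents}Matrix"


print(matrix_class_name("[3,3]", "Real"))
print(matrix_class_name("[3]", "Real"))
```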

With the containers, I have a feeling that we are almost redoing the TABLE, but I am not sure since the CIF files I have been working with only use data names that are of type Single. How are containers like Array, List, Matrix and Table serialised in CIF?

All of the compound data types are "new" and very rarely if ever appear in data files currently, as they require use of the new CIF syntax that supports them.

Syntactically, an array/matrix/list looks like: [1 2 3 4 5]. A list is an array where the type of the entries can differ. I am keen to remove these from the dictionary as the differing types imply differing meanings and therefore that they should have distinct data names. A matrix is a 1- or 2-dimensional array that follows the rules of matrix multiplication. A 1-dimensional Matrix is a Vector. From the point of view of syntax they are identical to arrays of arrays.

A table is written {"a":1 "b":2}. If you search for _import.get in cif_core.dic you will see an example of a table inside a single-element list. A table is definitely not to be confused with a CIF loop.
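
For readers unfamiliar with this syntax, here is a deliberately small Python sketch that reads the two forms shown above into nested Python lists and dicts. It is not a full CIF2 parser: it only understands brackets, braces, quoted strings and whitespace-separated bare values, leaves every value as a string, and cannot handle bare values containing a colon. The function names are hypothetical.

```python
import re

# Quoted strings, structural characters, or bare whitespace-delimited values.
_TOKENS = re.compile(r'''"[^"]*"|'[^']*'|[\[\]{}:]|[^\s\[\]{}:"']+''')


def _parse_value(tokens, pos):
    """Parse one value starting at tokens[pos]; return (value, next_pos)."""
    token = tokens[pos]
    if token == "[":
        items, pos = [], pos + 1
        while tokens[pos] != "]":
            item, pos = _parse_value(tokens, pos)
            items.append(item)
        return items, pos + 1
    if token == "{":
        table, pos = {}, pos + 1
        while tokens[pos] != "}":
            key = tokens[pos]
            if key[0] not in "\"'" or tokens[pos + 1] != ":":
                raise ValueError("table keys must be quoted and followed by ':'")
            value, pos = _parse_value(tokens, pos + 2)
            table[key[1:-1]] = value
        return table, pos + 1
    if token[0] in "\"'":
        return token[1:-1], pos + 1  # strip the surrounding quotes
    return token, pos + 1  # bare values are kept as strings


def parse_compound(text):
    """Parse a single list or table value written in the syntax above."""
    tokens = _TOKENS.findall(text)
    value, pos = _parse_value(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected trailing content")
    return value


print(parse_compound("[1 2 3 4 5]"))
print(parse_compound('{"a":1 "b":2}'))
print(parse_compound('[{"save":"cell_length" "file":"templ_attr.cif"}]'))
```

The last call mirrors the _import.get value from cif_core.dic: a table inside a single-element list.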

Would it make sense to change the
ROW hasSpatialDirectPart some DATA_VALUE
COLUMN hasSpatialDirectPart some DATA_VALUE
to
ROW hasSpatialDirectPart some Single
COLUMN hasSpatialDirectPart some Single
and replace 'CIF_DATA_BLOCK hasSpatialDirectPart some TABLE' with 'CIF_DATA_BLOCK hasSpatialDirectPart some TABLE or DATA_VALUE'?

Definitely not. A table constructed as {"key":value ...} is a type of data value that may appear in the columns of a CIF loop. In general compound data types may appear as entries in a CIF loop.

There are both a container type and a content type named "Implied". I am not sure how to use it, so for now I omitted the Implied container. Is that ok?

Yes it is. This is almost never used in domain dictionaries, and where it does occur it means we haven't cleaned up properly. It is mainly used in the dictionary describing the attributes themselves where the type of an attribute depends on the type of another attribute.

I think that the TABLE, DATAITEMS, ROW, COLUMN part is now the weakest part of our ontologisation. Firstly, these classes were invented by me based on my unsteady understanding of the loop construct in CIF. I think they should be more closely named after DDLm. Secondly, DATA_ITEMS is a very bad name (best practice for naming classes in an ontology is that they should be singular nouns). DATA_GROUP would be better, but I guess that the CIF community already has a name for the underlying concept that we should use. Since "Table" is already a name for a container type, I also think it is a good idea to find another name for that. Thoughts?

I think the general idea behind TABLE, DATA_ITEMS, ROW, COLUMN is correct. In terms of naming I don't think we have any particular concept, as we are table-oriented. Being table-oriented means that we have column headings (the data names) and columns that are lists of values, or alternatively we have rows, and for each row we can associate a data name with a data value. You may wish to rename TABLE to LOOP. If you adopt the row-based view, then for each row there is a set (i.e. unordered) of data items, where each data item is a data name - data value pair. Note that each row is completely independent of every other row. The order of rows can be changed with absolutely no change in meaning - don't know if that helps.

Where is COLUMN used? Currently it is completely unused by the script. The script uses whether a data item defines _definition.class to determine whether it is a category or data value. Is that the right way to do it? Can we use the value of the _definition.class for something?

_definition.class should be useful. If it is blank or Datum, it is a data name definition. If it is Loop, more than one row can be present in the loop corresponding to this category. If it is Set, only a single row can be present for this category (and the data names can therefore appear as key-value pairs in the file instead of a syntactic loop - but I think that information doesn't need to be captured).
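
The rule above can be written as a small lookup. This is only a sketch of the three cases just described; the function name and the returned labels are hypothetical, and any other _definition.class value is reported as "other" rather than interpreted.

```python
def classify_definition(definition_class):
    """Classify a definition from its _definition.class value (or None)."""
    value = (definition_class or "").strip()
    if value in ("", "Datum"):
        return "data name"
    if value == "Loop":
        return "category, more than one row allowed"
    if value == "Set":
        return "category, single row only"
    return "other"


for value in (None, "Datum", "Loop", "Set"):
    print(repr(value), "->", classify_definition(value))
```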

COLUMN as a concept is useful if you consider that the value attached to a data name in a file is actually a whole column of values. This is efficient for programming but perhaps sets of rows makes more ontological sense. The dictionary does not and cannot make use of COLUMN anywhere so you can leave it out if you want.

Regarding point 3 above, we can enforce that rows must include the data values listed under _category_key.name by changing the max 1 cardinality restriction to exactly 1. Should we implement that? This does not cover the requirement that their values must be unique.

Yes, that would be OK, as the data names listed under _category_key.name must appear in the loop corresponding to their category if there is more than one row. However, note that if there is only one row they can take arbitrary values if absent. Such a situation doesn't occur in practice for inorganic structures, so could be ignored, but the macromolecular people (who also use CIF) make some use of this shortcut.

jesper-friis commented 3 years ago

Thank you James for useful comments. I have now implemented the specific data types in point 2 above. I have to return to the other points.

Ideally, we should also link the array elements to their corresponding arrays, like explicitly stating that atom_sites_Cartn_transform.mat_11 is the first component of atom_sites_Cartn_transform.matrix. However, as far as I can see, there is no such link in the cif_core dictionary, so I guess this would have to be based on heuristics.
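
One possible heuristic, sketched in Python: treat a name ending in an underscore and two digits as a matrix element, and look for exactly one non-element name in the same category whose object id starts with the same stem. This naming pattern is an assumption, not something the dictionary guarantees, and both function names and the sample name list are illustrative.

```python
import re

# category.stem_RC, e.g. atom_sites_Cartn_transform.mat_11 (assumed pattern)
_ELEMENT = re.compile(r"^(?P<category>[^.]+)\.(?P<stem>.+?)_(?P<row>\d)(?P<col>\d)$")


def matrix_element(data_name):
    """Return (category, stem, row, col) if the name looks like an element."""
    match = _ELEMENT.match(data_name)
    if match is None:
        return None
    return match["category"], match["stem"], int(match["row"]), int(match["col"])


def guess_parent(element_name, all_names):
    """Guess the matrix data name an element belongs to, or None if unclear."""
    parsed = matrix_element(element_name)
    if parsed is None:
        return None
    category, stem, _, _ = parsed
    candidates = [
        name for name in all_names
        if name.startswith(f"{category}.{stem}") and matrix_element(name) is None
    ]
    return candidates[0] if len(candidates) == 1 else None


names = [
    "atom_sites_Cartn_transform.mat_11",
    "atom_sites_Cartn_transform.mat_12",
    "atom_sites_Cartn_transform.matrix",
    "atom_sites_Cartn_transform.vector",
]
print(matrix_element(names[0]))
print(guess_parent(names[0], names))
```

Returning None when zero or several candidates match keeps the heuristic from silently asserting a wrong link.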

I have also started to connect the generated cif_core ontology with our original EMMO crystallography domain ontology.

jamesrhester commented 3 years ago

That's right, there is no attribute for explicitly linking the elements of a matrix and the matrix itself. The precise relationship is expressed in the dREL code for constructing the matrix from the elements (e.g. here)

CasperWA commented 3 years ago

While this issue has been closed, it can be used as reference for further changes, but it would be better to open specific GitHub issues pertaining to the subject of the suggested change.