COMCIFS / cif_core

The IUCr CIF core dictionary

The database category is not that extensible #76

Closed jamesrhester closed 4 years ago

jamesrhester commented 6 years ago

The core DATABASE and DATABASE_CODE categories are intended to provide a single entry per datablock, usually inserted into the datablock when it is ingested by the database and then included when provided by the database. A better design would be to have the database name as an enumerated data name, which allows simple extension to other databases not included in the current list.

Additionally, a structure may be present in multiple repositories. While including information for multiple data blocks is somewhat difficult (and unnecessary) in the single crystal context, in some contexts (for example topological databases) the particular structure may be unambiguously present in multiple collections. Therefore, this category should ideally be looped. I suggest a new category, leaving the current category untouched because it is to be considered "authoritative":

loop_
 _database_occurrence.id    #identifier
 _database_occurrence.name        # CSD,Bilbao,COD,ICSD etc.
 _database_occurrence.reference  # database-specific code

merkys commented 6 years ago

I support this extension.

vaitkus commented 6 years ago

This would be a very beneficial addition to the dictionary. However, I would rather define the items as describing a related entry in a specific database than treat them as occurrences of the same entry in different databases. What I mean is that CIF files derived from the same publication will be totally different in the COD, Bilbao and the CSD due to the different data curation strategies (e.g. the magnetic properties of the structures are exquisitely well described in Bilbao, but might fall a bit short in other crystallographic databases). What is more, it may sometimes be desirable to establish relationships to non-crystallographic databases describing the crystallized compound.

Actually, we have developed a similar looped category in our COD/ROD dictionaries:

loop_
_rod_related_entry.id          #unique identifier
_rod_related_entry.database    #database name
_rod_related_entry.code        #unique identifier in the related database
_rod_related_entry.description #a human readable description of the relationship
_rod_related_entry.uri         #a URI pointing to the related resource

Definitions of the data items mentioned above can be found in the 'cif_rod' DDLm dictionary.

Would something similar be useful in the core dictionary?

jamesrhester commented 6 years ago

Sorry for not commenting for so long. I agree that "related" is better. I wonder if an additional data name could be used to capture machine-readable relationships, e.g. _database_related_entry.relation could be "identical", "common source", etc.
If we are agreed, I will put a proposal to the core_cif mailing list.

vaitkus commented 6 years ago

Adding a machine-readable _database_related_entry.relation data item is definitely a good idea. Actually, we implemented something similar in the 'cif_rod' dictionary version 0.1.2 just a few days prior to your post, and even chose a very similar data name (_rod_related_entry.relation).

Are you thinking about an enumerator based approach or something completely different?

In our case, the data item was implemented as an enumerator with a set of values that fit the specific needs of the Raman Open Database (xrd_cell_best_match, xrd_cell_match, other). These values clearly do not belong in the core dictionary, and I imagine that the need to define database-specific relationships will be quite common among other users as well. One solution to this problem is for the prefix holder to define a prefixed version of _database_related_entry.relation belonging to the DATABASE_RELATED category and list the desired values. For example:

loop_
 _database_related.id         # identifier
 _database_related.name       # CSD,Bilbao,COD,ICSD etc.
 _database_related.reference  # database-specific code
 _rod_database_related.relation
1 COD 1000000 xrd_cell_best_match

Are definitions like that allowed under DDLm?

jamesrhester commented 6 years ago

Yes, I was thinking about an enumerator based approach. I don't think your suggested database-specific relationship will work in the CIF framework. However, it is always open to particular dictionaries to define an additional dataname for the _database_related category that would capture their specific additional relationships.

As I think we are agreed, I will put a proposal forward to the cif_core group.

vaitkus commented 6 years ago

"However, it is always open to particular dictionaries to define an additional data name for the _database_related category that would capture their specific additional relationships."

Great, that is exactly the approach I was suggesting: those database-specific relationships should only be defined in the dictionaries maintained by the prefix holders and not in the core dictionary.

"As I think we are agreed, I will put a proposal forward to the cif_core group."

Wonderful, thank you.

jamesrhester commented 6 years ago

The core CIF group were very quiet on this one, so I am now going to go ahead and prepare formal definitions.

jamesrhester commented 6 years ago

Please see below some draft definitions for a new database_related category. If any databases have been left off the initial list below, feel free to suggest additions. Also, if there are some reasonably generic relations that I have omitted for _database_related.relation, they could also be added.

Note that I have chosen not to make these datanames aliases of the DATABASE_2 datanames in mmCIF, as the new category has a different key.

#
#  Draft definitions for a new DATABASE_RELATED category
#

save_DATABASE_RELATED
_definition.id          DATABASE_RELATED
_definition.class       Loop
_definition.scope       Category
_definition.update      2018-06-29
_description.text
;

    A category of items recording entries in databases that describe
    the same or related data. Databases wishing to insert their own
    canonical codes when archiving and delivering data blocks should
    use items from the DATABASE category.

;
_name.category_id       PUBLICATION
_name.object_id         DATABASE_RELATED
_category_key.name      '_database_related.id'
save_

save_database_related.id
_definition.id          '_database_related.id'
_definition.update      2018-06-29
_description.text
;
       An identifier for this database reference
;
_name.category_id       database_related
_name.object_id         id
_type.purpose           Key
_type.source            Recorded
_type.container         Single
_type.contents          Text
save_

save_database_related.database_id
_definition.id          '_database_related.database_id'
_definition.update      2018-06-29
_description.text
;
       An identifier for the database that contains the
       related dataset.
;
_name.category_id       database_related
_name.object_id         database_id
_type.purpose           State
_type.source            Recorded
_type.container         Single
_type.contents          Text
_import.get [{'save':database_list 'file':templ_enum.cif}]
save_

save_database_related.database_code
_definition.id          '_database_related.database_code'
_definition.update      2018-06-29
_description.text
;
       The code used by the database referred to in
       _database_related.database_id to identify the
       related dataset.
;
_name.category_id       database_related
_name.object_id         database_code
_type.purpose           Encode
_type.source            Recorded
_type.container         Single
_type.contents          Text

save_

save_database_related.relation
_definition.id          '_database_related.relation'
_definition.update      2018-06-29
_description.text
;
       The general relationship of the data in the data block
       to the dataset referred to in the database.
;
_name.category_id       database_related
_name.object_id         relation
_type.purpose           State
_type.source            Recorded
_type.container         Single
_type.contents          Text
loop_
   _enumeration_set.state
   _enumeration_set.detail
   Identical           'The dataset contents are identical'
   Subset              'The dataset contents are a proper subset of the contents of the data block'
   Superset            'The dataset contents include the contents of the data block'
   Derived             'The dataset contents are derivable from the contents of the data block'
   Common              'The dataset contents share a common source'
save_

save_database_related.special_details
_definition.id          '_database_related.special_details'
_definition.update      2018-06-29
_description.text                      
;
    Information about the external dataset and relationship not encoded
    elsewhere.
;
_name.category_id                       database_related
_name.object_id                         special_details
_type.purpose                           Describe
_type.source                            Recorded
_type.container                         Single
_type.contents                          Text

save_

#
# Contents to be added to templ_enum.cif listing database codes
#

save_database_list
loop_
    _enumeration_set.state
    _enumeration_set.detail
    CAS          'Chemical Abstracts'
    COD          'Crystallographic Open Database'
    CSD          'Cambridge Structural Database'
    ICSD         'Inorganic Crystal Structure Database'
    MDF          'Metals Data File'
    NDB          'Nucleic Acid Database'
    PDB          'Protein Data Bank'
    PDF          'Powder Diffraction File (JCPDS/ICDD)'
    RCSB         'Research Collaboratory for Structural Bioinformatics'
    EBI          'European Bioinformatics Institute'
save_
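
The enumerations in the draft above lend themselves to a simple consistency check. What follows is a minimal sketch in Python, assuming the rows of a _database_related loop have already been parsed into dictionaries. The function name check_rows and the dictionary layout are hypothetical and not part of any CIF library; the allowed values are copied from the draft, and values are compared exactly, which is a simplification.

```python
# Allowed values copied from the draft enumerations above.
DATABASE_IDS = {"CAS", "COD", "CSD", "ICSD", "MDF", "NDB", "PDB", "PDF", "RCSB", "EBI"}
RELATIONS = {"Identical", "Subset", "Superset", "Derived", "Common"}


def check_rows(rows):
    """Return a list of problems found in parsed _database_related rows.

    Each row is a dict keyed by object id: 'id', 'database_id',
    'database_code', 'relation' (hypothetical layout, for illustration only).
    """
    problems = []
    seen_ids = set()
    for row in rows:
        key = row.get("id")
        # _database_related.id is the category key, so it must be unique.
        if key in seen_ids:
            problems.append(f"duplicate id: {key}")
        seen_ids.add(key)
        if row.get("database_id") not in DATABASE_IDS:
            problems.append(f"row {key}: unknown database_id {row.get('database_id')!r}")
        if row.get("relation") not in RELATIONS:
            problems.append(f"row {key}: unknown relation {row.get('relation')!r}")
    return problems
```

A database-specific value such as xrd_cell_best_match would be reported here, which matches the intent that such values live in a separate data name defined by the prefix holder.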

vaitkus commented 6 years ago

Seems great at first glance. A few remarks, though:

1) The COD is expanded as "Crystallography Open Database", not "Crystallographic Open Database". It might not embody the best English grammar, but that's the historic name.
2) A usage example, e.g. placed in the category definition, would be useful.
3) The data names _database_related.database_id and _database_related.database_code look very similar (the distinction between an id and a code is not initially clear). Renaming _database_related.database_code to something that explicitly refers to the database entry (e.g. _database_related.entry_code or _database_related.database_entry_code) might help. Of course, this is just a suggestion.

jamesrhester commented 5 years ago

The latest version contains these definitions. I have chosen 'entry_code' as the alternative. Is anybody able to produce a realistic example to include in the category definition?
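
Pending a realistic example, the shape of such a loop can at least be sketched. The Python below only illustrates the layout and is not a proposed dictionary example: the row reuses the COD code 1000000 from the earlier comment, the relation value Common is picked arbitrarily, and the helper name format_loop is hypothetical. Values containing whitespace would need CIF quoting, which this sketch does not attempt.

```python
# Data names from the draft, with 'entry_code' replacing 'database_code'.
DATA_NAMES = [
    "_database_related.id",
    "_database_related.database_id",
    "_database_related.entry_code",
    "_database_related.relation",
]


def format_loop(rows):
    """Format rows (sequences of whitespace-free strings) as a CIF loop_."""
    lines = ["loop_"]
    lines += [f"  {name}" for name in DATA_NAMES]
    for row in rows:
        # Every packet in a loop needs exactly one value per data name.
        if len(row) != len(DATA_NAMES):
            raise ValueError("each row needs one value per data name")
        lines.append("  " + "  ".join(row))
    return "\n".join(lines)


# Illustrative row only; not taken from a real data block.
print(format_loop([("1", "COD", "1000000", "Common")]))
```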

jamesrhester commented 4 years ago

I will replace this issue with an enhancement issue to provide an example for the category definition.