Closed: jamesrhester closed this issue 4 years ago
I support this extension.
This would be a very beneficial addition to the dictionary. However, I would prefer to redefine the entries as being related to an entry in a specific database, rather than treating them as occurrences of the same entry in different databases. What I mean is that CIF files derived from the same publication will be totally different in the COD, Bilbao and the CSD due to the different data curation strategies (e.g. the magnetic properties of the structures are exquisitely well described in Bilbao, but might fall a bit short in other crystallographic databases). What is more, it may sometimes be desirable to establish relationships to non-crystallographic databases describing the crystallized compound.
Actually, we have developed a similar looped category in our COD/ROD dictionaries:
loop_
_rod_related_entry.id #unique identifier
_rod_related_entry.database #database name
_rod_related_entry.code #unique identifier in the related database
_rod_related_entry.description #a human-readable description of the relationship
_rod_related_entry.uri #a URI pointing to the related resource
Definitions of the data items mentioned above can be found in the 'cif_rod' DDLm dictionary.
Would something similar be useful in the core dictionary?
Sorry for not commenting for so long. I agree that 'related' is better. I wonder if an additional data name could be used to capture machine-readable relationships, e.g. _database_related_entry.relation could be 'identical', 'common source', etc.
If we are agreed, I will put a proposal to the core_cif mailing list.
Adding a machine-readable _database_related_entry.relation data item is definitely a good idea. Actually, we implemented something similar in the 'cif_rod' dictionary version 0.1.2 just a few days prior to Your post -- and even chose a very similar data name (_rod_related_entry.relation).
Are You thinking about an enumerator based approach or something completely different?
In our case, the data item was implemented as an enumerator with a set of values that fit the specific needs of the Raman Open Database (xrd_cell_best_match, xrd_cell_match, other). These values clearly do not belong in the core dictionary, and I imagine that the need to define database-specific relationships will be quite common among other users as well. One solution to this problem is for the prefix holder to define a prefixed version of _database_related_entry.relation belonging to the DATABASE_RELATED category and list the desired values. For example:
loop_
_database_related.id # identifier
_database_related.name # CSD,Bilbao,COD,ICSD etc.
_database_related.reference # database-specific code
_rod_database_related.relation
1 COD 1000000 xrd_cell_best_match
Are definitions like that allowed under DDLm?
Yes, I was thinking about an enumerator-based approach. I don't think your suggested database-specific relationship will work in the CIF framework. However, it is always open to particular dictionaries to define an additional data name for the _database_related category that would capture their specific additional relationships.
As I think we are agreed, I will put a proposal forward to the cif_core group.
> However, it is always open to particular dictionaries to define an additional data name for the _database_related category that would capture their specific additional relationships.
Great, that is exactly the approach I was suggesting -- those database-specific relationships should only be defined in the dictionaries maintained by the prefix holders and not in the core dictionary.
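The split agreed here can be sketched roughly in Python: generic relations sit in one enumeration owned by the core dictionary, while a prefix holder keeps its own enumeration under its own data name. This is only an illustration, not a proposed implementation. The ROD values are the ones quoted earlier in this thread, the core values are just the 'identical' / 'common source' examples floated above, and the registry and the function `invalid_items` are hypothetical.

```python
# Hypothetical registry: data name -> allowed values.
# The core values are only the examples floated in this thread.
ENUMERATIONS = {
    "_database_related.relation": {"identical", "common source"},
    "_rod_database_related.relation": {
        "xrd_cell_best_match",
        "xrd_cell_match",
        "other",
    },
}


def invalid_items(row):
    """Return the data names in `row` whose value is not in its enumeration.

    Data names without a registered enumeration are not checked.
    """
    return sorted(
        name
        for name, value in row.items()
        if name in ENUMERATIONS and value not in ENUMERATIONS[name]
    )


# One loop row carrying both the generic and the ROD-specific relation.
row = {
    "_database_related.id": "1",
    "_database_related.relation": "common source",
    "_rod_database_related.relation": "xrd_cell_best_match",
}
print(invalid_items(row))
```

With this layout a ROD-specific value such as xrd_cell_match is rejected if it is placed in the core data name, which is the separation described above.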
> As I think we are agreed, I will put a proposal forward to the cif_core group.
Wonderful, thank You.
The core CIF group were very quiet on this one, so I am now going to go ahead and prepare formal definitions.
Please see below some draft definitions for a new database_related category. If any databases have been left off the initial list below, feel free to suggest additions. Also, if there are some reasonably generic relations that I have omitted for _database_related.relation, they could also be added.
Note that I have chosen not to make these datanames aliases of the DATABASE_2 datanames in mmCIF, as the new category has a different key.
#
# Draft definitions for a new DATABASE_RELATED category
#
save_DATABASE_RELATED
_definition.id DATABASE_RELATED
_definition.class Loop
_definition.scope Category
_definition.update 2018-06-29
_description.text
;
A category of items recording entries in databases that describe
the same or related data. Databases wishing to insert their own
canonical codes when archiving and delivering data blocks should
use items from the DATABASE category.
;
_name.category_id PUBLICATION
_name.object_id DATABASE_RELATED
_category_key.name '_database_related.id'
save_
save_database_related.id
_definition.id '_database_related.id'
_definition.update 2018-06-29
_description.text
;
An identifier for this database reference
;
_name.category_id database_related
_name.object_id id
_type.purpose Key
_type.source Recorded
_type.container Single
_type.contents Text
save_
save_database_related.database_id
_definition.id '_database_related.database_id'
_definition.update 2018-06-29
_description.text
;
An identifier for the database that contains the
related dataset.
;
_name.category_id database_related
_name.object_id database_id
_type.purpose State
_type.source Recorded
_type.container Single
_type.contents Text
_import.get [{'save':database_list 'file':templ_enum.cif}]
save_
save_database_related.database_code
_definition.id '_database_related.database_code'
_definition.update 2018-06-29
_description.text
;
The code used by the database referred to in
_database_related.database_id to identify the
related dataset.
;
_name.category_id database_related
_name.object_id database_code
_type.purpose Encode
_type.source Recorded
_type.container Single
_type.contents Text
save_
save_database_related.relation
_definition.id '_database_related.relation'
_definition.update 2018-06-29
_description.text
;
The general relationship of the data in the data block
to the dataset referred to in the database.
;
_name.category_id database_related
_name.object_id relation
_type.purpose State
_type.source Recorded
_type.container Single
_type.contents Text
loop_
_enumeration_set.state
_enumeration_set.detail
Identical 'The dataset contents are identical'
Subset 'The dataset contents are a proper subset of the contents of the data block'
Superset 'The dataset contents include the contents of the data block'
Derived 'The dataset contents are derivable from the contents of the data block'
Common 'The dataset contents share a common source'
save_
save_database_related.special_details
_definition.id '_database_related.special_details'
_definition.update 2018-06-29
_description.text
;
Information about the external dataset and relationship not encoded
elsewhere.
;
_name.category_id database_related
_name.object_id special_details
_type.purpose Describe
_type.source Recorded
_type.container Single
_type.contents Text
save_
#
# Contents to be added to templ_enum.cif listing database codes
#
save_database_list
loop_
_enumeration_set.state
_enumeration_set.detail
CAS 'Chemical Abstracts'
COD 'Crystallographic Open Database'
CSD 'Cambridge Structural Database'
ICSD 'Inorganic Crystal Structure Database'
MDF 'Metals Data File'
NDB 'Nucleic Acid Database'
PDB 'Protein Data Bank'
PDF 'Powder Diffraction File (JCPDS/ICDD)'
RCSB 'Research Collaboratory for Structural Bioinformatics'
EBI 'European Bioinformatics Institute'
save_
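As a rough illustration of how software might apply these draft definitions, the following Python sketch checks the rows of a database_related loop against the two enumerations above and against the uniqueness of the category key. The function `check_loop`, the dict layout of a row and the example rows are hypothetical; the entry codes are placeholders, not real database entries. The sketch also assumes that values are compared case-sensitively and that every row carries a relation.

```python
# Enumerated states copied from the draft definitions above.
DATABASE_IDS = {"CAS", "COD", "CSD", "ICSD", "MDF", "NDB", "PDB", "PDF", "RCSB", "EBI"}
RELATIONS = {"Identical", "Subset", "Superset", "Derived", "Common"}


def check_loop(rows):
    """Return (row_index, message) pairs for every problem found.

    Each row is a hypothetical dict keyed by the object ids of the
    draft data names: id, database_id, database_code, relation.
    """
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        key = row.get("id")
        if key is None:
            problems.append((index, "missing id"))
        elif key in seen_ids:
            # _database_related.id is the category key, so it must be unique.
            problems.append((index, "duplicate id"))
        seen_ids.add(key)
        if row.get("database_id") not in DATABASE_IDS:
            problems.append((index, "database_id not in database_list"))
        if row.get("relation") not in RELATIONS:
            problems.append((index, "relation not in enumeration"))
    return problems


# Placeholder rows: the second one reuses the key and names an unlisted database.
rows = [
    {"id": "1", "database_id": "COD", "database_code": "PLACEHOLDER-1", "relation": "Common"},
    {"id": "1", "database_id": "XYZ", "database_code": "PLACEHOLDER-2", "relation": "Subset"},
]
for index, message in check_loop(rows):
    print(index, message)
```

Keeping the database list in a separate set mirrors the import from templ_enum.cif: adding a database only changes that list, not the checking logic.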
Seems great at first glance. A few remarks, though:
1) The COD is expanded as "Crystallography Open Database", not "Crystallographic Open Database". It might not embody the best English grammar, but that's the historic name.
2) A usage example, e.g. placed in the category definition, would be useful.
3) The data names _database_related.database_id and _database_related.database_code look very similar (the distinction between an id and a code is not initially clear). Renaming _database_related.database_code to something that explicitly refers to the database entry (e.g. _database_related.entry_code or _database_related.database_entry_code) might help. Of course, this is just a suggestion.
The latest version contains these definitions. I have chosen 'entry_code' as the alternative. Is anybody able to produce a realistic example to include in the category definition?
I will replace this issue with an enhancement issue to provide an example for the category definition.
The core DATABASE and DATABASE_CODE categories are intended to provide a single entry per datablock, usually inserted into the datablock when it is ingested by the database and then included when provided by the database. A better design would be to have the database name as an enumerated data name, which allows simple extension to other databases not included in the current list.
Additionally, a structure may be present in multiple repositories. While including information for multiple data blocks is somewhat difficult in the single crystal context (and unnecessary), in some contexts (for example topological databases) the particular structure may be unambiguously present in multiple collections. Therefore, this category should ideally be looped. I suggest a new category (to avoid the current category, which is to be considered "authoritative")
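The difference between the two designs can be sketched in Python. With one data name per database, every new database needs a new data name; with a looped category, the database becomes a value in a row. The helper `to_related_rows`, the dict representation of a data block and the placeholder codes below are all hypothetical, and the row keys simply follow the draft data names discussed in this thread.

```python
def to_related_rows(block):
    """Turn per-database code items into rows of a looped category.

    `block` is a hypothetical dict of data name -> value. Any item named
    '_database_code.<NAME>' becomes one row whose database identifier is
    <NAME>, so supporting a further database needs no new data name.
    """
    prefix = "_database_code."
    rows = []
    for name in sorted(block):
        if name.startswith(prefix):
            rows.append(
                {
                    "id": str(len(rows) + 1),
                    "database_id": name[len(prefix):],
                    "database_code": block[name],
                }
            )
    return rows


# Placeholder values only; these are not real database entries.
block = {
    "_database_code.CSD": "PLACEHOLDER-A",
    "_database_code.ICSD": "PLACEHOLDER-B",
    "_cell.length_a": "5.0",
}
for row in to_related_rows(block):
    print(row["id"], row["database_id"], row["database_code"])
```

Because each row has its own key, the same structure can be listed for any number of repositories without touching the single "authoritative" entry.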