Harmonisation of dictionary naming convention

nautolycus commented 3 months ago

@vaitkus and I were musing over the wide variation in names and references to the various dictionaries, and note the following scatter of designators (G means version on GitHub (sometimes corresponds to DDL1/DDLm versions), P refers to the current latest versions on the wwPDB site (over which COMCIFS has no say). We feel it would be beneficial to harmonise these, in ways suggested below the table. I raise this as an issue in the core area, but it relates to all active dictionary projects. If we approve a harmonisation project, I suggest making changes initially to the electron density dictionary, which I'm currently revising and which has no active input from a wider community.

dictionary	ITG1e	ITG2e	Website	Logo	File	DDLm _dictionary.title
Core	coreCIF	coreCIF	coreCIF	CIF_core	cif_core.dic	CORE_DIC
Powder	pdCIF	pdCIF	pdCIF	CIF_pd	cif_pd.dic
					cif_pow.dic (G)	CIF_POW
Modulated struct	msCIF	msCIF	msCIF	CIF_ms	cif_ms.dic	CIF_MS
Electron density	rhoCIF	rhoCIF	rhoCIF	CIF_rho	cif_rho.dic	Cif_rho
Macromolecular	mmCIF	mmCIF	mmCIF	CIF_mm	cif_mm.dic	N/A
					mmcif_std.dic (P)
					mmcif_pdbx.dic (P)
Image	imgCIF	imgCIF	imgCIF	CIF_img	cif_img.dic	Cif_img.dic
					mmcif_img.dic (P)
Symmetry	symCIF	symCIF	symCIF	CIF_sym	cif_sym.dic	N/A
Restraints	N/A	.	.	CIF_restraints	cif_core_restraints.dic
					cif_restr.dic (G)	CIF_RSTR
Twinning	N/A	twinCIF	.	CIF_twinning	cif_twinning.dic
					cif_twin.dic (G)	CIF_TWIN
Magnetic	N/A	magCIF	magCIF	CIF_mag	cif_mag.dic	MAGNETIC_CIF
Topology	N/A	.	topoCIF	CIF_topo	cif_topology.dic	TOPOLOGY_CIF
					Topology.dic (G)
Multiblock core	N/A	.	"multiblock coreCIF"	.	cif_multiblock.dic	MULTIBLOCK_DIC
					multi_block_core.dic (G)


DDL1	.	.	.	CIF_DDL	ddl_core.dic	N/A
DDL2	.	.	.	CIF_DDL	mmcif_ddl.dic	N/A
					mmcif_ddl.dic (P)
DDLm	N/A	.	.	CIF_DDL	ddl.dic	DDL_DIC
					DDLm.dic (G)

Specific suggestions:

cif_pow.dic -> cif_pd
cif-core_restraints.dic -> cif_restr.dic and logo shortened to CIF_restr
cif_twinning.dic -> cif_twin.dic and logo shortened to CIF_twin
Topology.dic and cif_topology.dic -> cif_topo.cif and logo shortened to CIF_topo
multi_block_core.dic and cif_multiblock.dic -> cif_multi.dic and logo commissioned in form CIF_multi
Subject to above, short forms restrCIF, twinCIF, topoCIF and multiCIF be adopted where relevant in itg2e and IUCr website.
Retain mmcif_ddl.dic for the DDL2 dictionary, as no DDL2 development outside of the mmCIF area is envisaged.
_dictionary.title same as the filename without .dic suffix

Other comments

Do we need distinct logos for DDL1, DDL2 and DDLm?
Not sure what is the best approach for DDL1 and DDL2 file names. Thoughts welcome.
It might also be nice to homogenise the naming conventions of dictionary repositories on GitHub since now the names are a bit chaotic (e.g. cif_core, TopoCIF, Powder_Dictionary). We think the repository renaming on GitHub is not very painful, since the URL of the old repository should automatically redirect the user to the new URL

jamesrhester commented 2 months ago

Great work in compiling this table. I agree with all the suggestions.

I do not think DDLs need their own logos. That just draws attention to differences which are for most purposes irrelevant.

I think ddl filenames of the form ddl1.dic and ddlm.dic would work. We can't rename something the wwPDB have custody of - but we could make ddl2.dic resolve to mmcif_ddl.dic in those situations we have control over.

Yes, I agree that we should unify the repository names.

vaitkus commented 2 months ago

One more question -- should _dictionary.title attribute values be written in all uppercase letters as is being done now? This is mostly a stylistic consideration since _dictionary.title values are case-insensitive.

For example, the current name of the cif_core.dic dictionary is CORE_CIF, so should the name value be changed to cif_core or to CIF_CORE? I think that having it in uppercase makes sense for two reasons:

The _dictionary.title attribute value is also used as the category name of the HEAD category (see description of _name.category_id). Having it in uppercase would be consistent with all the other category names.
The _dictionary.title attribute value is also used as the name of the main dictionary data block (i.e. data_CIF_CORE). Having it in uppercase would be more consistent with the rest of the dictionary (category save frame names are written in all uppercase, data name save frames are written in all-lowercase).

jamesrhester commented 2 months ago

All uppercase works for me.

nautolycus commented 2 months ago

No objection.

vaitkus commented 2 months ago

@jamesrhester, I guess the "Change approved" label is appropriate or should this discussion be circulated elsewhere as well?

jamesrhester commented 2 months ago

Raising an issue/pull request in the respective repositories would be sufficient, referencing this issue.

vaitkus commented 2 months ago

It was agreed in an email exchange between @jamesrhester, @nautolycus and @vaitkus that repositories should be named same as the dictionary title except in all lower case, e.g. the CIF_CORE dictionary should reside in a repository called cif_core.

There is actually one more name that would be nice to standardise -- the name of the Head save frame. This save frame is referenced by other dictionaries when importing the entire dictionary, therefore I propose to keep the name of this save frame same as the data name. That is, the cif_core.dic dictionary file would have a single CIF_CORE data block which in turn would contain the CIF_CORE save frame (see PR #490 for an example of this layout). The import statement would then look something like:

[{'dupl':Ignore  'file':cif_core.dic  'mode':Full  'save':CIF_CORE}]

Alternatively, we could call the HEAD save frame something neutral like "head", since the dictionary name would already be reflected in import statements by the dictionary filename. The import statement would then look something like:

[{'dupl':Ignore  'file':cif_core.dic  'mode':Full  'save':HEAD}]

Which one looks better?

jamesrhester commented 1 month ago

I think the second one ('HEAD') provides more information to the casual reader, immediately showing that the whole of the imported dictionary is being enhanced.

vaitkus commented 1 month ago

Having the same name for the data block (dictionary) and the head save frame, while syntactically valid, seems to have already caused some issue with the external software (see https://github.com/emmo-repo/CIF-ontology/issues/194), so it would be best to avoid it.

I think the second one ('HEAD') provides more information to the casual reader, immediately showing that the whole of the imported dictionary is being enhanced.

I fully agree with these points, but after thinking about it some more, I think that we should also avoid having the same HEAD category name in all dictionaries since it might obfuscate certain errors (e.g. those dealing with import statements). Instead, I suggest that the HEAD category names be constructed by appending the _HEAD to the dictionary name (e.g. CIF_CORE_HEAD, CIF_PD_HEAD). What do you think?

jamesrhester commented 1 month ago

I think <dictionary_name>_HEAD is an excellent idea. I had also started wondering if it was really a good idea to have the same category name (HEAD) in many dictionaries.

nautolycus commented 1 month ago

I'm also happy to go with <dictionary_name>_HEAD. In favour of the simple HEAD proposal, that has, to my mind, the effect of establishing a formal starting point, just as '/' indicates the root of a filesystem. Importing to a new HEAD has, in this metaphor, a similar effect as a chroot call. On the other hand, I have in the past tripped up in a chroot'd environment precisely because inspection of the root designator doesn't give you a clue that you might be in an unexpected place.

COMCIFS / cif_core

Harmonisation of dictionary naming convention #488

Specific suggestions:

Other comments