COMCIFS / CIF_JSON

A JSON schema for transferring CIF information.

Flat Loops #1

Open zthatch opened 3 years ago

zthatch commented 3 years ago

Hello!

I am a member of the Billinge Group at Columbia, working on a pydantic adaptation of your JSON schema, and I just realized that you flatten the tables in your example.

This seems like it would be an issue when converting back to CIF from JSON, and it loses some of the structure that was present in the original file.

It seems like a simple solution might be to give values a hash ID, then put a loop property alongside frames in the datablock, and have that loop property contain a patternProperty that references the values. I'll submit a PR with this.

I only realized this issue once I saw that pycifrw also flattens tables when you ask for a dictionary of the datablock keys. Is this linked to that?

Let me know!

(Also, if you have any script for parsing CIFs into this format, I would love to see it)

jcbollinger commented 3 years ago

Hello!

Thanks for your interest in CIF-JSON. I take it that by "flattening the tables" you mean that CIF-JSON does not preserve the lexical association of data items into loop structures that is present in the native CIF serialization format. You are right that that means that one cannot reliably read CIF data from a native-format CIF file, convert it to CIF-JSON format, and then accurately reconstitute the original native-format file. This is known and intentionally accepted.
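To make the loss concrete, here is a minimal Python sketch. It assumes a simplified model in which a native-format block is a list of loops (each a mapping from data name to a column of values) and the flattened block is a single mapping from data names to value lists. The `flatten` helper and the sample data are illustrative only and are not part of any CIF library or of the CIF-JSON specification.

```python
def flatten(loops):
    """Merge a list of loops into one flat mapping of data name -> values.

    `loops` is a hypothetical in-memory model of a native CIF block:
    each loop is a dict whose keys are data names and whose values are
    equal-length columns.
    """
    flat = {}
    for loop in loops:
        for name, column in loop.items():
            flat[name] = list(column)
    return flat


# Two different lexical layouts of the same two data items.
one_loop = [
    {"_atom_site.label": ["C1", "O1"], "_atom_site.type_symbol": ["C", "O"]},
]
two_loops = [
    {"_atom_site.label": ["C1", "O1"]},
    {"_atom_site.type_symbol": ["C", "O"]},
]

# Both layouts flatten to the same mapping, so the original grouping
# cannot be recovered from the flat form alone.
same_after_flattening = flatten(one_loop) == flatten(two_loops)
```

Because `same_after_flattening` is true, nothing in the flat form records which layout the source file used.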

A diverse group spent considerable time hashing out the details of CIF-JSON, and one of the questions we considered was whether it should provide a high fidelity representation of a native-format CIF file. We chose not to go this route, in part because the Crystallography Open Database already has such a JSON representation (which we sometimes referred to as "COD-JSON"). CIF-JSON is aimed mostly at different use cases, mainly as a format for conveying data to applications, especially web applications, for their direct consumption. It is not primarily intended as an archive format, for which I would personally recommend the native serialization format. We reasoned that the target applications know in advance which data names they want to use and how they are related to each other, which justified choosing a simpler JSON representation. That makes CIF-JSON slightly easier for such applications to use, too.

I observe also that the relationships among CIF data items are first and foremost defined by the appropriate data dictionary. Items' organization into loop structures in native CIF format files is derived from that; it is not a primary source of information in itself. Consequently, as long as all the needed data items are included, a native-format file with appropriate loops can be formed from data in CIF-JSON format by reference to the dictionary. And of course, CIF dictionaries expressed in CIF format are machine actionable, so that can, in principle, be automated.
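As a rough illustration of that regrouping, the Python sketch below rebuilds loop groupings from a flat block. It assumes the block is a mapping from data names to value lists, and it uses a hand-written `CATEGORY_OF` table as a hypothetical stand-in for a real dictionary lookup; an actual implementation would read the category of each data name from the CIF dictionary itself.

```python
from collections import defaultdict

# Hypothetical stand-in for a dictionary lookup: data name -> category.
CATEGORY_OF = {
    "_cell.length_a": "cell",
    "_atom_site.label": "atom_site",
    "_atom_site.fract_x": "atom_site",
}


def group_into_loops(block, category_of):
    """Group a flat block's data items by category.

    Items whose name is missing from `category_of` are kept in a group
    of their own, keyed by the data name. All columns in one category
    must have the same length to form a valid loop.
    """
    by_category = defaultdict(dict)
    for name, values in block.items():
        by_category[category_of.get(name, name)][name] = values

    for category, columns in by_category.items():
        lengths = {len(column) for column in columns.values()}
        if len(lengths) > 1:
            raise ValueError(f"columns in {category!r} differ in length")
    return dict(by_category)


flat_block = {
    "_cell.length_a": ["5.43"],
    "_atom_site.label": ["Si1", "Si2"],
    "_atom_site.fract_x": ["0.0", "0.25"],
}
loops = group_into_loops(flat_block, CATEGORY_OF)
```

Here `loops["atom_site"]` holds both atom-site columns together, ready to be written out as a single loop, while `loops["cell"]` holds the single-row cell item.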

I don't have a script per se for converting native CIF format into CIF-JSON, but I do somewhere have a C program for that purpose. I'll see whether I can dig that up.

zthatch commented 3 years ago

Hello!

That makes perfect sense.

Our group is working on an application that runs a specific set of queries, and I've realized that the patternProperty nature of the keys in this schema makes it very difficult to anticipate, with a basic query language command (MongoDB query language in our case), where someone may have hidden the data we are looking for.

We have instead opted to preprocess the data before ingesting it into the database, and then archive it in a simpler, more uniform schema. This will keep our queries simpler and more uniform in turn.
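A minimal Python sketch of that kind of preprocessing is below. It is only an illustration under stated assumptions, not our actual schema: the input is assumed to be a mapping from block names to flat blocks (data name to list of values), and the `to_uniform_records` helper and the `block`, `name`, and `values` field names are hypothetical.

```python
def to_uniform_records(blocks):
    """Turn data-name keys into values of a fixed "name" field.

    `blocks` maps block names to flat blocks (data name -> list of
    values). Entries that are not lists (for example nested frames)
    are skipped in this sketch.
    """
    records = []
    for block_name, block in blocks.items():
        for data_name, values in block.items():
            if not isinstance(values, list):
                continue  # out of scope for this sketch
            records.append(
                {
                    "block": block_name,
                    "name": data_name.lower(),
                    "values": values,
                }
            )
    return records


sample = {
    "si": {
        "_cell.length_a": ["5.43"],
        "_atom_site.label": ["Si1", "Si2"],
    }
}
records = to_uniform_records(sample)
```

With every record sharing the same three keys, a filter such as `{"name": "_cell.length_a"}` can find an item without knowing in advance which keys a given datablock contains.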

Thank you for your time and for helping me understand the use case!