HumanBrainProject / openMINDS

openMINDS comprises a set of metadata models for research products in the field of neuroscience.
MIT License

Discussion about openMINDS-chemistry extension. #32

Closed Peyman-N closed 1 year ago

Peyman-N commented 2 years ago

In response to the issues raised on openMINDS-ephys (#7, #8, #9, etc.), @apdavison and I are fast-tracking openMINDS-chemistry. The extension is ready and awaits a final decision on how we handle the solute and solvent. Option one, presented in the following figure and available in this repository, is more intuitive to understand: option1

Option two, presented in the following figure and available in this repository, adds fewer objects to the KG and makes it easier to distinguish between liquid solutions and solid ones: option2

@lzehl @UlrikeS91 what is the best course of action in your opinion? Finally, I was thinking of adding antibody-specific schemas to this extension, but if we decide to do this we should change the name of the extension to openMINDS-substances. What do you think about that?

lzehl commented 2 years ago

@Peyman-N and @apdavison thanks for pushing this. I think both options have good aspects but I don't think either is a perfect solution.

Before providing feedback I'd like to summarize some information around the topic:

A chemical product could be interpreted in two different ways: It is either some chemical material that I can buy or the result of a chemical reaction (chemical transformation of one set of chemical substances to another). For the first interpretation, a chemical product can be either a chemical substance or a chemical mixture of substances.

A chemical substance can be a chemical element, chemical compound, or a chemical alloy.

A chemical element consists only of atoms that all have the same numbers of protons in their nuclei.

A chemical compound is composed of many molecular entities composed of atoms from more than one element held together by chemical bonds.

A molecular entity (or chemical entity) is any distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity. A molecule is a group of two or more atoms held together by chemical bonds.

A chemical alloy is a material composed of chemical elements of which at least one is a metal, where the material retains all the properties of that/those metal(s). A chemical alloy can also be defined as a solid chemical mixture.

A chemical mixture is a material made up of two or more different chemical substances which are not chemically bonded. A chemical mixture can be uniform (homogeneous) or non-uniform (heterogeneous). The three big families of chemical mixtures are solutions (uniform), suspensions (heterogeneous), and colloids (heterogeneous).

Uniform mixtures have a uniform appearance, or only one visible phase, because the particles are evenly distributed. Solutions are uniform mixtures where the ratio of solute to solvent remains the same throughout the solution, particles are not visible with the naked eye, solutes will not settle out after any period of time and solutes can't be removed by physical methods such as filtration or centrifugation. Non-uniform mixtures have a non-uniform appearance, where the constituent substances are easily distinguishable from one another (often, but not always, in different phases). Whether a mixture is a solution, suspension or colloid may depend on the ratio of solute(s) to solvent.

Based on this information, here are my FEEDBACK / QUESTIONS for the openMINDS_chemistry extension:

We need to first clarify for what the openMINDS_chemistry extension should be used. Is it for describing all chemical substances with their properties, for describing chemical mixtures, for describing who made the chemical substances or mixtures, AND/OR for describing the provenance and properties of chemical reactions, etc.?

In my opinion we currently have the need for describing chemical mixtures with their properties, and if a chemical mixture or a chemical substance was used who made it. We here need to clearly distinguish between general descriptions of chemical substances/mixtures VS who made them to allow for maximum data integration. Later on chemical reactions could be added to the extension as well.

I also think we should keep very well-defined chemical substances as controlled terms instead of defining separate schemas. This means chemical elements, ions, molecules, compounds (including antibodies), etc. should remain in the controlled terms for now and link to external resources. Even if purchasable substances are made by different vendors, the chemical description of a specific chemical substance has to be exactly the same; otherwise it is a different chemical substance. Based on the definitions above, we can keep those terms currently in controlledTerms/MolecularEntities.

I suggest therefore to modify the current options for the extensions (a suggestion from my side will follow asap).

lzehl commented 2 years ago

a paper that could be of interest here https://doi.org/10.1186/s13321-019-0357-4

lzehl commented 2 years ago

Here some suggestions for the extensions focusing on vendor independent metadata for chemical mixtures (keeping your suggestions, the gathered information and the approach of the paper in mind):

Schema suggestion ChemicalMixture:

| property | value count | value type |
|---|---|---|
| "type" | 1 | controlledTerms/ChemicalMixtureType |
| "solvent" | 1 | ChemicalIngredient OR ChemicalMixture |
| "solute" | 1-N | ChemicalIngredient |
| "aggregationState" | 1 | controlledTerms/AggregationState |
| "remarks" | 0-1 | string |

Schema suggestion ChemicalIngredient:

| property | value count | value type |
|---|---|---|
| "chemicalSubstance" | 1 | controlledTerms/MolecularEntity |
| "concentration" | 0-1 | core/QuantitativeValue OR core/QuantitativeValueRange |
| "mass" | 0-1 | core/QuantitativeValue OR core/QuantitativeValueRange |
| "aggregationState" | 1 | controlledTerms/AggregationState |
| "remarks" | 0-1 | string |

controlledTerms/ChemicalMixtureType instances: solution, suspension, colloid

controlledTerms/AggregationState instances: solid, liquid, gas

"remarks" are meant to be used for any additional description or usage suggestion (prop. name could be different).
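
To make the shape of this suggestion easier to discuss, here is a minimal Python sketch of the two schemas. It is only an illustration, not the openMINDS implementation: the snake_case attribute names are my own, plain strings stand in for controlledTerms/MolecularEntity and core/QuantitativeValue(Range) instances, and the saline values are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Union


class ChemicalMixtureType(Enum):
    SOLUTION = "solution"
    SUSPENSION = "suspension"
    COLLOID = "colloid"


class AggregationState(Enum):
    SOLID = "solid"
    LIQUID = "liquid"
    GAS = "gas"


@dataclass
class ChemicalIngredient:
    chemical_substance: str  # assumption: stands in for a controlledTerms/MolecularEntity link
    aggregation_state: AggregationState
    concentration: Optional[str] = None  # assumption: stands in for core/QuantitativeValue(Range)
    mass: Optional[str] = None  # assumption: stands in for core/QuantitativeValue(Range)
    remarks: Optional[str] = None


@dataclass
class ChemicalMixture:
    type: ChemicalMixtureType
    solvent: Union[ChemicalIngredient, "ChemicalMixture"]  # count 1, may itself be a mixture
    solute: List[ChemicalIngredient]  # count 1-N
    aggregation_state: AggregationState
    remarks: Optional[str] = None

    def __post_init__(self) -> None:
        # Enforce the "1-N" count on solute.
        if len(self.solute) < 1:
            raise ValueError("a mixture needs at least one solute (count 1-N)")


# Illustrative instance: saline described as a solution of sodium chloride in water.
water = ChemicalIngredient("water", AggregationState.LIQUID)
salt = ChemicalIngredient("sodium chloride", AggregationState.SOLID, concentration="0.9 % w/v")
saline = ChemicalMixture(
    type=ChemicalMixtureType.SOLUTION,
    solvent=water,
    solute=[salt],
    aggregation_state=AggregationState.LIQUID,
)
```

Because the solvent may be a ChemicalMixture, `saline` could in turn be passed as the solvent of another mixture.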

If we want to capture trademark information as well, I suggest using a separate schema that points back to the general chemical information. Other schemas can then point either directly to the general chemical schemas OR indirectly via the trademark schema if they want to specify the trademark they used. This way the general chemical information can be used without a trademark (e.g. when it is self-made, as the resulting entity of e.g. a mixing activity) or reused across trademarks, increasing data integration:

Schema suggestion Trademark:

| property | value count | value type |
|---|---|---|
| "product" | 1 | ChemicalMixture OR ChemicalIngredient |
| "name" | 1 | string |
| "manufacturer" | 1-N | core/Person OR core/Organization |
| "catalogReferenceNumber" | 0-1 | string |
| "serialNumber" | 0-1 | string |
| "digitalIdentifier" | 0-N | RRID OR ? |

@Peyman-N and @apdavison I hope this is useful feedback and will help you to shape a first (fast-tracked) extension for chemistry.
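
The direct/indirect linking described for the Trademark schema can be sketched roughly as follows. This is a hedged illustration only: `ChemicalProduct`, `resolve_product` and the product and vendor names are hypothetical placeholders, and a plain string stands in for core/Person or core/Organization.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ChemicalProduct:
    """Hypothetical placeholder for a ChemicalMixture or ChemicalIngredient instance."""
    name: str


@dataclass(frozen=True)
class Trademark:
    product: ChemicalProduct  # points back to the vendor-independent description
    name: str
    manufacturer: str  # assumption: stands in for core/Person or core/Organization
    catalog_reference_number: Optional[str] = None


def resolve_product(ref: Union[ChemicalProduct, Trademark]) -> ChemicalProduct:
    """Return the general chemical description, whether it was referenced directly or via a trademark."""
    return ref.product if isinstance(ref, Trademark) else ref


# One general description reused by two (made-up) trademarks.
acsf = ChemicalProduct("artificial cerebrospinal fluid")
vendor_a = Trademark(product=acsf, name="Hypothetical aCSF A", manufacturer="Vendor A")
vendor_b = Trademark(product=acsf, name="Hypothetical aCSF B", manufacturer="Vendor B")
```

Studies that used either trademark, or a self-made batch, all resolve to the same general description, which is the data-integration argument made above.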

UlrikeS91 commented 2 years ago

These are good suggestions, but I'm not sure they will work.

The trademark schema is a bit confusing. The trademark should be a property of the chemical or chemical mixture. Yes, a chemical or mixture can be distributed/manufactured at different places, but there are differences, and a trademark would also imply that it is a single protected entity. This setup, however, makes it possible to link a chemical or a mixture to two trademark schemas, which is not sensible. It would also make it hard to use a specific trademarked chemical or mixture for an experimental step.

The chemical mixture is mostly fine except for the aggregation state. What would this be for, e.g., a suspension? Liquid? Technically, that would be wrong. The solvent might be liquid, but the solute is a solid and by definition does not dissolve in the solvent. I would just remove it from there. The chemical mixture type in combination with the differentiation between solvent and solute should be enough.

I would like to rename the "ChemicalIngredient" to e.g. "ChemicalComponent", "ChemicalSubstance" or "ChemicalAgent" (but I think the latter might not be quite right...). "Ingredient" makes it too specific to the use in a mixture, but a chemical can be used as is, without mixing it.

I'm also struggling with the quantities. In principle, a chemical component shouldn't have a concentration because it's not a mixture (yet). What would be more interesting is the purity. Mass could be used but this would then be the mass used in the experiment or for the mixture, correct? If so, I would say it makes more sense to rename it to "usedQuantity" (which would then include mass and volume). Concentration should be moved to mixture and I would like to have a ratio as well (e.g. when mixture A and mixture B are mixed 1:2).

Here is the revision of @lzehl's suggestion:

Schema suggestion ChemicalMixture:

| property | value count | value type |
|---|---|---|
| "type" | 1 | controlledTerms/ChemicalMixtureType |
| "solvent" | 1 | ChemicalComponent OR ChemicalMixture |
| "solute" | 1-N | ChemicalComponent |
| "concentration" | 0-1 | core/QuantitativeValue OR core/QuantitativeValueRange |
| "ratio" * | 0 or 2-N | core/QuantitativeValue |
| "additionalRemarks" | 0-1 | string |
| "trademark" (embedded) | 0-1 | Trademark |

* this probably needs adjustments; own schema seems a bit too much but this solution may not be very explicit

Schema suggestion ChemicalComponent:

| property | value count | value type |
|---|---|---|
| "chemicalSubstance" | 1 | controlledTerms/MolecularEntity |
| "purity" | 0-1 | core/QuantitativeValue OR core/QuantitativeValueRange |
| "usedQuantity" | 0-1 | core/QuantitativeValue OR core/QuantitativeValueRange |
| "aggregationState" | 1 | controlledTerms/AggregationState |
| "additionalRemarks" | 0-1 | string |
| "trademark" (embedded) | 0-1 | Trademark |

controlledTerms/ChemicalMixtureType instances: solution, suspension, colloid

controlledTerms/AggregationState instances: solid, liquid, gas

Schema suggestion Trademark:

| property | value count | value type |
|---|---|---|
| "name" or "productName" | 1 | string |
| "manufacturer" | 1 | core/Person OR core/Organization |
| "catalogReferenceNumber" | 0-1 | string |
| "serialNumber" | 0-1 | string |
| "digitalIdentifier" | 0-1 | RRID |

apdavison commented 2 years ago

Thank you @lzehl and @UlrikeS91 for the feedback.

What is the extension for? Reproducibility and interpretation. A scientist in the same field should be able to recreate or purchase the same chemical (pure or mixture), or identify problems with the protocol that was used (e.g. if a specific batch from a given manufacturer was known to be contaminated). The reason for having this information in the graph rather than as text is so that people can search/aggregate (e.g. find studies that use a given anaesthetic).

I think we are in broad agreement on the schemas. I would like to propose some minor modifications, based on the observation that an ingredient of a mixture can itself be a mixture.

Chemical (could also be called Substance, ChemicalSubstance or ChemicalProduct; since we're not trying to describe chemical reactions, the meaning of 'product' is reasonably unambiguous in our context):

| property | value count | value type |
|---|---|---|
| "components" | 1-N | AmountOfSubstance |
| "aggregationState" | 0-1 | controlledTerms/AggregationState |
| "source" | 0-1 | ProductSource |
| "additionalRemarks" | 0-1 | string |

AmountOfSubstance (could also be called ChemicalIngredient or ChemicalComponent):

| property | value count | value type |
|---|---|---|
| "chemicalSubstance" | 1 | controlledTerms/MolecularEntity OR Chemical |
| "amount" | 0-1 | core/QuantitativeValue OR core/QuantitativeValueRange |
| "role" | 0-1 | controlledTerms/ChemicalRole |
| "purity" | 0-1 | number (between 0 and 1, or 0 and 100%) |

ProductSource (the name "Trademark" is not appropriate, as this information is needed for reproducibility/provenance, not intellectual property):

| property | value count | value type |
|---|---|---|
| "name" or "productName" | 1 | string |
| "provider" | 0-1 | core/Person OR core/Organization |
| "catalogReferenceNumber" | 0-1 | string |
| "serialNumber" or "batchNumber" | 0-1 | string |
| "digitalIdentifier" | 0-1 | RRID |

Some comments:

lzehl commented 2 years ago

@UlrikeS91 and @apdavison thanks for picking up my suggestions. Here is my feedback (for now only on the ProductSource / Trademark; feedback on the others will follow).

CommercialSource (I suggest that name because we are only interested in additional information when substances were not self-made but purchased):

| property | value count | value type | note |
|---|---|---|---|
| "productName" | 1 | string | instructions should be clear that the commercial name is meant here |
| "vendor" | 1 | core/Person OR core/Organization | let's reuse vendor from the strain/stock number schemas |
| "identifier" | 0-N | string | specify in instruction that these are not globally unique identifiers used by the vendors, e.g. catalogue reference number |
| "digitalIdentifier" | 0-1 | RRID | |

"batchNumber" and "serialNumber" would mean that this is exactly one purchased item of a product provided by a certain vendor. If you want to keep the connection direction from ChemicalProduct to CommercialSource, I suggest not asking for those properties and staying at the level of the overall product, because otherwise ChemicalProduct instances with stated sources cannot be reused by any other study. If you want to keep that level of precision, I strongly suggest reversing the connection, so that it goes from CommercialSource to ChemicalProduct, and allowing the respective schemas to either specify a ChemicalProduct directly without a source or specify it indirectly via a CommercialSource.

UlrikeS91 commented 2 years ago

@lzehl and I had a discussion about this last week to sort out our thoughts. We had very different ideas about the purpose of the schemas. I believe that this is part of why we couldn't land on anything yet. The semantics have additionally contributed to this.

Let's start with the semantics: The chart below illustrates and defines the two types of matter. A substance is part of a mixture, but a mixture cannot be a substance. So, the 2 main schemas could either be called "ChemicalSubstance" and "ChemicalMixture" or "ChemicalComponent" and "ChemicalMixture". In the context of the mixture, I find "ChemicalComponent" more suitable because "ChemicalSubstance" gives the impression that the schema describes the substance and not so much its role within a mixture. But substances can be used without mixing them beforehand, and in this context "ChemicalSubstance" would be better.
image

In any case, when more than one substance is possible or even supposed to be added, the schema cannot be called "chemicalSubstance" (or "chemical" or "chemicalComponent"). By definition, this would be a mixture. Similarly, a schema that is supposed to represent a mixture must link to at least 2 chemical substances. Therefore, a single property (e.g. called "chemicalComponents" or "chemicalSubstances") must be restricted to 2 - N, or there must be 2 required properties (1 - 1 or 1 - N), basically the "solvent" and "solute" approach that @lzehl proposed.
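
The two ways of guaranteeing at least two substances can be written down as count tables and checked generically. The following is a rough sketch under my own assumptions: instances are plain dictionaries, and `count_violations` is a hypothetical helper, not part of any openMINDS tooling.

```python
from typing import Any, Dict, List, Optional, Tuple

# property name -> (minimum count, maximum count); None means unbounded ("N")
Counts = Dict[str, Tuple[int, Optional[int]]]

# Option 1: a single property restricted to 2 - N.
SINGLE_PROPERTY: Counts = {"chemicalComponents": (2, None)}

# Option 2: two required properties, as in the solvent/solute approach.
SOLVENT_SOLUTE: Counts = {"solvent": (1, 1), "solute": (1, None)}


def count_violations(instance: Dict[str, Any], counts: Counts) -> List[str]:
    """Return the names of the properties whose number of values is out of range."""
    violations = []
    for prop, (lowest, highest) in counts.items():
        value = instance.get(prop)
        if value is None:
            number = 0
        elif isinstance(value, list):
            number = len(value)
        else:
            number = 1
        if number < lowest or (highest is not None and number > highest):
            violations.append(prop)
    return violations
```

For example, `count_violations({"solvent": "water"}, SOLVENT_SOLUTE)` reports `"solute"` as missing, and a single-entry `"chemicalComponents"` list is rejected under option 1. Either option ensures that a mixture links to at least two substances.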

@apdavison, I understand the issue that you have with "solvent" and "solute" in a non-solution context. I had the same issue; the terms are most commonly used in the context of a solution, but it turns out that they can be used for other types of mixtures. The broader definitions of the terms are:

solute - material present in the smaller amount in the mixture

solvent - material present in the larger amount in the mixture

The disadvantages of using these terms are quite obvious, but a major advantage is that the relation between the components of a mixture is better defined this way and the additional controlledTerm schema for defining those roles would not be needed. Unfortunately, there seem to be no universally applicable alternatives that describe the same relationship. Personally, I would prefer to stick to "solute"/"solvent" or come up with alternative property names for these instead of the controlledTerms schema for the role.
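
Under these broader definitions, the roles follow from the amounts alone. A small sketch of that idea, assuming all amounts are given in the same unit; `split_by_amount` is a hypothetical helper name and the numbers are illustrative:

```python
from typing import Dict, List, Tuple


def split_by_amount(amounts: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return (solvent, solutes): the component with the largest amount is the solvent.

    Assumes all amounts share one unit. On a tie, the first of the largest
    components in the dictionary is treated as the solvent.
    """
    if len(amounts) < 2:
        raise ValueError("a mixture needs at least two components")
    solvent = max(amounts, key=lambda name: amounts[name])
    solutes = [name for name in amounts if name != solvent]
    return solvent, solutes


# Illustrative amounts in grams.
solvent, solutes = split_by_amount({"sodium chloride": 9.0, "water": 1000.0})
```

This also shows one of the disadvantages: the role assignment is only well defined when the amounts are known and comparable.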

I'd like to propose a new setup that is hopefully more thought through than my previous suggestion. First the schemas and their content in tabular form and then a schematic overview. I also added potential instructions for each of the properties so that the purpose becomes more clear and the property names become less important in understanding the proposal (Note: Some of the property names are not ideal and I would be happy to exchange them with something else.)

ChemicalSubstance

| property | count | value | (potential) instruction |
|---|---|---|---|
| chemicalSubstance | 1 | CT/MolecularEntity or category/ChemicalSubstance | Add the name of the substance as defined by the molecular entities. |
| aggregationState | 1 | CT/AggregationState | When used in a mixture, add the aggregation state of the substance within the mixture. When used in its pure state, add the aggregation state of the substance itself. |
| (pro)portion | 0 - 1 | core/QV | When used in a mixture, enter the amount of the substance within the mixture (e.g. as concentration or as ratio***). When used in its pure state, add the used amount of the substance. |
| additionalRemarks | 0 - 1 | string | Enter additional remarks about the chemical substance (pure or within a mixture). |
| (commercial)Source | 0 - 1 | CommercialSource | Add the source of the chemical substance from which it can be purchased. |

*** this requires an additional CT/UnitOfMeasurement called e.g. "part" so that ratios can be described as e.g. 1 part substance X + 3 parts substance Y (real-life use case: antibody mixtures, e.g. 1:1000 dilution of antibody in a buffer solution)
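
A ratio given in "parts" can be turned into fractions of the whole mixture. A minimal sketch, assuming integer part counts; `parts_to_fractions` is a hypothetical helper and not an openMINDS function:

```python
from fractions import Fraction
from typing import Dict


def parts_to_fractions(parts: Dict[str, int]) -> Dict[str, Fraction]:
    """Convert 'n parts' per component into each component's share of the whole."""
    total = sum(parts.values())
    if total <= 0:
        raise ValueError("the total number of parts must be positive")
    return {name: Fraction(count, total) for name, count in parts.items()}


# 1 part substance X + 3 parts substance Y
shares = parts_to_fractions({"substance X": 1, "substance Y": 3})
```

Here substance X makes up 1/4 of the mixture and substance Y 3/4. `Fraction` keeps the shares exact instead of rounding them to floats.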

ChemicalMixture

| property | count | value | (potential) instruction |
|---|---|---|---|
| type | 1 | CT/ChemicalMixtureType | Add the type of the mixture. |
| solvent | 1 | ChemicalSubstance or ChemicalMixture | Add the substance or mixture of substances that is present in the larger amount within this mixture. |
| solute | 1 - N | ChemicalSubstance, ChemicalMixture or category/ChemicalSubstance | Add the substance or mixture of substances that is present in the smaller amount within this mixture. |
| (pro)portion | 0 - 1 | core/QV | When used in a mixture, enter the amount of the substance within the mixture (e.g. as concentration or as ratio***). Otherwise, leave blank. |
| additionalRemarks | 0 - 1 | string | Enter additional remarks about this chemical mixture. |
| (commercial)Source | 0 - 1 | CommercialSource | Add the source of the chemical mixture from which it can be purchased. |

CATEGORY/ChemicalSubstance

This is more of a future plan for special molecular entities that require additional information not covered by CT/MolecularEntity, such as antibodies, neuronal tracers, drugs/medicine, etc.

CommercialSource

| property | count | value | (potential) instruction |
|---|---|---|---|
| productName | 1 | string | Enter the name of the product as provided by the vendor. |
| vendor | 1 | category/LegalPerson | Add the vendor of the product. |
| purity | 0 - 1 | core/QV or core/QVR | Enter the purity of the product as stated by the vendor. |
| identifier | 0 - N | string | Enter one or several identifiers for this product excluding globally unique identifiers such as RRIDs. |
| digitalIdentifier | 0 - 1 | core/RRID | Add the globally unique identifier for this product (e.g. 'Research Resource Identifier' (RRID)). |

Note: I moved "purity" to the commercial source, because I envisioned it to be used only when the substance/mixture has been bought from somewhere.

Schematic overview

image

This overview also has some first examples for schemas that would use the category/ChemicalSubstance. These are only drafts! Property names and content can absolutely change, but it conveys the idea for which use cases the category may be useful.

I'm sorry for contributing to the confusions and I hope that this may solve at least some of them ☺️

lzehl commented 2 years ago

(Meeting Ulrike, Peyman, Andrew, Lyuba) Discussion results:

Add property "lookupLabel" to all schemas that do not have name.

Peyman-N commented 2 years ago

Here is what we agreed on so far. Please check, whether it matches your recollection.

ChemicalSubstance

| property | count | value | (potential) instruction |
|---|---|---|---|
| lookupLabel | 0 - 1 | string | Add an appropriate look up label for this substance. |
| molecularEntity | 1 | CT/MolecularEntity | Add the name of the molecular entity that makes up the substance. |
| purity | 0 - 1 | core/QV or core/QVR | Enter the purity of this chemical substance. |
| productSource | 0 - 1 | ProductSource | Add the source of the chemical substance. |
| additionalRemarks | 0 - 1 | string | Enter additional remarks about the chemical substance (pure or within a mixture). |

AmountOfChemical

| property | count | value | (potential) instruction |
|---|---|---|---|
| chemical | 1 | ChemicalSubstance or ChemicalMixture or MolecularEntity | -- |
| amount | 0 - 1 | core/QV | When used in a mixture, enter the amount of the substance within the mixture (e.g. as concentration or as ratio***). When used in its pure state, add the used amount of the substance. |

ChemicalMixture

| property | count | value | (potential) instruction |
|---|---|---|---|
| name | 0 - 1 | string | Add an appropriate name for this chemical mixture. |
| type | 1 | CT/ChemicalMixtureType | Add the type of the mixture. |
| component | 2 - N | AmountOfChemical | Add the components of this chemical mixture. |
| additionalRemarks | 0 - 1 | string | Enter additional remarks about this chemical mixture. |
| productSource | 0 - 1 | ProductSource | Add the source of the chemical mixture. |

ProductSource

| property | count | value | (potential) instruction |
|---|---|---|---|
| productName | 1 | string | Enter the name of the product as provided by the vendor. |
| provider | 1 | category/LegalPerson | Add the vendor of the product. |
| purity | 0 - 1 | core/QV or core/QVR | Enter the purity of the product as stated by the vendor. |
| identifier | 0 - N | string | Enter one or several identifiers for this product (e.g. catalog number), excluding globally unique identifiers such as RRIDs. |
| digitalIdentifier | 0 - 1 | core/RRID | Add the globally unique identifier for this product (e.g. 'Research Resource Identifier' (RRID)). |

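
Since AmountOfChemical can point to a ChemicalMixture, the agreed structure is recursive. The sketch below is only an illustration under stated assumptions: the classes are reduced to the properties needed here, strings stand in for CT/MolecularEntity and CT/ChemicalMixtureType instances, `molecular_entities` is a hypothetical helper, and the example chemicals are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class ChemicalSubstance:
    molecular_entity: str  # assumption: stands in for a CT/MolecularEntity instance


@dataclass
class AmountOfChemical:
    # A bare str stands in for a direct link to a MolecularEntity.
    chemical: Union[ChemicalSubstance, "ChemicalMixture", str]


@dataclass
class ChemicalMixture:
    type: str  # assumption: stands in for a CT/ChemicalMixtureType instance
    component: List[AmountOfChemical]

    def __post_init__(self) -> None:
        # Enforce the agreed "2 - N" count on component.
        if len(self.component) < 2:
            raise ValueError("component has count 2 - N")


def molecular_entities(chemical: Union[ChemicalSubstance, ChemicalMixture, str]) -> Set[str]:
    """Collect every molecular entity in a chemical, descending into nested mixtures."""
    if isinstance(chemical, str):
        return {chemical}
    if isinstance(chemical, ChemicalSubstance):
        return {chemical.molecular_entity}
    found: Set[str] = set()
    for part in chemical.component:
        found |= molecular_entities(part.chemical)
    return found


saline = ChemicalMixture(
    "solution",
    [AmountOfChemical("water"), AmountOfChemical(ChemicalSubstance("sodium chloride"))],
)
injection = ChemicalMixture(
    "solution",
    [AmountOfChemical(saline), AmountOfChemical(ChemicalSubstance("ketamine"))],
)
```

A check such as `"ketamine" in molecular_entities(injection)` is the kind of query that the search use case mentioned earlier in this thread (finding studies that use a given anaesthetic) would need, even when the substance is nested inside another mixture.
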
Peyman-N commented 2 years ago

Another point: we should decide on the name of this extension. For the moment we have three options:

1. openMINDS-chemistry
2. openMINDS-chemicals
3. openMINDS-chem

We want to avoid the misconception that we are trying to represent all of chemistry.

lzehl commented 2 years ago

@Peyman-N I would go with openMINDS_chemicals. @apdavison and @UlrikeS91 what do you think?

lzehl commented 2 years ago

@UlrikeS91 due to the close relation I would put tracer, antibodies, etc into the openMINDS_chemical extension.

UlrikeS91 commented 2 years ago

@Peyman-N your summary in this comment looks good to me and seems to have all the updates that we discussed :)

I would also vote for openMINDS_chemicals.

> @UlrikeS91 due to the close relation I would put tracer, antibodies, etc into the openMINDS_chemical extension.

@lzehl, yes. Good idea. Nice to collect them in one place and they are of course chemicals 😉

Peyman-N commented 2 years ago

@lzehl and @UlrikeS91 thanks a lot.

openMINDS_chemicals was my and @apdavison's preference too.

> @UlrikeS91 due to the close relation I would put tracer, antibodies, etc into the openMINDS_chemical extension.

Actually, I wanted to raise this with you. I asked Andrew about it on Monday.

Finally, @UlrikeS91 can you please create an openMINDS logo for openMINDS_chemicals?

apdavison commented 2 years ago

great. @Peyman-N I've forked your repository to https://github.com/HumanBrainProject/openMINDS_chemicals - please go ahead and update the JSON schemas.

apdavison commented 2 years ago

@lzehl @UlrikeS91 @Peyman-N you've all got admin permissions for HumanBrainProject/openMINDS_chemicals

UlrikeS91 commented 2 years ago

> Finally, @UlrikeS91 can you please create an openMINDS logo for openMINDS_chemicals?

Done ☺️

Peyman-N commented 2 years ago

@UlrikeS91 thanks a lot ☺️

Everyone, version 1 is ready 🥳, thanks a lot for your input and contributions. Please let me know if you have any additional remarks.

P.S. I will create an issue on openMINDS-chemicals so we can discuss the future schemas that we could add to version 2, such as tracers, antibodies, etc.

lzehl commented 2 years ago

@Peyman-N for consistency with the other repositories the name of the repository has to be openMINDS_chemicals not openMINDS-chemicals

lzehl commented 2 years ago

@Peyman-N never mind, I just saw that Andrew already set it up correctly 👍 🙂

lzehl commented 1 year ago

@apdavison, @Peyman-N, @UlrikeS91 I think we can close this now. The extension can be further discussed in its own GitHub repository: https://github.com/HumanBrainProject/openMINDS_chemicals