EnzymeML / PyEnzyme

🧬 - Data management and modeling framework based on EnzymeML.
BSD 2-Clause "Simplified" License

Add identifier types to AbstractSpecies base class #56

Open JR-1991 opened 1 year ago

JR-1991 commented 1 year ago

Description

Proteins and reactants may be purchased from vendors and thus possess serial numbers or DOIs that lead to more information. EnzymeML needs to be able to record these.

Solution

Since identifiers come in many flavors, the EnzymeML data model needs to accommodate a variety of identifier types. Hence, it may be necessary to define a class that holds the actual persistent identifier and its scheme. This would result in the following XML:

<annotation>
    <enzymeml:persistentIdentifiers>
        <enzymeml:pID value="10.1093/ajae/aaq063" type="DOI" />
        <enzymeml:pID value="873673647367436" type="Serialnumber" />
    </enzymeml:persistentIdentifiers>
</annotation>
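
A minimal sketch of the class idea, to make it concrete. The names `PersistentIdentifier` and `to_annotation` are hypothetical and not part of PyEnzyme, and the namespace URI is a placeholder assumption rather than the official EnzymeML namespace:

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Assumption: placeholder namespace URI, NOT the official EnzymeML namespace.
ENZYMEML_NS = "https://example.org/enzymeml"
ET.register_namespace("enzymeml", ENZYMEML_NS)


@dataclass(frozen=True)
class PersistentIdentifier:
    """Hypothetical holder for an identifier value and its scheme."""

    value: str
    type: str


def to_annotation(identifiers):
    """Build an <annotation> element that mirrors the XML shown above."""
    annotation = ET.Element("annotation")
    container = ET.SubElement(
        annotation, f"{{{ENZYMEML_NS}}}persistentIdentifiers"
    )
    for pid in identifiers:
        # One <enzymeml:pID> child per identifier, with value and type attributes.
        ET.SubElement(
            container,
            f"{{{ENZYMEML_NS}}}pID",
            {"value": pid.value, "type": pid.type},
        )
    return annotation


pids = [
    PersistentIdentifier("10.1093/ajae/aaq063", "DOI"),
    PersistentIdentifier("873673647367436", "Serialnumber"),
]
xml_text = ET.tostring(to_annotation(pids), encoding="unicode")
print(xml_text)
```

Keeping the scheme in a separate `type` field means new identifier flavors can be added without changing the class itself.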

fbergmann commented 1 year ago

Isn't a serial number too generic? You would probably have to know the specifics to be able to resolve it later on, and you might need more than one field to describe that. Maybe adding a scheme to the identifiers.org registry would be a better idea; from there you'd just need to store URLs, which ensures they can be resolved later:

DOI example: https://registry.identifiers.org/registry/doi
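
To illustrate what "just storing URLs" could look like, here is a small sketch that composes a resolvable identifiers.org URL from a registered prefix and an accession, using the `https://identifiers.org/<prefix>:<accession>` form. The helper name `identifiers_org_url` is hypothetical:

```python
from urllib.parse import quote


def identifiers_org_url(prefix: str, accession: str) -> str:
    """Compose https://identifiers.org/<prefix>:<accession> for a registered prefix."""
    # Keep "/" and ":" unescaped, since DOIs and similar accessions contain them.
    return f"https://identifiers.org/{prefix}:{quote(accession, safe='/:')}"


print(identifiers_org_url("doi", "10.1093/ajae/aaq063"))
# → https://identifiers.org/doi:10.1093/ajae/aaq063
```

This only works for schemes that already have a prefix in the registry.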

JR-1991 commented 1 year ago

Makes sense. So instead of the plain serial number/identifier, we'd provide a URL to identifiers.org, provided a scheme exists? The only concern I have is that this requires these schemes to be registered beforehand.

As an intermediate solution in the case of a product, would it make sense to store the URL to the product page as well as the identifier? The URL might change over the years, but the ID and the hint to the provider given in the URL would persist.

itbjpl commented 1 year ago

We would refer to a DOI if the synthesis of the catalyst or the compound is described elsewhere, in a paper or in a dataset. This would work well via the identifiers.org registry.
But how can we refer to an enzyme or a compound that has been supplied by a commercial vendor? We should even identify the lot or batch.

fbergmann commented 1 year ago

This is exactly what I meant about just referencing a generic serial number: it would be hard to resolve it back, and an arbitrary number of fields could be needed to describe it. Of course, it would always be possible to encode these parameters in a URL. If it is not one of the identifiers.org schemes, as in the well-known examples (say, for EC / KEGG compounds), it could be something like:

https://enzymeml.org/compound?company=<companyname>&serialNo=<number>&batch.=...... 

or some such.
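
A rough sketch of how such a URL could be assembled. The base URL comes from the example above and is illustrative, not a live endpoint; the helper `vendor_url`, the parameter name `batch`, and all values are made-up assumptions:

```python
from urllib.parse import urlencode

# Base URL from the example above; illustrative only, not a live endpoint.
BASE_URL = "https://enzymeml.org/compound"


def vendor_url(**params: str) -> str:
    """Encode arbitrary key/value pairs into the query string of BASE_URL."""
    return f"{BASE_URL}?{urlencode(params)}"


# Hypothetical values, chosen only to show the encoding.
url = vendor_url(company="Example Vendor", serialNo="12345", batch="LOT-7")
print(url)
# → https://enzymeml.org/compound?company=Example+Vendor&serialNo=12345&batch=LOT-7
```

Since the function accepts any keyword arguments, the set of parameters is not fixed in advance.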

JR-1991 commented 1 year ago

Alright, so if I understand that correctly, @fbergmann, we would set up an EnzymeML identifier, as you described above, on identifiers.org? This would then be reflected in the XML, similar to how it has been done for units in SBML?

fbergmann commented 1 year ago

What I meant to say was that using identifiers.org and registering a scheme there would be ideal. For cases where this is not possible, the URL scheme can still encode all parameters that might otherwise be needed.

JR-1991 commented 1 year ago

Sounds like a plan! According to your example the URL would include the following parameters:

@itbjpl anything else we should add? Once we have compiled everything, I'd register it.

fbergmann commented 1 year ago

That was just an example to point out that arbitrary parameters could be encoded this way, which another tool could then later interpret. This will be needed, since a serial number alone is often not enough information to get back to what it is. I'm not suggesting we include those three specifically.
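
For the "another tool could later interpret" part, a consuming tool might read the parameters back roughly like this. The helper name `read_parameters` and the example URL with its parameter names and values are hypothetical:

```python
from urllib.parse import parse_qs, urlsplit


def read_parameters(url: str) -> dict:
    """Return the query parameters of a URL as a flat dict."""
    query = urlsplit(url).query
    # parse_qs maps each key to a list of values; keep the first value per key.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Hypothetical URL; the parameter names and values are made up for illustration.
example = "https://enzymeml.org/compound?company=Example+Vendor&serialNo=12345&batch=LOT-7"
params = read_parameters(example)
print(params)
# → {'company': 'Example Vendor', 'serialNo': '12345', 'batch': 'LOT-7'}
```

The tool does not need to know the parameter set up front; it can pick out whichever keys it understands and ignore the rest.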

JR-1991 commented 1 year ago

Of course, it's a good starting point though in my opinion. We could work on a concrete schema within this issue or the webinar?