biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

General policy for curating shared information for subdivided prefixes #1222

Open bgyori opened 1 week ago

bgyori commented 1 week ago

We have now run into this issue in a couple of different settings, so we might want to discuss a general approach to the problem of curating metadata that is shared across the semantic spaces of a given database.

Example: pubchem.

Currently, the data that is generic for the entire database is not propagated in a predictable way into semantic space-specific prefixes. The fact that some data is surfaced via prioritized mappings makes the situation more complicated. For example, we have

So the main question of this issue is: should this type of data be standardized based on its shared vs subspace-specific status? If so, what should be the general policy for this?

This is relevant for, e.g., #1214 and #1204, and for many existing records.

cthoyt commented 1 week ago

See the addition to the curation guidelines in https://github.com/biopragmatics/bioregistry/pull/1217

bgyori commented 1 week ago

Thanks @cthoyt! In addition to thinking about this problem (which is more general than just publications) for new prefixes, I think there are important implications for existing prefixes, since this problem is pretty pervasive. I was curious to see empirically what the situation looks like, so I wrote a bit of code to check, using the exported registry so that the effects of external mappings are accounted for:

from collections import defaultdict

import matplotlib.pyplot as plt
import requests

res = requests.get('https://raw.githubusercontent.com/biopragmatics/bioregistry/'
                   'refs/heads/main/exports/registry/registry.json')
registry = res.json()

# Organizing prefixes by root
prefixes_by_root = defaultdict(list)
for prefix, data in registry.items():
    prefix_parts = prefix.split('.')
    prefix_root = prefix_parts[0] if len(prefix_parts) == 1 else '.'.join(prefix_parts[:-1])
    prefixes_by_root[prefix_root].append(prefix)

# Quantifying families with shared root
families = {root: prefixes for root, prefixes in prefixes_by_root.items()
            if len(prefixes) > 1}
nprefixes = len(registry)
nfamilies = len(families)
nprefixes_in_families = sum(len(prefixes) for prefixes in families.values())
print(f'Total prefixes: {nprefixes}, out of which {nprefixes_in_families} '
      f'prefixes are in a total of {nfamilies} families.')

# Count families whose root is itself a registered prefix
nrooted = sum(1 for root, prefixes in families.items() if root in prefixes)
print(f'Out of {nfamilies} families, {nrooted} have a root, '
      f'the remaining {nfamilies - nrooted} are rootless.')

# Analyzing publications

pub_prevalences = []

key_priority = ['pubmed', 'doi', 'url']
def get_pub_key(pub):
    for key in key_priority:
        if key in pub:
            return pub[key]

for root, prefixes in families.items():
    prefixes_by_pub = defaultdict(list)
    for prefix in prefixes:
        for publication in registry[prefix].get('publications', []):
            prefixes_by_pub[get_pub_key(publication)].append(prefix)
    for pub, prefixes_for_pub in prefixes_by_pub.items():
        pub_prevalence = len(prefixes_for_pub) / len(prefixes)
        pub_prevalences.append(pub_prevalence)

_ = plt.boxplot(pub_prevalences)
plt.show()  # needed when running as a script rather than in a notebook

which produces

Total prefixes: 1802, out of which 505 prefixes are in a total of 163 families.
Out of 163 families, 67 have a root, the remaining 96 are rootless.

[image: box plot of pub_prevalences, the fraction of prefixes in a family that list each publication]

This means that a really large proportion (~28%) of prefixes are in a "family" of this type. Though these choices are often well justified based on the primary focus of a database, there is no clear pattern in terms of which families have a "root" prefix (e.g., we have pfam and pfam.clan in the pfam family but we don't have a pubchem root). Per the box plot above, publications are curated in a pretty ad-hoc way for existing prefixes in a family. It's likely that a small number of these publications are prefix-specific but most of them should be shared across the family.

I wonder if it would be worth trying to standardize some of this automatically (with post-hoc quality control/curation) for publications, and potentially for other metadata with a similar status.
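
To make the idea concrete, here is a minimal sketch of what such an automatic pass could look like. It is not existing Bioregistry tooling: `propagate_publications` and `min_prevalence` are hypothetical names, and the toy `registry` dict only mimics the shape used in the analysis above (a `publications` list of dicts with `pubmed`/`doi`/`url` keys). Publications found in at least a chosen fraction of a family's prefixes are copied to the members that lack them, and every addition is returned so it can be reviewed afterwards.

```python
from collections import defaultdict

KEY_PRIORITY = ("pubmed", "doi", "url")


def pub_key(pub):
    """Return the first available identifier for a publication, or None."""
    for key in KEY_PRIORITY:
        if pub.get(key):
            return pub[key]
    return None


def propagate_publications(registry, family, min_prevalence=0.5):
    """Copy widely shared publications to family members that lack them.

    Hypothetical helper: mutates ``registry`` in place and returns a list of
    (prefix, publication key) pairs, one per addition, for manual QC.
    """
    members_by_pub = defaultdict(set)
    pub_by_key = {}
    for prefix in family:
        for pub in registry[prefix].get("publications", []):
            key = pub_key(pub)
            if key is None:
                continue  # cannot be matched across prefixes
            members_by_pub[key].add(prefix)
            pub_by_key.setdefault(key, pub)

    added = []
    for key, members in members_by_pub.items():
        if len(members) / len(family) < min_prevalence:
            continue  # treated as specific to a single semantic space
        for prefix in family:
            if prefix not in members:
                pubs = registry[prefix].setdefault("publications", [])
                pubs.append(dict(pub_by_key[key]))
                added.append((prefix, key))
    return added


# Toy data, not real registry content
registry = {
    "x.a": {"publications": [{"pubmed": "1"}, {"doi": "10.1/only-a"}]},
    "x.b": {"publications": [{"pubmed": "1"}]},
    "x.c": {},
}
added = propagate_publications(registry, ["x.a", "x.b", "x.c"])
print(added)
```

The threshold is an assumption made for illustration; a stricter value would copy less and leave more for manual curation.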

cthoyt commented 1 week ago

See also the part_of field, described at https://biopragmatics.github.io/bioregistry/datamodel/#part-of where the instances you described plus additional ones have been curated. You can get a quick overview on https://bioregistry.io/highlights/relations
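
As a rough illustration of how `part_of` could replace the dotted-prefix heuristic from the analysis above, the sketch below groups prefixes by their `part_of` value. It assumes each record stores `part_of` as a single string naming the parent, which should be checked against the data model page; `families_by_part_of` and the toy records are hypothetical.

```python
from collections import defaultdict


def families_by_part_of(registry):
    """Group prefixes under the parent named in their ``part_of`` field."""
    families = defaultdict(list)
    for prefix, data in registry.items():
        parent = data.get("part_of")
        if parent:
            families[parent].append(prefix)
    return dict(families)


# Toy records, not real registry content
registry = {
    "pubchem.compound": {"part_of": "pubchem"},
    "pubchem.substance": {"part_of": "pubchem"},
    "pfam": {},
    "pfam.clan": {"part_of": "pfam"},
}
families = families_by_part_of(registry)
# Parents that are referenced but are not themselves registered prefixes
rootless = sorted(parent for parent in families if parent not in registry)
print(families)
print(rootless)
```

Grouping this way relies on curated relations rather than on how a prefix happens to be spelled, so it would also pick up families whose members do not share a dotted root.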

More generally, I've considered how to share information between related resources (not just publications, but also homepage, contact person, etc.). I'm open to suggestions, but a curation script that copies publications would be a quick and dirty solution.