Open bgyori opened 1 week ago
See the addition to the curation guidelines in https://github.com/biopragmatics/bioregistry/pull/1217
Thanks @cthoyt! In addition to thinking about this problem (which is more general than just related to publications) for new prefixes, I think there are important implications on existing prefixes, since this problem is pretty pervasive. I was curious to see empirically what the situation looks like so I wrote a bit of code to check, using the exported registry so that external mappings' effects are accounted for:
```python
from collections import defaultdict

import matplotlib.pyplot as plt
import requests

# Load the exported registry (includes the effects of external mappings)
res = requests.get('https://raw.githubusercontent.com/biopragmatics/bioregistry/'
                   'refs/heads/main/exports/registry/registry.json')
registry = res.json()

# Organizing prefixes by root
prefixes_by_root = defaultdict(list)
for prefix, data in registry.items():
    prefix_parts = prefix.split('.')
    # A prefix without a dot is its own root; otherwise drop the last part
    prefix_root = prefix_parts[0] if len(prefix_parts) == 1 else '.'.join(prefix_parts[:-1])
    prefixes_by_root[prefix_root].append(prefix)

# Quantifying families with shared root
families = {root: prefixes for root, prefixes in prefixes_by_root.items()
            if len(prefixes) > 1}
nprefixes = len(registry)
nfamilies = len(families)
nprefixes_in_families = sum(len(prefixes) for prefixes in families.values())
print(f'Total prefixes: {nprefixes}, out of which {nprefixes_in_families} '
      f'prefixes are in a total of {nfamilies} families.')
# A family has a root if the root itself is a registered prefix
nrooted = sum(1 for root, prefixes in families.items() if root in prefixes)
print(f'Out of {nfamilies} families, {nrooted} have a root, '
      f'the remaining {nfamilies - nrooted} are rootless.')

# Analyzing publications
pub_prevalences = []
key_priority = ['pubmed', 'doi', 'url']


def get_pub_key(pub):
    """Return the first available identifier of a publication, by priority."""
    for key in key_priority:
        if key in pub:
            return pub[key]


for root, prefixes in families.items():
    prefixes_by_pub = defaultdict(list)
    for prefix in prefixes:
        for publication in registry[prefix].get('publications', []):
            prefixes_by_pub[get_pub_key(publication)].append(prefix)
    # Fraction of the family's prefixes that list each publication
    for pub, prefixes_for_pub in prefixes_by_pub.items():
        pub_prevalence = len(prefixes_for_pub) / len(prefixes)
        pub_prevalences.append(pub_prevalence)

_ = plt.boxplot(pub_prevalences)
```
which produces
Total prefixes: 1802, out of which 505 prefixes are in a total of 163 families.
Out of 163 families, 67 have a root, the remaining 96 are rootless.
This means that a really large proportion (~28%) of prefixes are in a "family" of this type. Though these choices are often well justified based on the primary focus of a database, there is no clear pattern in terms of which families have a "root" prefix (e.g., we have `pfam` and `pfam.clan` in the `pfam` family but we don't have a `pubchem` root). Per the box plot above, publications are curated in a pretty ad-hoc way for existing prefixes in a family. It's likely that a small number of these publications are prefix-specific but most of them should be shared across the family.
I wonder if it would be worth trying to standardize some of this automatically (with post-hoc quality control/curation) for publications, and potentially for other metadata with a similar status.
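A minimal sketch of what such an automated pass could look like, assuming records shaped like the exported registry (a `publications` list of dicts keyed by `pubmed`, `doi` or `url`). The function name `suggest_shared_publications` and the toy data are hypothetical and not part of the Bioregistry code. The sketch only proposes candidates for a curator to review, since some publications may really be prefix-specific.

```python
from collections import defaultdict

KEY_PRIORITY = ['pubmed', 'doi', 'url']


def pub_key(pub):
    """Return the first available identifier of a publication dict, else None."""
    for key in KEY_PRIORITY:
        if key in pub:
            return pub[key]
    return None


def suggest_shared_publications(registry, family):
    """Map each prefix to publication keys that other family members list but it lacks."""
    all_keys = set()
    keys_by_prefix = defaultdict(set)
    for prefix in family:
        for pub in registry[prefix].get('publications', []):
            key = pub_key(pub)
            if key is None:
                continue  # no usable identifier, skip
            all_keys.add(key)
            keys_by_prefix[prefix].add(key)
    suggestions = {}
    for prefix in family:
        missing = sorted(all_keys - keys_by_prefix[prefix])
        if missing:
            suggestions[prefix] = missing
    return suggestions


# Toy data, made up for illustration (not taken from the registry)
toy_registry = {
    'db.a': {'publications': [{'pubmed': '111'}, {'doi': '10.1/x'}]},
    'db.b': {'publications': [{'pubmed': '111'}]},
    'db.c': {},
}
suggestions = suggest_shared_publications(toy_registry, ['db.a', 'db.b', 'db.c'])
print(suggestions)
```

Returning suggestions rather than editing records in place keeps the quality-control step explicit: nothing is copied until someone confirms the publication applies to the whole family.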
See also the `part_of` field, described at https://biopragmatics.github.io/bioregistry/datamodel/#part-of, where the instances you described plus additional ones have been curated. You can get a quick overview on https://bioregistry.io/highlights/relations

More generally, I've considered how to share information between related resources (not just publications, but also homepage, contact person, etc.). I'm open to suggestions, but a curation script that copies publications would be a quick and dirty solution.
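As a rough illustration of the quick and dirty route, the sketch below lets a record fall back to its `part_of` parent for fields it does not curate itself. It assumes each record is a plain dict with an optional `part_of` key holding the parent prefix. The helper name `inherit_from_parent`, the choice of `SHARED_FIELDS` and the toy data are all hypothetical.

```python
import copy

# Fields assumed to be database-wide (an illustrative choice, not an agreed policy)
SHARED_FIELDS = ('publications', 'homepage', 'contact')


def inherit_from_parent(registry, shared_fields=SHARED_FIELDS):
    """Return a copy of the registry where records fill missing shared fields
    from their direct ``part_of`` parent (one level only, not transitive)."""
    result = copy.deepcopy(registry)
    for prefix, record in result.items():
        parent = registry.get(record.get('part_of'))
        if parent is None:
            continue  # no parent declared, or parent not in the registry
        for field in shared_fields:
            # Never overwrite a value that was curated on the child itself
            if field not in record and field in parent:
                record[field] = copy.deepcopy(parent[field])
    return result


# Toy data, made up for illustration
toy_registry = {
    'db': {'homepage': 'https://example.org', 'publications': [{'pubmed': '111'}]},
    'db.part': {'part_of': 'db', 'homepage': 'https://example.org/part'},
}
filled = inherit_from_parent(toy_registry)
print(filled['db.part'])
```

Note that this only helps families that have a parent record to inherit from, so rootless families would need a different source for the shared values.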
We have run into this issue in a couple of different settings now so we might want to discuss a general approach to the problem of metadata curated across semantic spaces for a given database.
Example: `pubchem`.

When a resource like `pubchem` is subdivided into multiple semantic spaces including `pubchem.substance` or `pubchem.compound`, these prefixes each contain some combination of information that is specific to the semantic space (`uri_format`, `example`, `pattern`) and other information that is almost always shared across the whole database (`publications`, `homepage`, `contact`, etc.).

There is also the question of mappings: some mappings are specific to the semantic space (e.g., `pubchem.compound`'s mapping to `miriam`'s `pubchem.compound` or to `n2t`'s `pubchem.compound`) while other mappings are to records covering the whole database (e.g., `pubchem.compound`'s mapping to `fairsharing`'s `FAIRsharing.qt3w7z`).

Currently, the data that is generic for the entire database is not propagated in a predictable way into semantic space-specific prefixes. The fact that some data is surfaced via prioritized mappings makes the situation more complicated. For example, we have publications on `pubchem.compound` that are specifically about PubChem's BioAssay subset, for which there is a dedicated prefix at `pubchem.bioassay`, but that record again only refers to the same 1 publication that `pubchem.substance` does.

So the main question of this issue is: should this type of data be standardized based on its shared vs subspace-specific status? If so, what should be the general policy for this?
This is relevant for e.g., #1214 and #1204 and many existing records.
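To make the shared vs subspace-specific question concrete, here is a small sketch of an audit that flags fields expected to be shared but whose values differ within a family. The split into `SHARED_FIELDS` follows the examples named above and is only an assumption about what a policy might say. The function name `audit_family` and the toy records are hypothetical.

```python
# Fields treated as database-wide in this sketch; uri_format, example and
# pattern are treated as subspace-specific and are therefore not compared.
SHARED_FIELDS = {'publications', 'homepage', 'contact'}


def audit_family(registry, family):
    """Return the shared fields whose values are not identical across the family.

    A missing field counts as a value (None), so a field curated on only
    some members of the family is reported as divergent.
    """
    divergent = []
    for field in sorted(SHARED_FIELDS):
        values = [registry[prefix].get(field) for prefix in family]
        if any(value != values[0] for value in values[1:]):
            divergent.append(field)
    return divergent


# Toy data, made up for illustration
toy_registry = {
    'db.a': {'homepage': 'https://example.org', 'contact': {'name': 'A'}, 'example': '1'},
    'db.b': {'homepage': 'https://example.org', 'example': 'X2'},
}
divergent = audit_family(toy_registry, ['db.a', 'db.b'])
print(divergent)
```

The comparison is deliberately strict (list order matters for `publications`), so it would over-report in practice. It is meant as a starting point for discussing the policy, not as a finished check.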