I looked at the metadata that's been assembled (https://github.com/biosimulations/biosimulations-physiome/blob/dev/projects.json). It looks pretty good. A few attributes need to be transformed for BioSimulators-utils, and a couple of additional pieces of information could be scraped (Markdown formatting for the description, the license, and the timestamp from the Git commit).
[x] identifier:
Because of the two namespaces (e and exposure), I think the identifiers need to be e/xxx or exposure/xxxxxxxxxxxxxx.
BioSimulators-utils expects the key to be identifiers rather than identifier.
BioSimulators-utils expects the value to be a list of dictionaries with keys
uri: http://identifiers.org/pmr:e/xxx
label: pmr:e/xxx
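A minimal sketch of that identifier transformation. The function name `to_identifiers` is hypothetical, and it assumes the value in projects.json already carries its namespace (e/... or exposure/...):

```python
def to_identifiers(identifier):
    """Wrap a PMR identifier such as 'e/3fd' in a list of uri/label dictionaries."""
    # Assumption: the identifier is already prefixed with its namespace.
    if not identifier.startswith(('e/', 'exposure/')):
        raise ValueError(f'identifier must start with e/ or exposure/: {identifier!r}')
    return [{
        'uri': f'http://identifiers.org/pmr:{identifier}',
        'label': f'pmr:{identifier}',
    }]


print(to_identifiers('e/3fd'))
```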
[x] hash: FYI, BioSimulators-utils will ignore this. The hash could be encoded into the source. See next bullet.
[x] source:
BioSimulators-utils expects this key to be sources rather than source.
BioSimulators-utils expects the value to be a dictionary with two keys
uri: http://identifiers.org/pmr.workspace:35f/@@file/81ef7ed4cf06f0cd4b87da239d282fc559738796 (I think the hash could be encoded here; I just requested an identifiers.org prefix for this)
label: pmr.workspace:35f/@@file/81ef7ed4cf06f0cd4b87da239d282fc559738796
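Roughly, the hash could be folded into the source like this. This is only a sketch: `to_source` is a hypothetical helper, it assumes the workspace id and the hash are available as separate values, and the pmr.workspace prefix has only been requested from identifiers.org, so the URI may not resolve yet.

```python
def to_source(workspace_id, file_hash):
    """Build a uri/label dictionary that encodes the hash into the source."""
    # Assumption: workspace_id (e.g. '35f') and file_hash are stored separately.
    label = f'pmr.workspace:{workspace_id}/@@file/{file_hash}'
    return {'uri': f'http://identifiers.org/{label}', 'label': label}


print(to_source('35f', '81ef7ed4cf06f0cd4b87da239d282fc559738796'))
```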
title: Looks good.
[x] description: Markdown can be used to capture the formatting. Here's a sketch of how it can be done:
    import bs4
    import markdownify
    import requests

    # Fetch the exposure page and fail loudly on HTTP errors
    response = requests.get('https://models.physiomeproject.org/e/3fd')
    response.raise_for_status()
    html = response.content

    # Name the parser explicitly so bs4 doesn't have to guess one
    doc = bs4.BeautifulSoup(html, 'html.parser')
    content_core = doc.find(id='content-core').find('div')

    # Drop figure tables and images before converting to Markdown
    for table in content_core.find_all(class_='tmp-doc-informalfigure table'):
        table.decompose()
    for image in content_core.find_all('img'):
        image.decompose()

    description = markdownify.MarkdownConverter().convert_soup(content_core).strip()
[x] summary: empty strings ("") should be converted to null
[x] thumbnails: For BioSimulators-utils, the thumbnails will need to be downloaded and the values of the thumbnails attribute will need to be converted to a path within the COMBINE archive (i.e. strip off everything up to the identifier)
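A sketch of the path conversion only (downloading is not shown). It assumes the thumbnail values are URLs that contain the identifier, and it reads "strip off everything up to the identifier" as stripping through the identifier itself. The helper name and the example file name are hypothetical.

```python
def thumbnail_archive_path(url, identifier):
    """Return the part of a thumbnail URL that follows the identifier."""
    # Match the identifier as a whole path segment, e.g. '/e/3fd/'.
    marker = '/' + identifier.strip('/') + '/'
    _, found, rest = url.partition(marker)
    if not found or not rest:
        raise ValueError(f'{identifier!r} not found in {url!r}')
    return rest


print(thumbnail_archive_path('https://models.physiomeproject.org/e/3fd/figure.png', 'e/3fd'))
```

If the archive should keep the identifier as a leading directory, the function would return `marker.lstrip('/') + rest` instead.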
[x] tags:
BioSimulators-utils expects this key to be keywords rather than tags
BioSimulators-utils expects the value of this key to be a list of strings
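A small sketch of the tags to keywords conversion. I haven't assumed a particular input shape: the hypothetical `to_keywords` accepts either a comma-separated string or a list, and dropping blanks and duplicates is my own addition rather than a BioSimulators-utils requirement.

```python
def to_keywords(tags):
    """Normalize tags into a list of unique, non-empty strings."""
    if tags is None:
        return []
    if isinstance(tags, str):
        # Assumption: a single string holds comma-separated tags.
        tags = tags.split(',')
    keywords = []
    for tag in tags:
        tag = str(tag).strip()
        if tag and tag not in keywords:
            keywords.append(tag)
    return keywords


print(to_keywords(' signal transduction, metabolism ,,metabolism'))
```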
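[x] citation:
BioSimulators-utils expects this key to be references rather than citation.
BioSimulators-utils expects the value to be a list of dictionaries with keys
uri: http://identifiers.org/pubmed/19486676
label: An integrated model of eicosanoid metabolism and signaling based on lipidomics flux analysis. ...
This method can be used to look up more complete metadata and generate a human-readable label for references: https://github.com/biosimulators/Biosimulators_utils/blob/f8370913679828b6a45dad047123c0ab84a6f43d/biosimulators_utils/ref/utils.py#L23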
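A sketch of the citation to references conversion, assuming each citation can be reduced to a PubMed id plus a human-readable label. `to_references` is a hypothetical helper, and the label is passed in rather than looked up.

```python
def to_references(pubmed_id, label):
    """Wrap one PubMed citation in a list of uri/label dictionaries."""
    # Assumption: citations are identified by PubMed ids.
    return [{
        'uri': f'http://identifiers.org/pubmed/{pubmed_id}',
        'label': label,
    }]


print(to_references('19486676', 'An integrated model of eicosanoid metabolism and signaling based on lipidomics flux analysis. ...'))
```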
[x] authors: BioSimulators-utils expects authors to be a list of dictionaries with keys
uri: null (preferably this would be ORCIDs, but we don't know these)
label: e.g., Geoffrey Nunns
[x] contributors: BioSimulators-utils expects contributors to be a list of dictionaries with keys
uri: http://identifiers.org/orcid:0000-0001-5801-5510 (preferably an ORCID; another URI, such as your personal website or GitHub profile, is fine too)
label: Bilal Shaikh
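Authors and contributors share the same shape, so one helper could cover both. This is a sketch assuming the names are available as plain strings; `to_people` and its optional name-to-URI mapping are hypothetical.

```python
def to_people(names, uris=None):
    """Turn a list of names into uri/label dictionaries; uri is None when unknown."""
    uris = uris or {}
    return [{'uri': uris.get(name), 'label': name} for name in names]


authors = to_people(['Geoffrey Nunns'])
contributors = to_people(
    ['Bilal Shaikh'],
    {'Bilal Shaikh': 'http://identifiers.org/orcid:0000-0001-5801-5510'},
)
print(authors, contributors)
```

A uri of None serializes to null in JSON, which matches the authors case where ORCIDs aren't known.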
[x] license:
I think we should scrape this from the web pages because people care about preserving license information.
Most, but not all, models are licensed CC BY 3.0. Some pages say "The terms of use/license for this work is unspecified." (i.e., "license": null). Some pages don't say anything about licenses, which I guess we can also interpret as "license": null.
I think you could scrape this text from each model, calculate the set of unique strings (possibly as few as 2), and then assign each to the appropriate SPDX id.
BioSimulators-utils expects the license to be captured as a dictionary with two keys
uri: http://identifiers.org/spdx:CC-BY-3.0
label: CC BY 3.0
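A sketch of the mapping step, once the license text has been scraped. The key in `SPDX_BY_TEXT` is a hypothetical example of what a page might say; the real keys would come from the set of unique strings actually found. All names here are illustrative.

```python
UNSPECIFIED = 'The terms of use/license for this work is unspecified.'

# Hypothetical scraped text mapped to (SPDX id, label); extend as new strings turn up.
SPDX_BY_TEXT = {
    'Creative Commons Attribution 3.0 Unported License': ('CC-BY-3.0', 'CC BY 3.0'),
}


def normalize(text):
    """Collapse whitespace so equivalent scraped strings compare equal."""
    return ' '.join((text or '').split())


def unique_license_texts(scraped_texts):
    """Return the set of distinct license strings that need an SPDX mapping."""
    return {normalize(text) for text in scraped_texts}


def to_license(scraped_text):
    """Map scraped license text to a uri/label dictionary, or None if unspecified."""
    text = normalize(scraped_text)
    if not text or text == UNSPECIFIED:
        return None
    spdx_id, label = SPDX_BY_TEXT[text]  # a KeyError flags a string still to be mapped
    return {'uri': f'http://identifiers.org/spdx:{spdx_id}', 'label': label}


print(to_license('Creative Commons Attribution 3.0 Unported License'))
```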
[ ] created: I think this could be set equal to the timestamp for the git commit.
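A sketch of reading the commit timestamp, assuming the workspace is cloned locally and git is on the PATH. The function names are hypothetical, and formatting the value as an ISO 8601 UTC string is my assumption, not a confirmed BioSimulators-utils requirement.

```python
import subprocess
from datetime import datetime, timezone


def commit_epoch(repo_dir, rev='HEAD'):
    """Return the committer date of a revision as Unix seconds."""
    result = subprocess.run(
        ['git', '-C', repo_dir, 'show', '-s', '--format=%ct', rev],
        check=True, capture_output=True, text=True,
    )
    return int(result.stdout.strip())


def to_created(epoch_seconds):
    """Format Unix seconds as an ISO 8601 timestamp in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Usage, given a local clone: created = to_created(commit_epoch('path/to/workspace'))
```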