Open sneumann opened 2 years ago
There is more information in https://github.com/BioSchemas/specifications/discussions/579 Yours, Steffen
We have a RecordExporter command line tool to convert from the MassBank text records to HTML like in the web application: https://github.com/MassBank/MassBank-web/blob/main/MassBank-Project/MassBank-lib/src/main/scripts/RecordExporter
To create the JSON dump we can either take the exported HTML and extract the <script type="application/ld+json"> element, or @meier-rene could even add a command-line switch --json-only to export only that.
Yours, Steffen
I love command line tools:
MassBank-web/MassBank-Project/MassBank-lib/target/MassBank-lib/MassBank-lib/bin/Inspector \
IPB_Halle/MSBNK-IPB_Halle-PB005803.txt /dev/stdout \
| xmllint -html --xpath 'string(//html/head/script[@type = "application/ld+json"]/text())' - 2> /dev/null
and of course also for the entire MassBank-data (not super-fast, though...)
find . -name "MSBNK-IPB_Halle-PB00048*.txt" -exec sh -c "/vol/massbank/src/MassBank-web/MassBank-Project/MassBank-lib/target/MassBank-lib/MassBank-lib/bin/Inspector {} /dev/stdout | xmllint -html --xpath 'string(//html/head/script[@type = \"application/ld+json\"]/text())' - 2> /dev/null " \; | jq -s 'add' >DataDump.json
jq is quite strict about proper JSON. SMILES with stereochemistry inside JSON can pose a problem:
"smiles": "C1CC2=C(C(=CC=C2)O)OC3=CC=CC(=C3)/C=C\C4=CC(=C(C(=C4)O)O)OC5C=CC1C=C5"
will give parse error: Invalid escape at line ...
as already mentioned by Tobias in https://github.com/MassBank/MassBank-web/issues/316#issuecomment-958766083
So to massage the output we need
sed -e 's#\\#\\\\#g' DataDump.json | jq -s 'add' >DataDump.jsonld
This assumes that the SMILES are the only place that has a \.
This might not be necessary after fixing #316
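A minimal sketch of the escaping step on its own, to show what the sed expression does to a single record. The function name escape_backslashes and the shortened SMILES are made up for illustration; the g flag is needed so that every backslash on a line is doubled, not just the first one.

```shell
# Hypothetical helper: double every backslash on stdin, so that a raw "\"
# inside a JSON string becomes the valid JSON escape "\\".
escape_backslashes() {
  sed -e 's#\\#\\\\#g'
}

# Made-up record with a stereo bond written as a single backslash
# (not valid JSON as it stands).
printf '%s\n' '{"smiles": "C/C=C\C"}' | escape_backslashes
# prints: {"smiles": "C/C=C\\C"}
```

Note that this blindly doubles all backslashes, so it would also break escapes that are already valid (for example \" or \n). It is only safe under the assumption stated above that SMILES are the only place where a backslash occurs.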
Thanks to @meier-rene we now have a DataDump created via
MassBank-web/MassBank-Project/MassBank-lib/target/MassBank-lib/MassBank-lib/bin/Msbnk2JSONLD -o MassBank-2006.06.jsonld $(ls -d MassBank-data/* )
A sample is available from https://msbi.ipb-halle.de/~sneumann/MassBank-2006.06.jsonld Yours, Steffen
Now that we know how to create a Data Feed, we need to serve it.
According to https://schema.org/docs/feeds.html this goes into /.well-known/feeddata-general with the option to split if it becomes too large.
That data feed should be created upon data import.
We should also find a way to express the MassBank-data version (git commit hash and/or the release) of a DataFeed.
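One possible way to carry the version, sketched here only as a starting point for discussion: schema.org's DataFeed inherits the version property from CreativeWork and identifier from Thing, so the release could go into version and the commit hash into identifier. That mapping is an assumption, not something the feed documentation prescribes, and the function name feed_header and the example values are hypothetical.

```shell
# Hypothetical helper: print a schema.org DataFeed header that records
# the release name (as "version") and the git commit (as "identifier").
# Which properties should carry these values is an open question.
feed_header() {
  release="$1"
  commit="$2"
  printf '{"@context":"https://schema.org","@type":"DataFeed","version":"%s","identifier":"%s"}\n' \
    "$release" "$commit"
}

# Placeholder values; in a real run the commit could come from
#   git -C MassBank-data rev-parse HEAD
feed_header "example-release" "0123abc"
```

The records themselves would then go into dataFeedElement of the same object.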
/.well-known/ is located in the Apache root to maintain the Let's Encrypt challenge. The feeddata-general would be created in the TomEE root. Once implemented, we can redirect the request to the TomEE root.
The DataDump is now created as part of the release process in https://github.com/MassBank/MassBank-web/blob/18d97bd21632e0c5c521ec4b19f7099200007807/create-releasefiles.sh#L13 and the files go to the GitHub releases: https://github.com/MassBank/MassBank-data/releases/latest/download/MassBank.json
We do not have that file in the running MassBank server yet. Yours, Steffen
Hi, the 2022 Biohackathon has project 23 to consume schema.org DataFeeds. @albangaignard or @AlasdairGray could point us to what's needed to provide our existing schema markup in such a form. Yours, Steffen