MassBank / MassBank-web

The web server application and directly connected components for a MassBank web server

Provide a MassBank schema.org DataFeed #357

Open sneumann opened 2 years ago

sneumann commented 2 years ago

Hi, project 23 at the 2022 BioHackathon aims to consume schema.org DataFeeds. Perhaps @albangaignard or @AlasdairGray could point us to what is needed to provide our existing schema markup in that form. Yours, Steffen

sneumann commented 2 years ago

There is more information in https://github.com/BioSchemas/specifications/discussions/579 Yours, Steffen

sneumann commented 2 years ago

We have a RecordExporter command line tool to convert from the MassBank text records to HTML like in the web application: https://github.com/MassBank/MassBank-web/blob/main/MassBank-Project/MassBank-lib/src/main/scripts/RecordExporter

To create the JSON dump, we can either take the exported HTML and extract the content of the <script type="application/ld+json"> element, or @meier-rene could even add a command-line switch such as --json-only that exports only that part.

Yours, Steffen

sneumann commented 2 years ago

I love command line tools:

MassBank-web/MassBank-Project/MassBank-lib/target/MassBank-lib/MassBank-lib/bin/Inspector  \
  IPB_Halle/MSBNK-IPB_Halle-PB005803.txt /dev/stdout \
  | xmllint -html --xpath 'string(//html/head/script[@type = "application/ld+json"]/text())' - 2> /dev/null

and of course also for the entire MassBank-data (not super-fast, though...)

find . -name "MSBNK-IPB_Halle-PB00048*.txt" -exec sh -c "/vol/massbank/src/MassBank-web/MassBank-Project/MassBank-lib/target/MassBank-lib/MassBank-lib/bin/Inspector  {} /dev/stdout | xmllint -html --xpath 'string(//html/head/script[@type = \"application/ld+json\"]/text())' - 2> /dev/null " \; | jq -s 'add' >DataDump.json

sneumann commented 2 years ago

jq is strict about valid JSON. SMILES with stereochemistry can be a problem inside JSON, because the backslash that marks a directional bond is not a valid JSON escape sequence. For example,

"smiles": "C1CC2=C(C(=CC=C2)O)OC3=CC=CC(=C3)/C=C\C4=CC(=C(C(=C4)O)O)OC5C=CC1C=C5"

gives "parse error: Invalid escape at line ...", as already mentioned by Tobias in https://github.com/MassBank/MassBank-web/issues/316#issuecomment-958766083

So to massage the output we need

cat DataDump.json | sed -e 's#\\#\\\\#g' | jq -s 'add' >DataDump.jsonld

assuming that the SMILES are the only place that contains a \. The g flag makes sed escape every backslash on a line, not only the first one. This might not be necessary after fixing #316.
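The escaping step can be checked on a single line without jq. This is a minimal sketch, assuming (as stated above) that backslashes occur only inside SMILES strings; the function name escape_backslashes and the sample record are made up for illustration. Without the g flag, sed replaces only the first match per line, so the sketch uses g.

```shell
# Escape every backslash on every input line so that a JSON parser accepts
# SMILES directional bonds such as /C=C\C.
# Assumption: backslashes appear only in SMILES strings, so no valid JSON
# escape (for example \" or \n) gets doubled by accident.
escape_backslashes() {
  sed -e 's#\\#\\\\#g'
}

# Hypothetical one-line record with two backslashes in the SMILES.
printf '%s\n' '{"smiles": "C\C=C/C=C\C"}' | escape_backslashes
```

If other fields ever contain legitimate JSON escapes, this blanket replacement would corrupt them, so it is only a stopgap until the records are emitted with proper escaping.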

sneumann commented 1 year ago

Thanks to @meier-rene we now have a DataDump created via

MassBank-web/MassBank-Project/MassBank-lib/target/MassBank-lib/MassBank-lib/bin/Msbnk2JSONLD -o MassBank-2006.06.jsonld $(ls -d MassBank-data/* )

A sample is available from https://msbi.ipb-halle.de/~sneumann/MassBank-2006.06.jsonld Yours, Steffen

sneumann commented 1 year ago

Now that we know how to create a Data Feed, we need to serve it. According to https://schema.org/docs/feeds.html this goes into /.well-known/feeddata-general with the option to split if it becomes too large. That data feed should be created upon data import.
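A rough sketch of the serving step could look like the following. It assumes that the dump is a single JSON array of schema.org items and that wrapping it in a DataFeed via dataFeedElement is what consumers expect; neither is confirmed in this thread. The function name wrap_as_datafeed, the temporary directory standing in for the web root, and the sample record are all hypothetical.

```shell
# Wrap a JSON array of schema.org items into a DataFeed document.
# $1: file containing a JSON array (assumed shape of the dump)
# $2: output file
wrap_as_datafeed() {
  {
    printf '{\n'
    printf '  "@context": "https://schema.org",\n'
    printf '  "@type": "DataFeed",\n'
    printf '  "dataFeedElement": '
    cat "$1"
    printf '}\n'
  } > "$2"
}

# Demo with a temporary directory as a stand-in for the real web root.
webroot=$(mktemp -d)
mkdir -p "$webroot/.well-known"

# Hypothetical one-record dump; the real input would be the release dump.
printf '%s\n' '[{"@type": "Dataset", "name": "example record"}]' > "$webroot/dump.json"

wrap_as_datafeed "$webroot/dump.json" "$webroot/.well-known/feeddata-general"
cat "$webroot/.well-known/feeddata-general"
```

Running this as part of the data import would keep the feed in step with the loaded records. Splitting a large feed is not covered here.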

sneumann commented 1 year ago

We should also find a way to express the MassBank-data version (git commit hash and/or the release) of a DataFeed.
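One possible way to carry that information is sketched below. It uses the schema.org properties version (for the release) and identifier (for the commit hash). Choosing these two properties is an assumption, not something agreed in this thread, and the function name datafeed_with_version and the sample values are hypothetical.

```shell
# Print a DataFeed skeleton that records which MassBank-data state it describes.
# $1: release name, $2: git commit hash of the MassBank-data checkout
# Assumption: "version" holds the release and "identifier" holds the commit.
datafeed_with_version() {
  printf '{\n'
  printf '  "@context": "https://schema.org",\n'
  printf '  "@type": "DataFeed",\n'
  printf '  "version": "%s",\n' "$1"
  printf '  "identifier": "%s",\n' "$2"
  printf '  "dataFeedElement": []\n'
  printf '}\n'
}

# Placeholder values; in a real checkout they could come from, for example,
#   git -C MassBank-data describe --tags
#   git -C MassBank-data rev-parse HEAD
datafeed_with_version "EXAMPLE-RELEASE" "0123abc"
```

The empty dataFeedElement list only keeps the sketch valid JSON; the actual records would go there.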

tsufz commented 1 year ago

/.well-known/ is located in the Apache root so that the Let's Encrypt challenge keeps working. The feeddata-general file would be created in the TomEE root. Once that is implemented, we can redirect the request from Apache to the TomEE root.

sneumann commented 1 year ago

The DataDump is now created as part of the release process (https://github.com/MassBank/MassBank-web/blob/18d97bd21632e0c5c521ec4b19f7099200007807/create-releasefiles.sh#L13), and the files go to the GitHub releases (https://github.com/MassBank/MassBank-data/releases/latest/download/MassBank.json). We do not have that file on the running MassBank server yet. Yours, Steffen