ElixirTeSS / TeSS_scrapers

TeSS HTML page scrapers in Ruby looking for training resources and events metadata.

Bioschemas parser in Javascript #71

Open njall opened 5 years ago

njall commented 5 years ago

Create a script that parses Bioschemas content.

Some of our content providers mark up their event content using the Bioschemas specifications. Bioschemas markup can be expressed as JSON-LD, RDFa, or Microdata. Focus on JSON-LD for this exercise; if you have time later, explore the other formats, but no worries if not.

The Bioschemas Event specification is available in YAML format at https://github.com/BioSchemas/specifications/blob/master/Event/specification.html

Write a program that:

  1. Parses this file and takes everything under the properties: key in the YAML specification.
  2. Goes through and collects each property name.
  3. Downloads the schema.org spec for each of the expected types, e.g. if expected_type includes PostalAddress, you need to parse schema.org/PostalAddress.jsonld to get the properties of this subtype.
  4. Downloads a target Bioschemas web page.
  5. Parses the JSON-LD, maybe using a parser such as this: https://www.npmjs.com/package/@rdfjs/parser-jsonld
  6. Extracts all the properties that match the ones you've collected from the YAML.
  7. Extracts any sub-properties.
  8. Pushes these properties to TeSS using the TeSS API Client.
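
Steps 5 to 7 could be sketched roughly as follows, using no dependencies. This is only an illustration under several assumptions: the page is assumed to embed its JSON-LD in `<script type="application/ld+json">` tags, the property names are assumed to have been collected already into a plain object keyed by type, and a regular expression stands in for a real HTML or RDF parser. The names `extractJsonLd`, `pickProperties` and `allowed`, and the sample page, are made up for this example.

```javascript
// Hypothetical helper: pull every JSON-LD block out of an HTML string.
// A regex stands in for a real HTML parser here, for brevity only.
function extractJsonLd(html) {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch (err) {
      // Skip blocks that are not valid JSON rather than aborting the run.
    }
  }
  return blocks;
}

// Hypothetical helper: keep only the properties listed for the node's @type.
// `allowed` maps a type name to the property names collected for that type.
function pickProperties(node, allowed) {
  const names = allowed[node["@type"]] || [];
  const picked = {};
  for (const name of names) {
    if (!(name in node)) continue;
    const value = node[name];
    // Recurse into nested typed objects (sub-properties, e.g. a PostalAddress).
    const isTypedObject =
      value !== null &&
      typeof value === "object" &&
      !Array.isArray(value) &&
      "@type" in value;
    picked[name] = isTypedObject ? pickProperties(value, allowed) : value;
  }
  return picked;
}

// Made-up sample page and property lists, for illustration only.
const html = `<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Event", "name": "Example course",
 "startDate": "2020-01-01", "unlisted": "ignored",
 "location": {"@type": "PostalAddress", "addressLocality": "Manchester",
              "faxNumber": "n/a"}}
</script></head></html>`;

const allowed = {
  Event: ["name", "startDate", "location"],
  PostalAddress: ["addressLocality"],
};

const events = extractJsonLd(html).map((node) => pickProperties(node, allowed));
console.log(JSON.stringify(events, null, 2));
```

Keeping the allowed properties in a plain object means the same two functions would work for any other Bioschemas type once its property names have been collected. Arrays of nested objects and `@graph` documents are not handled in this sketch.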

Whilst implementing, think about how to make this as reusable as possible, e.g. the developer should only have to change the URL of the target page to run it elsewhere.
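
One way to approach the reusability point, as a sketch only: keep every setting in a single object with defaults, so the target page URL is the only required input. The `DEFAULTS` object, the `buildConfig` function, the `format` field and the example.org URL are all hypothetical; only the spec URL is taken from this issue.

```javascript
// Hypothetical defaults; only the spec URL comes from the issue text.
const DEFAULTS = {
  specUrl:
    "https://github.com/BioSchemas/specifications/blob/master/Event/specification.html",
  format: "json-ld",
};

// Build the settings for one run. The target page URL is the only required input.
function buildConfig(targetUrl, overrides = {}) {
  // new URL() throws a TypeError on malformed input, so bad URLs fail early.
  const target = new URL(targetUrl);
  return { ...DEFAULTS, ...overrides, targetUrl: target.href };
}

// Placeholder target page; swap in a real Bioschemas page to run elsewhere.
const config = buildConfig("https://example.org/events/1");
console.log(config.targetUrl);
```

Anything else, such as a different spec, can be passed through `overrides` without touching the rest of the script.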

Some target websites to test it against: