POLDER-Crew / polder-federated-search

A federated search project for POLDER.
BSD 3-Clause "New" or "Revised" License

Index DataStream #181

Open yemoski opened 1 year ago

yemoski commented 1 year ago
  1. Use the API to get polar data (they have a lot of Great Lakes stuff too). The API docs are here: https://github.com/datastreamapp/api-docs
  2. The dataset landing pages all seem to be good schema.org json-ld, so I think we'd be in business.
yemoski commented 1 year ago

I've asked the folks who maintain the JSON-LD processor that Gleaner uses about it: Possible bug converting to RDF · piprate/json-gold · Discussion #68

yemoski commented 1 year ago

It's a bug in the Gleaner miller, possibly the same one that I was seeing with the NSIDC.

An example jsonld dataset blob:

{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Dataset",
  "@id": "ad780d9a-e6a7-4a6f-824c-367b40d377b8",
  "name": "ACAP Saint John: Temperature Logger Data",
  "description": "ACAP Saint John’s Water Quality Monitoring Programs aim to build upon the community-based water monitoring program established in 1992 in the Greater Saint John Area. The data collected from these various monitoring programs is used to track stressors, determine restoration sites, strengthen/shape policy, and expand community partnerships. Water temperature is logged in various watercourses across the Greater Saint John using temperature logger deployed for months at a time.",
  "url": "https://datastream.org/dataset/ad780d9a-e6a7-4a6f-824c-367b40d377b8",
  "version": "2.0.0",
  "datePublished": "2019-02-04T22:19:58.129Z",
  "dateModified": "2022-04-30T15:48:42.840Z",
  "isAccessibleForFree": true,
  "keywords": "Harbour, Saint John, New Brunswick, Water temperature, Urban, Bay of Fundy",
  "license": "https://opendatacommons.org/licenses/by/1-0/",
  "citation": "ACAP Saint John. 2022-04-30. \"ACAP Saint John: Temperature Logger Data\" (dataset). 2.0.0. DataStream. https://doi.org/10.25976/12cf-1g82.",
  "identifier": {
    "@type": ["PropertyValue", "datacite:ResourceIdentifier"],
    "datacite:usesIdentifierSchema": {"@id": "datacite:doi"},
    "propertyID": "DOI",
    "url": "https://doi.org/10.25976/12cf-1g82",
    "value": "10.25976/12cf-1g82"
  },
  "temporalCoverage": "2016-08-04T13:54:00-03:00/2018-12-06T10:55:50-04:00",
  "spatialCoverage": {
    "@type": "Place",
    "geo": {
      "@type": "GeoShape",
      "box": "-66.19002 45.193191 -65.88223 45.40079"
    }
  },
  "measurementTechnique": "Water temperature is logged using either HOBO or EasyLog temperature loggers attached to cinder blocks and left in the watercourse for various time periods based on the program.",
  "variableMeasured": [],
  "creator": {
    "@type": "Organization",
    "name": "ACAP Saint John"
  },
  "publisher": {
    "@type": "Organization",
    "name": "DataStream",
    "url": "https://datastream.org",
    "logo": "https://datastream.org/favicon.svg"
  }
}

Gets turned into just

_:bcf4tc9a3cb2c73ao5p8g <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/Organization> .
_:bcf4tc9a3cb2c73ao5p8g <https://schema.org/name> "ACAP Saint John" .
_:bcf4tc9a3cb2c73ao5p90 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/PropertyValue> .
_:bcf4tc9a3cb2c73ao5p90 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <datacite:ResourceIdentifier> .
_:bcf4tc9a3cb2c73ao5p90 <datacite:usesIdentifierSchema> <datacite:doi> .
_:bcf4tc9a3cb2c73ao5p90 <https://schema.org/propertyID> "DOI" .
_:bcf4tc9a3cb2c73ao5p90 <https://schema.org/url> "https://doi.org/10.25976/12cf-1g82" .
_:bcf4tc9a3cb2c73ao5p90 <https://schema.org/value> "10.25976/12cf-1g82" .
_:bcf4tc9a3cb2c73ao5p9g <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/Organization> .
_:bcf4tc9a3cb2c73ao5p9g <https://schema.org/logo> "https://datastream.org/favicon.svg" .
_:bcf4tc9a3cb2c73ao5p9g <https://schema.org/name> "DataStream" .
_:bcf4tc9a3cb2c73ao5p9g <https://schema.org/url> "https://datastream.org" .
_:bcf4tc9a3cb2c73ao5pa0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/Place> .
_:bcf4tc9a3cb2c73ao5pa0 <https://schema.org/geo> _:bcf4tc9a3cb2c73ao5pag .
_:bcf4tc9a3cb2c73ao5pag <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/GeoShape> .
_:bcf4tc9a3cb2c73ao5pag <https://schema.org/box> "-66.19002 45.193191 -65.88223 45.40079" .

Only the organization information is in there. The dataset information itself is missing.
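
One thing that may be worth ruling out (an assumption, not a confirmed diagnosis for Gleaner): the top-level `@id` in the blob is a bare UUID rather than an absolute IRI. The JSON-LD to RDF algorithm skips nodes whose subject is not a well-formed absolute IRI, and if no base IRI is supplied during conversion that would leave only the blank-node triples. A minimal stdlib sketch that flags such `@id` values; `has_absolute_id` is a hypothetical helper, not part of Gleaner or json-gold:

```python
from urllib.parse import urlparse


def has_absolute_id(node: dict) -> bool:
    """Return True if the node's @id is absent, a blank node, or has an IRI scheme."""
    node_id = node.get("@id")
    if node_id is None:
        # No @id at all: the processor assigns a blank node, so nothing is dropped.
        return True
    if node_id.startswith("_:"):
        return True
    # A bare UUID has no scheme, so it is a relative IRI reference.
    return bool(urlparse(node_id).scheme)


if __name__ == "__main__":
    blob = {"@type": "Dataset", "@id": "ad780d9a-e6a7-4a6f-824c-367b40d377b8"}
    print(has_absolute_id(blob))
```

If the check fails for the DataStream pages, rewriting `@id` to the landing-page URL before conversion would be one way to test the theory.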

yemoski commented 1 year ago

Example landing page: Carrot River Watershed Association Monitoring Program

yemoski commented 1 year ago

For some reason we are getting the organizations but not the datasets themselves in the triplestore.

yemoski commented 1 year ago

New plan: use the DataStream script that Yemisi wrote to grab just the DOIs and build a sitemap that lists the DOI urls. They'll resolve to the landing pages, with JSON-LD in them.
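
A rough sketch of the sitemap step, assuming the DOIs have already been collected into a list. `build_sitemap` is a hypothetical name, not the existing script, and the two DOIs in the usage example are the ones that appear in this thread:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(dois):
    """Build a sitemap XML string with one <url><loc> entry per DOI."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for doi in dois:
        url = ET.SubElement(urlset, "url")
        # doi.org redirects each DOI to its landing page.
        ET.SubElement(url, "loc").text = f"https://doi.org/{doi}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap(["10.25976/12cf-1g82", "10.25976/vmet-ct64"]))
```

This relies on the crawler following the doi.org redirect to reach the JSON-LD on the landing page.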

yemoski commented 1 year ago

Buuuuut the API doesn't return schema.org json-ld! It returns its own metadata format that looks like this:

{
  "205": {
    "Id": null,
    "DOI": "10.25976/vmet-ct64",
    "Version": "4.0.0",
    "DatasetName": "LakeWatch Water Quality Data",
    "DataStewardEmail": "programs@alms.ca",
    "DataCollectionOrganization": "Alberta Lake Management Society",
    "DataUploadOrganization": "Alberta Lake Management Society",
    "ProgressCode": "onGoing",
    "MaintenanceFrequencyCode": "unknown",
    "Abstract": "The Alberta Lake Management Society accepts requests from citizen scientists across Alberta to have their lake monitored as part of the LakeWatch program. Volunteers often contact the Alberta Lake Management Society due to concerns around eutrophication, harmful algal blooms, watershed developments, biodiversity monitoring, and for the early detection of aquatic invasive species. If accepted into the program, a lake will be monitored 4-5 times throughout the open water season: once in June, once in July, twice in August, and once in September. ALMS hires and trains field technicians in proper sampling techniques and it is these field technicians who arrange the sampling trips with the citizen scientists. At the lake, the citizen scientist’s role is to transport the technicians around the lake on a boat and to assist with sampling. The field technicians provide all necessary sampling equipment, support the volunteers in training, and oversee sample preservation, handling, and shipment. This program is free of charge for individuals hoping to collect water quality data from their lake. This program is made possible with the support of various funders, including the Government of Alberta, and would not be possible without hundreds of hours of volunteer time by lake stewards.",
    "DataCollectionInformation": "Data is collected as both a profile and a composite sample. For profiles, Hydrolab multiparameter sondes are used to measure depth, temperature, conductivity, pH, dissolved oxygen, and redox potential at one location on the lake. Samples are recorded in increments of 0.5 or 1.0 meters. The multiparameter probes are calibrated on a weekly basis, with the exception of the RO meter which is calibrated monthly. At this location, a Secchi disk depth reading is also collected. The composite sample is comprised of water collected from ten locations around the lake. These samples are collected from the lake’s euphotic zone using a one-way foot-valve attached to weighted tubing. Samples collected using the composite method are ultimately poured off into various bottles to be tested for various parameters. Nutrients and general water chemistry are analyzed by Maxxam Analytics whereas total recoverable metals are analyzed by Innotech. Chlorophyll-a samples are filtered by field technicians before being sent for analysis.\n\nBiological data is also collected through the LakeWatch program. Biological data includes a zooplankton haul at the lake’s profile site, samples collected to detect spiny water flea and Dreissenid mussels, and a sample collected as part of the composite for phytoplankton taxonomy. \n\nA duplicate true-split total phosphorus sample is collected once per season at each lake to act as a quality control.",
    "DataProcessing": null,
    "FundingSources": null,
    "DataSourceURL": null,
    "OtherDataSources": null,
    "Citation": "Alberta Lake Management Society. 2022. \"LakeWatch Water Quality Data\" (dataset). 4.0.0. DataStream. https://doi.org/10.25976/vmet-ct64.",
    "Licence": "https://opendatacommons.org/licenses/by/1-0/",
    "Disclaimer": null,
    "TopicCategoryCode": ["inlandWaters"],
    "Keywords": ["Water", "Phosphorus", "Citizen Science", "Algae", "Chlorophyll"],
    "CreateTimestamp": "2022-04-30 15:48:48.852129+00"
  }
}
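
If we end up indexing from the API, each record would need converting to schema.org ourselves. A minimal sketch of that mapping, using only field names visible in the record above. The choice of which DataStream field maps to which schema.org property is my assumption (modelled on the landing-page blob earlier in this thread), and `to_schema_org` is a hypothetical helper:

```python
import json


def to_schema_org(record: dict) -> dict:
    """Map one DataStream metadata record to a schema.org Dataset (assumed mapping)."""
    doi_url = f"https://doi.org/{record['DOI']}"
    dataset = {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "@id": doi_url,  # absolute IRI so the node has a usable subject
        "name": record.get("DatasetName"),
        "description": record.get("Abstract"),
        "url": doi_url,
        "version": record.get("Version"),
        "license": record.get("Licence"),
        "citation": record.get("Citation"),
        "keywords": ", ".join(record.get("Keywords") or []),
        "measurementTechnique": record.get("DataCollectionInformation"),
        "creator": {
            "@type": "Organization",
            "name": record.get("DataCollectionOrganization"),
        },
        "publisher": {
            "@type": "Organization",
            "name": "DataStream",
            "url": "https://datastream.org",
        },
    }
    # Drop properties the API returned as null or empty.
    return {k: v for k, v in dataset.items() if v not in (None, "")}


if __name__ == "__main__":
    sample = {
        "DOI": "10.25976/vmet-ct64",
        "Version": "4.0.0",
        "DatasetName": "LakeWatch Water Quality Data",
        "DataCollectionOrganization": "Alberta Lake Management Society",
        "Licence": "https://opendatacommons.org/licenses/by/1-0/",
        "Keywords": ["Water", "Phosphorus"],
    }
    print(json.dumps(to_schema_org(sample), indent=2))
```

Temporal and spatial coverage are left out because the API record shown here does not carry them.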

yemoski commented 1 year ago

This now depends on "API-based indexing for Gleaner" (Done in Q4 (January - March) 2023).

yemoski commented 1 year ago

We could also get json from the API directly, which is cool. CCADI works the same way, and so does DataCite. If I can add an extension to Gleaner that can take an API key (DataStream requires one) and handle results, then we could do away with some of our sitemap building.
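
For reference, a sketch of what the keyed request might look like from Python. The base URL, the `x-api-key` header name, and the OData-style `$select`/`$top` options are assumptions on my part and should be checked against the api-docs repo; `build_metadata_request` is a hypothetical helper, and nothing here is Gleaner code:

```python
import urllib.parse
import urllib.request

# Assumed base URL; confirm against https://github.com/datastreamapp/api-docs
API_BASE = "https://api.datastream.org/v1/odata/v4"


def build_metadata_request(api_key, select=("DOI", "DatasetName"), top=100):
    """Build (but do not send) a request for dataset metadata using an API key."""
    query = urllib.parse.urlencode(
        {"$select": ",".join(select), "$top": top},
        safe="$,",  # keep OData option names and comma lists readable
    )
    return urllib.request.Request(
        f"{API_BASE}/Metadata?{query}",
        headers={"x-api-key": api_key, "Accept": "application/json"},
    )


if __name__ == "__main__":
    req = build_metadata_request("YOUR-API-KEY")
    print(req.full_url)
```

Sending it would be `urllib.request.urlopen(req)` followed by `json.load` on the response. Keeping the request construction separate makes it easy to test without a key.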