Unidata / thredds

THREDDS Data Server v4.6
https://www.unidata.ucar.edu/software/tds/v4.6/index.html

Develop search capability for data in multiple THREDDS catalogs #648

Open rsignell-usgs opened 7 years ago

rsignell-usgs commented 7 years ago

Here at the September Unidata Users Committee meeting, Unidata Director Mohan listed "Data Discoverability" as a major potential theme for the 2016 Strategic Plan. I agree this would be a great thing to work on, and Unidata is in a great position to do this because they already have many THREDDS servers out in the community serving data with ncISO services available to create ISO metadata. And there are many catalog services that can ingest ISO metadata and provide standardized CSW or OpenSearch catalog interfaces.

rsignell-usgs commented 7 years ago

Here's one approach that we use in IOOS.

lesserwhirls commented 7 years ago

I agree this is a great thing to work on. The approach here at Unidata has been that "search" (Data Discoverability) encompasses so much, and there are many experts in that area, and we are not among them. Rather than build yet another one-off solution, we decided to work with those experts to ensure that the TDS can provide the information they need to do their magic. From what I understand, ISO metadata has been the most useful. Now, it seems there are standard services that can ingest the ISO metadata and provide pretty nice search and discoverability capability.

It seems to me that we are at (or past) the point where we at Unidata should be reaching out to the community, as well as do some in-house evaluations, to see if there is a solution that we could recommend for use with the TDS.

One obvious solution would be pyCSW, which I know you've worked with. Do you think that would be a good place to start? Note that here I consider any brokering solutions, such as GI-CAT, to be a separate topic.

lesserwhirls commented 7 years ago

Ok, I think we should start by evaluating the IOOS workflow. Opinions?

rsignell-usgs commented 7 years ago

Here's an actual example that uses the harvester, a script that harvests datasets from http://thredds.ucar.edu/thredds/catalog.xml

https://gist.github.com/kwilcox/60b8a3e771987f96adf0c6b1e77ede24
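For readers who want to see what "harvesting datasets from a catalog" involves at the lowest level, here is a minimal standard-library sketch. It is not how thredds_crawler or the gist above is implemented; it only shows that a THREDDS catalog is an XML document whose `dataset` elements carry a `urlPath` and whose `catalogRef` elements point at nested catalogs. The function name `list_catalog_entries` and the sample catalog (including the `grib/EXAMPLE/Best` path) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by THREDDS InvCatalog 1.0 documents.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}
# catalogRef elements store their target in an xlink:href attribute.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def list_catalog_entries(catalog_xml):
    """Return (dataset urlPaths, nested catalog hrefs) found in one catalog document.

    Hypothetical helper for illustration; a real crawler would also fetch
    and recurse into each nested catalog.
    """
    root = ET.fromstring(catalog_xml)
    datasets = [d.get("urlPath") for d in root.iterfind(".//t:dataset", NS)
                if d.get("urlPath")]
    refs = [r.get(XLINK_HREF) for r in root.iterfind(".//t:catalogRef", NS)
            if r.get(XLINK_HREF)]
    return datasets, refs


# Made-up sample catalog, shaped like a THREDDS catalog.xml.
SAMPLE = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink" name="Sample">
  <dataset name="Forecast Models">
    <catalogRef xlink:href="idd/forecastModels.xml" xlink:title="Forecast Model Data" name=""/>
    <dataset name="Best Time Series" urlPath="grib/EXAMPLE/Best"/>
  </dataset>
</catalog>
"""

if __name__ == "__main__":
    datasets, refs = list_catalog_entries(SAMPLE)
    print("datasets:", datasets)
    print("nested catalogs:", refs)
```

A crawler repeats this step for every nested catalog it finds, which is the part that thredds_crawler automates.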

dopplershift commented 7 years ago

Elsewhere I've been having a discussion about thredds_crawler + siphon, but first we need to do something about thredds_crawler's license: GPL 😱

lesserwhirls commented 7 years ago

Ouch...yeah, that's a problem. :disappointed:

rsignell-usgs commented 7 years ago

@kwilcox, would it be a big deal to change to another license?
@dopplershift , what do you prefer, MIT?

dopplershift commented 7 years ago

@kwilcox already said in email "That really isn't the correct license for thredds_crawler. NOAA/IOOS should figure that out with RPS before we move forward with using it for anything. IMO it should be public domain."

My preference is anything permissive--I usually go MIT or BSD 3-clause.

dopplershift commented 7 years ago

To be clear, my problem with GPL is that anything "derived" from it, which even includes me looking at the code for ideas, would have to then be GPL as well.

rsignell-usgs commented 7 years ago

@dpsnowden, @shane-axiom, @lukecampbell, any reason we couldn't do MIT license here, or CC0 (which we've been recommended to use for government-developed software...)?

lukecampbell commented 7 years ago

IANAL

I can't comment on the thredds-crawler thing, that's above my pay grade. But, public domain for software that was developed by and distributed by a non-government entity is dangerous because it opens up avenues for liability. Which is why the majority of permissive licenses just contain limited liability clauses, and some include attribution requirements.

I would prefer to see MIT as well. I've brought it up, and discussions are taking place outside of my realm of responsibility.

lukecampbell commented 7 years ago

And, you're right @dopplershift about GPL, it's like an open source infection. Anything that touches it, must be GPL (few exceptions which I'll omit for brevity). If the license is changed, any derivative software or linked software can become more permissive like https://github.com/axiom-data-science/thredds_iso_harvester

srstsavage commented 7 years ago

I changed the thredds-iso-harvester license to Unlicense, which is public domain and does include a liability section.

lukecampbell commented 7 years ago

I'd rather not debate copyright law, and again IANAL, but technically, because thredds-iso-harvester uses thredds_crawler, it's in violation of thredds_crawler's license, which is currently GPLv3.

lukecampbell commented 7 years ago

That's why they can't use thredds_crawler in siphon, because it's currently licensed under GPLv3.

srstsavage commented 7 years ago

Yes, good point. I reverted thredds-iso-harvester to GPL 3 for now. Cue Kafka.

Can you ping this issue if/when thredds_crawler gets a license update?

lukecampbell commented 7 years ago

I'm hopeful that the license will be changed soon.

lukecampbell commented 7 years ago

@shane-axiom We moved the thredds_crawler project from asascience-open to ioos and changed the license to MIT.

dopplershift commented 7 years ago

That's great. Thanks guys! 🎉

srstsavage commented 7 years ago

@lukecampbell Thanks Luke, I updated thredds-iso-harvester's license to MIT as well.

rsignell-usgs commented 7 years ago

@lesserwhirls , @dopplershift , I'm guessing this has slipped off the radar screen, but here's an example of how easy it is to harvest the ISO records from Unidata datasets.

This example harvests the ISO records from "Best" time series forecast models using Axiom's docker container for the thredds_iso_harvester:

$ do_harvest unidata.py

where do_harvest is:

#!/bin/bash
docker run --rm -v "$(pwd)/$1":/srv/harvest.py -v "$(pwd)/iso":/srv/iso \
  axiom/thredds_iso_harvester

and unidata.py is:

from thredds_iso_harvester.harvest import ThreddsIsoHarvester
from thredds_crawler.crawl import Crawl

skip = Crawl.SKIPS
select = [r'.*\/Best']

ThreddsIsoHarvester(catalog_url="http://thredds.ucar.edu/thredds/idd/forecastModels.xml",
    skip=skip, select=select,
    out_dir="/srv/iso/unidata")

Running this script should take just 1 or 2 minutes, and will create 50+ ISO records in a ./iso/unidata subdirectory.

The beauty of this technique is that you don't need to have a custom python environment, or even any python! You just need Docker.
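If Python does happen to be available on the host, a quick sanity check of the harvested output could look like the sketch below. It assumes the harvester writes each record as an `.xml` file under the output directory and that the records follow ISO 19115/19139 (a `gmd:fileIdentifier` holding a `gco:CharacterString`), as ncISO output normally does. The helper name `summarize_iso_records` is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Standard ISO 19139 namespaces used in ISO metadata records.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def summarize_iso_records(iso_dir):
    """Map each *.xml file name under iso_dir to its file identifier.

    Hypothetical helper: files without a gmd:fileIdentifier map to None.
    Assumes file names are unique across subdirectories.
    """
    summary = {}
    for path in sorted(Path(iso_dir).rglob("*.xml")):
        root = ET.parse(str(path)).getroot()
        ident = root.findtext("gmd:fileIdentifier/gco:CharacterString",
                              default=None, namespaces=NS)
        summary[path.name] = ident.strip() if ident else None
    return summary
```

Calling `summarize_iso_records("./iso/unidata")` after the harvest and checking the length of the result gives a rough count of the records that were written.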

lesserwhirls commented 7 years ago

> @lesserwhirls , @dopplershift , I'm guessing this has slipped off the radar screen, but here's an example of how easy it is to harvest the ISO records from Unidata datasets.

In part, yes; in other part, several of our machines run SunOS, and running a python stack on that can be quite...ummm..what's the word I'm looking for, @dopplershift? And Docker? Fuhgeddaboudit. If you were to do a demo at the spring user comm showing what kind of search capabilities this enables, that would be awesome!

rsignell-usgs commented 7 years ago

> If you were to do a demo at the spring user comm showing what kind of search capabilities this enables, that would be awesome!

@lesserwhirls, I'd love to give a demo of harvesting multiple thredds catalogs, then querying the catalog using a Jupyter notebook and then TerriaJS.

Only problem is that I already asked to give a presentation on ERDDAP for obs data. Would it be too much to do both?

Here are examples of exploring some of the Unidata thredds forecast models, with datasets dynamically populated via a CSW query to the IOOS catalog:

Jupyter Example: https://gist.github.com/anonymous/0a3a8ec292a4a480a0c01b89ef3a297e

TerriaJS Example: http://gamone.whoi.edu/terriajs/#clean&proxy/_60s/https://raw.githubusercontent.com/USGS-CMG/terriajs-dive/master/examples/csw_unidata.json
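To give a rough idea of what "a CSW query" means on the wire, here is a minimal sketch that builds a CSW 2.0.2 GetRecords request as a key-value GET URL with a CQL full-text filter. It is not taken from the notebook above. The endpoint `https://example.org/csw` is a placeholder rather than the IOOS catalog address, the helper name `build_getrecords_url` is hypothetical, and the queryable name `csw:AnyText` is an assumption that may need adjusting for a given server.

```python
from urllib.parse import urlencode


def build_getrecords_url(endpoint, text, max_records=10):
    """Build a CSW 2.0.2 GetRecords URL that searches all text fields for `text`.

    Hypothetical helper for illustration; `endpoint` is whatever CSW service
    URL the catalog exposes.
    """
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "summary",
        "resultType": "results",
        "constraintLanguage": "CQL_TEXT",
        "constraint_language_version": "1.1.0",
        # Assumed queryable name; servers may expose a different one.
        "constraint": f"csw:AnyText like '%{text}%'",
        "maxRecords": str(max_records),
    }
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder endpoint, not a real catalog.
    print(build_getrecords_url("https://example.org/csw", "unidata"))
```

The response to such a request is an XML document of matching records, which a client such as a notebook or TerriaJS can then turn into a list of datasets.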
