Open rsignell-usgs opened 8 years ago
Here's one approach that we use in IOOS.
conda install -c conda-forge pycsw
or use the Docker container here: https://hub.docker.com/r/axiom/docker-pycsw/
I agree this is a great thing to work on. The approach here at Unidata has been that "search" (Data Discoverability) encompasses so much, and there are many experts in that area, which we are not. Rather than build yet another one-off solution, we decided to work with those experts to ensure that the TDS can provide the information they need to do their magic. From what I understand, ISO metadata has been the most useful. It now seems there are standard services that can suck in the ISO metadata and provide pretty nice search and discoverability capabilities.
It seems to me that we are at (or past) the point where we at Unidata should be reaching out to the community, as well as doing some in-house evaluations, to see if there is a solution that we could recommend for use with the TDS.
One obvious solution would be pyCSW, which I know you've worked with. Do you think that would be a good place to start? Note that here I consider any brokering solutions, such as GI-CAT, to be a separate topic.
Ok, I think we should start by evaluating the IOOS workflow. Opinions?
Here's an actual example that uses the harvester, a script that harvests datasets from
http://thredds.ucar.edu/thredds/catalog.xml
https://gist.github.com/kwilcox/60b8a3e771987f96adf0c6b1e77ede24
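For anyone unfamiliar with what a crawler has to do with a catalog.xml like the one above, here is a minimal sketch using only the Python standard library. It is not the code from the gist or from thredds_crawler; the function name and the sample catalog are made up for illustration, and only the two XML namespaces are taken from the THREDDS catalog format.

```python
import xml.etree.ElementTree as ET

# Namespaces used by THREDDS client catalogs (InvCatalog 1.0) and XLink.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def list_catalog_entries(catalog_xml):
    """Return (dataset urlPaths, nested catalog hrefs) for one catalog document."""
    root = ET.fromstring(catalog_xml)
    # Directly accessible datasets carry a urlPath attribute.
    datasets = [
        d.get("urlPath")
        for d in root.iter("{%s}dataset" % THREDDS_NS)
        if d.get("urlPath")
    ]
    # catalogRef elements point at nested catalogs that a crawler would follow.
    href = "{%s}href" % XLINK_NS
    children = [
        c.get(href)
        for c in root.iter("{%s}catalogRef" % THREDDS_NS)
        if c.get(href)
    ]
    return datasets, children


# Hypothetical, hand-written catalog used only for this illustration.
SAMPLE = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink" name="demo">
  <dataset name="Demo">
    <dataset name="Best" ID="demo/Best" urlPath="demo/Best"/>
    <catalogRef xlink:href="idd/forecastModels.xml" xlink:title="Forecast Models" name=""/>
  </dataset>
</catalog>"""

datasets, children = list_catalog_entries(SAMPLE)
print(datasets)
print(children)
```

A real crawler would fetch each nested catalog in turn and repeat the same step, which is the part thredds_crawler handles.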
Elsewhere I've been having a discussion about thredds_crawler + siphon, but first we need to do something about thredds_crawler's license: GPL 😱
Ouch...yeah, that's a problem. 😞
@kwilcox, would it be a big deal to change to another license?
@dopplershift, what do you prefer, MIT?
@kwilcox already said in email "That really isn't the correct license for thredds_crawler. NOAA/IOOS should figure that out with RPS before we move forward with using it for anything. IMO it should be public domain."
My preference is anything permissive--I usually go MIT or BSD 3-clause.
To be clear, my problem with GPL is that anything "derived" from it, which even includes me looking at the code for ideas, would have to then be GPL as well.
@dpsnowden, @shane-axiom, @lukecampbell, any reason we couldn't do MIT license here, or CC0 (which we've been recommended to use for government-developed software...)?
IANAL
I can't comment on the thredds-crawler thing, that's above my pay grade. But, public domain for software that was developed by and distributed by a non-government entity is dangerous because it opens up avenues for liability. Which is why the majority of permissive licenses just contain limited liability clauses, and some include attribution requirements.
I would prefer to see MIT as well. I've brought it up, and discussions are taking place outside of my realm of responsibility.
And, you're right @dopplershift about GPL, it's like an open source infection. Anything that touches it, must be GPL (few exceptions which I'll omit for brevity). If the license is changed, any derivative software or linked software can become more permissive like https://github.com/axiom-data-science/thredds_iso_harvester
I changed the thredds-iso-harvester license to Unlicense, which is public domain and does include a liability section.
I'd rather not debate copyright law, and again IANAL, but technically, because thredds-iso-harvester uses thredds_crawler, which is currently GPLv3, it is in violation of thredds_crawler's license.
That's why they can't use thredds_crawler in siphon, because it's currently licensed under GPLv3.
Yes, good point. I reverted thredds-iso-harvester to GPL 3 for now. Cue Kafka.
Can you ping this issue if/when thredds_crawler gets a license update?
I'm hopeful that the license will be changed soon.
@shane-axiom We moved the thredds_crawler project from asascience-open to ioos and changed the license to MIT.
That's great. Thanks guys! 🎉
@lukecampbell Thanks Luke, I updated thredds-iso-harvester's license to MIT as well.
@lesserwhirls, @dopplershift, I'm guessing this has slipped off the radar screen, but here's an example of how easy it is to harvest the ISO records from Unidata datasets.
This example harvests the ISO records from "Best" time series forecast models using Axiom's docker container for the thredds_iso_harvester:
$ do_harvest unidata.py
where do_harvest is:
#!/bin/bash
# Mount the harvest script ($1) and the ./iso output directory into the container.
docker run --rm -v "$(pwd)/$1:/srv/harvest.py" -v "$(pwd)/iso:/srv/iso" \
    axiom/thredds_iso_harvester
and unidata.py is:
from thredds_iso_harvester.harvest import ThreddsIsoHarvester
from thredds_crawler.crawl import Crawl

# Reuse the crawler's built-in skip list.
skip = Crawl.SKIPS
# Select only the "Best" time series datasets.
select = [r'.*/Best']
ThreddsIsoHarvester(catalog_url="http://thredds.ucar.edu/thredds/idd/forecastModels.xml",
                    skip=skip, select=select,
                    out_dir="/srv/iso/unidata")
Running this script should take just 1 or 2 minutes, and will create 50+ ISO records in a ./iso/unidata subdirectory.
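If you do happen to have Python around, a quick way to sanity-check the harvested records is to pull the title out of each file. This is only a sketch: it assumes the files are ISO 19115/19139 XML containing a gmd:title element, and the helper name, sample record, and file name below are invented for illustration rather than taken from real harvester output.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Standard ISO 19139 namespaces for metadata (gmd) and basic types (gco).
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def iso_titles(iso_dir):
    """Map each *.xml file name under iso_dir to its first ISO title, or None."""
    titles = {}
    for path in sorted(Path(iso_dir).rglob("*.xml")):
        root = ET.parse(path).getroot()
        # Search at any depth so the root element type does not matter.
        node = root.find(".//gmd:title/gco:CharacterString", NS)
        titles[path.name] = node.text if node is not None else None
    return titles


# Hypothetical, heavily trimmed record used only to exercise the function.
SAMPLE_RECORD = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:citation>
        <gmd:CI_Citation>
          <gmd:title><gco:CharacterString>Demo Best Time Series</gco:CharacterString></gmd:title>
        </gmd:CI_Citation>
      </gmd:citation>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>"""

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.xml").write_text(SAMPLE_RECORD, encoding="utf-8")
    demo_titles = iso_titles(tmp)
print(demo_titles)
```

Against real output you would call iso_titles("./iso/unidata") instead of the temporary directory.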
The beauty of this technique is that you don't need to have a custom python environment, or even any python! You just need Docker.
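For readers wondering what the skip and select arguments in unidata.py do, here is a rough sketch of regex-based filtering. It assumes that skip patterns are tested against dataset names and select patterns against dataset IDs, both with re.match; that is my reading of thredds_crawler, not something verified here. The skip pattern and the (ID, name) pairs below are hypothetical.

```python
import re

# Same pattern as the select entry in unidata.py: any ID containing "/Best".
select = [re.compile(r".*/Best")]
# Hypothetical skip pattern for illustration; the real defaults live in Crawl.SKIPS.
skip = [re.compile(r".*Individual Files.*")]


def keep(dataset_id, dataset_name):
    """Assumed rule: drop on a skip match by name, then require a select match by ID."""
    if any(p.match(dataset_name) for p in skip):
        return False
    return any(p.match(dataset_id) for p in select)


# Hypothetical (ID, name) pairs, not read from a real catalog.
candidates = [
    ("grib/NCEP/GFS/Global_0p5deg/Best", "Best GFS Half Degree Forecast Time Series"),
    ("grib/NCEP/GFS/Global_0p5deg/TwoD", "Full Collection Dataset"),
    ("grib/NCEP/NAM/CONUS_12km/files", "Individual Files"),
]

kept = [ds_id for ds_id, name in candidates if keep(ds_id, name)]
print(kept)
```

Filtering this way is what keeps the harvest down to one record per model instead of one per file.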
> @lesserwhirls, @dopplershift, I'm guessing this has slipped off the radar screen, but here's an example of how easy it is to harvest the ISO records from Unidata datasets.
In part, yes; in other part, several of our machines run SunOS, and running a python stack on that can be quite...ummm..what's the word I'm looking for, @dopplershift? And Docker? Fuhgeddaboudit. If you were to do a demo at the spring user comm showing what kind of search capabilities this enables, that would be awesome!
> If you were to do a demo at the spring user comm showing what kind of search capabilities this enables, that would be awesome!
@lesserwhirls, I'd love to give a demo of harvesting multiple thredds catalogs, then querying the catalog using a Jupyter notebook and then TerriaJS.
Only problem is that I already asked to give a presentation on ERDDAP for obs data. Would it be too much to do both?
Here's an example of exploring some of the Unidata THREDDS forecast models, with datasets dynamically populated via a CSW query to the IOOS catalog:
Jupyter Example: https://gist.github.com/anonymous/0a3a8ec292a4a480a0c01b89ef3a297e
TerriaJS Example: http://gamone.whoi.edu/terriajs/#clean&proxy/_60s/https://raw.githubusercontent.com/USGS-CMG/terriajs-dive/master/examples/csw_unidata.json
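To give a feel for what a CSW query looks like on the wire, here is a small sketch that builds a CSW 2.0.2 GetRecords request as a GET URL with a CQL text filter. It is not taken from the notebook or the TerriaJS config linked above, which may use different tooling. The endpoint is a placeholder, the function name is made up, and the csw:AnyText queryable and the exact parameter spellings are assumptions based on how pycsw-style servers are commonly queried.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def csw_getrecords_url(endpoint, any_text, max_records=10):
    """Build a CSW 2.0.2 GetRecords KVP URL with a full-text CQL filter."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "full",
        "resultType": "results",
        "constraintLanguage": "CQL_TEXT",
        "constraint_language_version": "1.1.0",
        # Assumed queryable name; servers may expose a different one.
        "constraint": "csw:AnyText like '%{}%'".format(any_text),
        "maxRecords": str(max_records),
    }
    # urlencode percent-escapes the filter so it survives as one parameter.
    return endpoint + "?" + urlencode(params)


# Placeholder endpoint, not a real catalog.
url = csw_getrecords_url("https://example.org/csw", "forecast", max_records=5)
query = parse_qs(urlsplit(url).query)
print(url)
```

Each record in the response would then carry links back to the THREDDS services for the matching dataset, which is what lets a client populate its dataset list dynamically.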
Here at the September Unidata Users Committee meeting, Unidata Director Mohan listed "Data Discoverability" as a major potential theme for the 2016 Strategic Plan. I agree this would be a great thing to work on, and Unidata is in a great position to do this because they already have many THREDDS servers out in the community serving data with ncISO services available to create ISO metadata. And there are many catalog services that can ingest ISO metadata and provide standardized CSW or OpenSearch catalog interfaces.