The POLDER Federated Search was originally developed by the World Data System International Technology Office between 2021 and 2023, in response to needs identified by the POLDER Working Group. It is currently deployed at https://search.polder.info.
For more comprehensive documentation, see the documentation book at https://polder-crew.github.io/Federated-Search-Documentation/.
There are two ways a repository can be included in searches made from this application: if the repository is indexed by DataONE (and has data within the search parameters), or if the repository is indexed by the app itself.
There are many DataONE repositories, but of particular interest to the polar research community are:
(see `docker/build-bas-sitemap.sh` for how that works)

This tool uses Docker images to manage the different services that it depends on. One of those is Gleaner.
The web app itself, which hosts the UI and does the searches, is built using Flask, a Python web framework. I chose Python because it has good support for RDF and SPARQL operations via RDFLib. The frontend dependencies are HTML, JavaScript, and SCSS, built using Parcel. The maps in the user interface are built using OpenLayers.
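To get a feel for the SPARQL side of this, here's a hedged sketch of the kind of request the app makes against the triplestore. It assumes a GraphDB instance on its default port 7200; the repository name `polder` is an assumption, so substitute whatever your instance actually uses:

```bash
# A minimal sketch: ask the GraphDB SPARQL endpoint for a few triples.
# "polder" is a hypothetical repository name; check your GraphDB workbench for the real one.
curl -s -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5" \
  http://localhost:7200/repositories/polder
```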
Errors in deployed versions of this project are collected with Sentry.
A pre-built image for the web app is on Docker Hub as `wdsito/polder-federated-search`, and that is what all of the Helm/Kubernetes and Docker files in this repository reference. If you want to modify this project and build your own images, you're welcome to.
There is also a sitemap-building step for some of the data repositories that don't have sitemaps that work in the way we want (because they don't have sitemaps at all, because we wanted to scope the crawled datasets down to just polar data, or for some other reason). That step uses a purpose-built Docker image, and the code for that is in `build-sitemap` in this repository.
Images are automatically built with Github Actions, and tagged with the version specified in `package.json` (in this directory). If you want to deploy a new version of the site, remember to increment the version in `package.json` and in `helm/Chart.yaml`. Once the Github Action completes, the new website image will be ready for you to deploy.
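Before tagging a release, it can help to confirm that the two version numbers actually match. This is just a convenience check, not part of the build, and it assumes you have Node available (you will if you're building the frontend):

```bash
# Print both versions side by side so you can spot a mismatch before deploying.
# (The chart may track the app version in either version: or appVersion:.)
node -p "require('./package.json').version"
grep -E '^(version|appVersion):' helm/Chart.yaml
```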
There is a directory called `deployment-support`, which has files that both Docker and Helm / Kubernetes can use to configure Gleaner and GraphDB.
`docker.yaml` is what Docker uses to configure Gleaner when you run it using `docker compose`. Logs will go into this folder, as well as other files associated with a Gleaner run.
Files in here are used by both Docker and Helm / Kubernetes to work with GraphDB. There are shell scripts to set up, clear, and write to GraphDB, as well as various settings files.
The file `EXAMPLE-graphdb-users.js` is standing in for a file that you should not check into source control - `graphdb-users.js`. The reason not to check it in is that it contains password hashes. You can either generate `bcrypt`-ed password hashes using a tool like this one, or start a GraphDB instance, create the users you want (remember to reset the admin password too; it's 'root' by default), and then download the `users.js` file, which is at `/opt/graphdb/home/data/users.js`. You could use the GraphDB image referenced in `docker-compose.yaml` for this purpose. I recommend doing the following:
```bash
docker run -p 127.0.0.1:7200:7200 -t ontotext/graphdb:10.2.0
```

(substitute the appropriate image version there), then, once you've created your users:

```bash
cat /opt/graphdb/home/data/users.js
```
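Alternatively, instead of `cat`-ing the file from inside the container, you can copy it straight out with `docker cp`. This is just a sketch - the container name is a placeholder, and the destination assumes you want the file to end up in `deployment-support`:

```bash
# Copy users.js out of the running GraphDB container.
# <container> is a placeholder; find the real name or ID with `docker ps`.
docker cp <container>:/opt/graphdb/home/data/users.js deployment-support/graphdb-users.js
```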
Don't forget to set the matching passwords in your `.env` file as well.
The GraphDB documentation may be of use to you here.
Assuming that you're starting from this directory:

1. Copy `dev.env` to `.env` and fill in the correct values for you. Save the file and then run `source .env`.
2. `cd docker`
3. `docker-compose up -d`
4. `docker-compose --profile setup up -d`, in order to start all of the necessary services and set up Gleaner for indexing.
5. Then, still in the `docker` directory:
   - `docker-compose --profile crawl up -d`
   - `docker-compose --profile write up -d`
   - `docker-compose --profile web up -d`
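Before moving on, it's worth checking that everything actually came up. Something like the following works; the service name passed to `logs` is an assumption, so use whatever names appear in `docker-compose.yaml`:

```bash
# List the containers docker-compose started and their current state
docker-compose ps
# Tail the logs of one service; "gleaner" is a hypothetical service name here
docker-compose logs -f gleaner
```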
If you're using Docker Desktop, you can use the UI to open the docker-webapp container in a browser.
If you ever need to remove everything from the triplestore and start over, you can run `./clear-triplestore.sh`.
Install Helm if you don't already have it (on a Mac, `brew install helm`), or visit the Helm website for instructions. Then install the nginx ingress controller:

```bash
helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace
```
You may need some additional steps for minikube or MicroK8s - see the ingress-nginx documentation for more details.
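Once the chart installs, a quick sanity check that the controller is actually running doesn't hurt:

```bash
# The ingress-nginx controller pod should show STATUS "Running" before you deploy the app
kubectl get pods --namespace ingress-nginx
```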
In `helm/templates`, create a file named `secrets.yaml`. It's listed in `.gitignore`, so it won't get checked in.
That file will be structured like this:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-secrets
data:
  minioAccessKey: <your base64 encoded value here>
  minioSecretKey: <your base64 encoded value here>
  flaskSecretKey: <your base64 encoded value here>
  sentryDsn: <your base64 encoded value here>
  graphdbIndexerPassword: <your base64 encoded value here>
  graphdbRootPassword: <your base64 encoded value here>
```
You can see that the values of the secrets are base64 encoded - to produce these, run `echo -n 'mysecretkey' | base64` on your command line for each value, and paste the result in where the key goes. Don't check in your real secrets anywhere!
You can read more about secrets in the Kubernetes documentation: https://kubernetes.io/docs/concepts/configuration/secret/.
In order to deploy to the dev or prod clusters, which are currently hosted in DataONE's analogous Kubernetes clusters, you need to ask someone in that organization for their Kubernetes config information. Name that file `polder.config` and put it in this directory; it'll get added to your environment automatically.
Assuming that you're starting from this directory, you can run:

```bash
helm install polder ./helm -f helm/values-local.yaml
```

to deploy the chart to a docker-desktop Kubernetes instance.
Some notes: the `polder` release name can be replaced with whatever you want. For a dev or prod environment deploy, you need to first be using the correct Kubernetes context (`kubectl config get-contexts` can tell you which ones are available to you). For dev, use `values-dev.yaml` instead of `values-local.yaml`, and for a production deploy, use `values-prod.yaml`. Note that `values-dev` and `values-prod` are currently set up to deploy in DataONE's dev and prod Kubernetes clusters. They will not work without the correct keys and permissions from DataONE.
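Putting that together, a dev deploy might look like the sketch below. The context name is a placeholder - use whatever `kubectl config get-contexts` reports for DataONE's dev cluster:

```bash
# Switch to the dev cluster's context (placeholder name), then deploy with the dev values
kubectl config use-context <dataone-dev-context>
helm upgrade --install polder ./helm -f helm/values-dev.yaml
```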
The cluster will take a few minutes to spin up. In addition to downloading all these Docker images and starting the web app, it does the following:
If you're using Docker desktop for all this, you can visit http://localhost and see it running!
The Helm chart also includes a Kubernetes `CronJob` that tells Gleaner to index once a week. You can see it at `helm/templates/crawl.yaml`.
In addition, there's a `CronJob` that is set to run on the 30th of February, at `helm/templates/recreate-index.yaml`. This is a terrible hack to get around the fact that you cannot include a job in a Helm chart without it being automatically run when you deploy the chart. I wanted a way to remove all of the indexed files and recreate the triplestore without having to do a bunch of manual steps. To run this job, you can do `kubectl create job --from=cronjob/recreate-index reindex` - but do note that it will delete and recreate the entire index.
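If you do kick that job off, you can watch it run and read its output with standard kubectl commands; `reindex` here matches the job name used above:

```bash
# Watch the reindex job until it completes, then follow its pod's logs
kubectl get jobs --watch
kubectl logs job/reindex -f
```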
Note that the 30th of February has happened at least twice, but given the other circumstances under which it occurred, I'm guessing that a federated search reindex will be the least of your worries.
Take a look at `helm/values-*.yaml` to customize this setup. Pay particular attention to `persistence` at the bottom; if you're running locally, you probably want `existing: false` in there.
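If you'd rather not edit the values file for a quick local run, the same setting can be overridden on the command line - this assumes the key path shown in `values-local.yaml`:

```bash
# Override the persistence setting at install time instead of editing the values file
helm install polder ./helm -f helm/values-local.yaml --set persistence.existing=false
```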
Go to the site and do a search for something like "Greenland", "ice", or "penguin" - those each have lots of results from both DataONE and the triplestore.

If you open a web inspector on the results, you can look for web elements with `class="result"`. POLDER-crawled results have a `data-source` attribute that's set to `Gleaner`, and DataONE results have `data-source="DataONE"`. If you have both types, congratulations! You have a working federated search.
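You can also run a rough version of this check from the command line. The query path below is hypothetical - match it to whatever URL your browser actually shows when you search:

```bash
# Count results by data-source attribute; the /search?q= path is an assumed URL shape
curl -s "http://localhost/search?q=Greenland" | grep -o 'data-source="[^"]*"' | sort | uniq -c
```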
I'd love for people to use this for all kinds of scientific data repository searches - please feel free to fork it, submit a PR, or ask questions. The Customization section of the book will be particularly useful to you.
If you use the Github Actions (see the Images and Versions section above) to automatically build and push Docker images to the WDS-ITO Docker hub, you'll need to update the versions in `package.json`, as well as `helm/Chart.yaml` and the versions in `docker/docker-compose.yaml`, in order to get the images with your latest code.
To build the Docker image for the web app, run `docker image build .`. For multi-architecture support, run `docker buildx build --no-cache --pull --platform=linux/arm64,linux/amd64 .`.
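Note that multi-platform builds need a buildx builder that uses the docker-container driver; if you've never set one up, something like this first:

```bash
# Create and select a buildx builder capable of multi-platform builds
docker buildx create --name multiarch --driver docker-container --use
docker buildx ls   # confirm the new builder is active
```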
Assuming that you're starting from this directory:

The easiest setup for development on the web app itself is to use docker-compose for the dependencies, like Gleaner and GraphDB (`docker-compose up -d`), and run the app itself directly in Flask. To do that, follow the steps in the Deployment -> Docker section above, but skip the last one. Instead, do:

1. `cd ../` (to get back to this directory)
2. `source venv/bin/activate` (if you don't have a virtual environment yet, see the note after this list)
3. `pip install -r requirements.txt`
4. `npm install --global yarn` (assuming you do not have yarn installed)
5. `yarn install`
6. `yarn watch` (assuming that you want to make JavaScript or CSS changes - if not, `yarn build` will do)
7. `flask run`
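The steps above assume a virtual environment already exists in this directory; if yours doesn't, create one first (this uses the stdlib `venv` module):

```bash
# Create the virtual environment that `source venv/bin/activate` expects
python3 -m venv venv
```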
You should see Flask's startup message, and get an address for your locally running web app.
This project originally used Blazegraph instead of GraphDB. We changed because we wanted GraphDB's GeoSPARQL support and nice development console - but GraphDB is not open source, although a free version is available. If you wish to build a project that only has open-source software in it, you can use Blazegraph instead. See https://polder-crew.github.io/Federated-Search-Documentation/blazegraph.html for detailed instructions.
This app includes Python unit tests! To run them from this directory, do `python -m unittest` after you activate your virtual environment.
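A couple of handy variations, if you want more detail or to run just one module - the module name below is a placeholder, so substitute a real file from the test directory:

```bash
# Verbose output for the whole suite
python -m unittest -v
# Run a single test module; tests.test_search is a hypothetical name
python -m unittest tests.test_search
```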
Adding or updating tests as part of any contribution you make is strongly encouraged.
The SCSS styles are built assuming a mobile-first philosophy. Here are the breakpoints, for your convenience; they are also in `_constants.scss`.
There are also some special map styles, in `app/static/maps`; you can read more about how they work in that directory.
The file `docker_performance.ipynb` is meant to give you nifty graphs of the resources used by this app, broken out by docker container. You can use this method to look at the Kubernetes cluster performance too.
To set it up from this directory:

1. `source venv/bin/activate`
2. `pip install -r requirements-performance.txt`
3. Use the `docker stats` command to write csv files, like so: `while true; do docker stats --no-stream | cat >> ./$(date -u +"%Y%m%d").csv; sleep 10; done`
4. Run `jupyter notebook` in this directory