vlraymond commented 6 years ago

I want to be able to batch search multiple NSF award numbers to see whether the data from existing awards is in the ADC.

From Jeannette:

Check out the intro to Solr chapter in our training manual: https://nceas.github.io/datateam-training/introduction-to-solr.html

From Chris

In browser:

Background:

Solr is a web-based indexing system.
It is built on Lucene, a text-search library; Solr provides the web-facing layer on top of it.
We index a subset of all fields available in EML. These fields are mapped across the different metadata standards used by different repositories.

Preliminary Queries:

To display the capabilities of the member node, including core, storage, and replication capabilities (MNRead includes the Query API): arcticdata.io/metacat/d1/mn/v2/node

To see what query capabilities there are: arcticdata.io/metacat/d1/mn/v2/query

To list all fields and descriptions for ADC: arcticdata.io/metacat/d1/mn/v2/query/solr

Query strings, what do the bits and pieces mean

? = start of the parameters (key=value pairs)
: = "field":"value"
fl = return fields
rows = number of returns
wt = output format
+-obsoletedBy: = remove obsolete versions
beginDate:[YYYY-MM-DDT"HR:MN"Z%20TO%20YYYY-MM-DDT"HR:MN"Z] = date range
%20 = url encoded space
origin = creator of metadata / dataset
originator = investigator or organization name
"*" = all
"+" = space
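
To see how the pieces above fit together, here is a minimal sketch that assembles a query URL from its parts. The function name `build_query_url` and its default values are made up for illustration and are not part of the original notes; only the base URL and the q, fl, rows and wt parameters come from the walkthrough.

```shell
#!/bin/bash
# Hypothetical helper (not from the original notes): assemble a Solr query URL.
# Usage: build_query_url QUERY [FIELDS] [ROWS] [FORMAT]
build_query_url() {
    local base="https://arcticdata.io/metacat/d1/mn/v2/query/solr/"
    local q="$1"
    local fl="${2:-id,title}"   # assumed default: return id and title
    local rows="${3:-10}"       # assumed default: 10 rows
    local wt="${4:-json}"       # assumed default: JSON output
    # Replace literal spaces in the query with the URL-encoded form %20
    q="${q// /%20}"
    printf '%s?q=%s&fl=%s&rows=%s&wt=%s\n' "$base" "$q" "$fl" "$rows" "$wt"
}

# Rebuild the "soil" example: title only, 100 rows, JSON
build_query_url 'title:*soil*' 'title' 100 json
```

The last line prints the same URL as the "soil" example in the Examples section. Quote the query in single quotes so the shell does not try to expand the asterisks.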

Examples

Query all fields, all values
arcticdata.io/metacat/d1/mn/v2/query/solr/?q=*:*

Query title field that contains word "soil"
arcticdata.io/metacat/d1/mn/v2/query/solr/?q=title:*soil*

Query title field that contains word "soil" with only title and ID returned
arcticdata.io/metacat/d1/mn/v2/query/solr/?q=title:*soil*&fl=id,title

Query title field that contains word "soil" with only title returned, 100 rows, in .json format
arcticdata.io/metacat/d1/mn/v2/query/solr/?q=title:*soil*&fl=title&rows=100&wt=json

Command line:

curl to a web server; the query below will return only origin as .csv
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&wt=csv"

curl to the web server; the query below will return 5000 origin names as .csv
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv"

curl to the web server; the query below will return 5000 origin names as .csv and keep only the lines containing "Grebmeier"
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv" | grep Grebmeier

curl to the web server; the query below will count the lines containing "Grebmeier"
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv" | grep Grebmeier | wc -l

curl to the web server; the query below will sort the "origin" items containing Grebmeier, keep the unique ones, and count them
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv" | grep Grebmeier | sort | uniq | wc

the query below will return all unique "origin" items in arcticdata.io (among the first 5000 rows)
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv" | sort | uniq

query the id field and return the identifier of each document
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=id:*&fl=id&rows=5000&wt=csv" | sort | uniq
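
The grep / sort / uniq / wc pipeline above can be wrapped in a small function so it is easier to reuse with different names. This is only a sketch: `count_unique_matches` is a made-up name, and the sample rows are invented stand-ins for the CSV that curl would return, so it can be tried offline.

```shell
#!/bin/bash
# Hypothetical helper: count the distinct lines on stdin that contain a name.
count_unique_matches() {
    local name="$1"
    # grep -F matches the name as a fixed string; sort -u drops duplicate lines;
    # tr strips the padding some versions of wc add around the number
    grep -F -- "$name" | sort -u | wc -l | tr -d ' '
}

# Offline demonstration with made-up rows standing in for the CSV output
sample='origin
Jane Doe
John Roe
Jane Doe
J. Doe'
printf '%s\n' "$sample" | count_unique_matches "Doe"
```

With the sample rows this prints 2 ("Jane Doe" and "J. Doe"). Against the live server the same function would be used as `curl -s "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv" | count_unique_matches Grebmeier`.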

How to pull a list of award numbers in ADC:

Gameplan:

Get all EML docs that are not obsoleted
Download each one
Run each one through an XML processor to pull the funding section
Use xmlstarlet to process the XML on the fly and pull out the information needed

Getting set up

install homebrew - https://brew.sh/
install atom - https://atom.io/ (for some stuffs later on....)
Once homebrew is installed, run the following command using Terminal / command line to install xmlstarlet:
brew install xmlstarlet

Test xmlstarlet with one record:
curl "https://arcticdata.io/metacat/d1/mn/v2/object/doi:10.18739/A2MS0P"

To tidy up outputs use the "fo" command:
curl "https://arcticdata.io/metacat/d1/mn/v2/object/doi:10.18739/A2MS0P" | xmlstarlet fo

To drill down to funding, tell xmlstarlet what part of EML to look in:
curl -s "https://arcticdata.io/metacat/d1/mn/v2/object/doi:10.18739/A2MS0P" | xmlstarlet sel -t -v "/eml:eml/dataset/project/funding/para" -n

But wait there's more

(NB: this is for publicly available items)

Launch atom by typing atom and pressing return in terminal / command line

Copy in the following script


#!/bin/bash
mn_url="https://arcticdata.io/metacat/d1/mn/v2";
query_endpoint="/query/solr";
object_endpoint="/object";
solr_query="?q=formatType:METADATA+AND+-obsoletedBy:*+AND+id:doi\:10\.18739*\
&rows=10000\
&fl=id\
&wt=csv";

identifiers=$(curl -s "${mn_url}${query_endpoint}/${solr_query}");

let count=0;

# For each identifier, download the EML and process it
for identifier in $identifiers; do

    # Skip the first line (the "id" header of the CSV output)
    if [[ "$identifier" == "id" ]]; then
        continue;
    fi
    count=$(( count + 1 ));
    echo "${count}) ${identifier}";

    # Call the DataONE MNRead.get() call to grab the EML
    xml=$(curl -s "${mn_url}${object_endpoint}/${identifier}");
    xml=$(xmlstarlet fo <<< "${xml}");

    # Print the formatted EML document
    echo "${xml}";

    # Use xmlstarlet to find the elements in each EML document to get the award number
    award_numbers=$(xmlstarlet sel -t -v "/eml:eml/dataset/project/funding/para" -n <<< "${xml}");
    if [[ "$award_numbers" != "" ]]; then
        printf "%s\n" "$award_numbers";
    else
        printf "%s\n" "No funding element found";
    fi
done


- Save this script as scriptname.sh, where "scriptname" is the name you've chosen.
- In the terminal, navigate to the folder your script is saved in (for example /Documents/projects) and type the following command to make the script executable:
```chmod +x scriptname.sh```
- You should now be able to run your script from the command line by typing:
```./scriptname.sh```
- If you want to run the command and have it output to a file, say a .csv for example, run this:  
```./scriptname.sh > file.csv```

et voila.
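
Once the script output is saved to a file, the original ask (batch checking several award numbers) could be handled with a loop over grep. This is a rough sketch, not something from the walkthrough: `check_awards` is a made-up name, the award numbers and file contents in the demonstration are invented, and it only finds awards whose number appears as plain text in the funding paragraph.

```shell
#!/bin/bash
# Hypothetical helper: report which award numbers appear in a saved output file.
# Usage: check_awards OUTPUT_FILE AWARD_NUMBER...
check_awards() {
    local file="$1"
    shift
    local award
    for award in "$@"; do
        # -q: no output, exit status only; -w: whole-word match, so 123456
        # does not match 1234567; -F: treat the number as a fixed string
        if grep -qwF -- "$award" "$file"; then
            printf '%s\tfound\n' "$award"
        else
            printf '%s\tnot found\n' "$award"
        fi
    done
}

# Offline demonstration with an invented output file and invented award numbers
demo_file=$(mktemp)
printf '%s\n' '1) doi:10.0000/EXAMPLE' 'NSF award 1234567' 'No funding element found' > "$demo_file"
check_awards "$demo_file" 1234567 7654321
rm -f "$demo_file"
```

In practice this would be run against the real output, for example `check_awards file.csv` followed by the award numbers to look up.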

#### R process:

vlraymond commented 6 years ago

from Chris: After our quick walkthrough of Solr yesterday, I wanted to point out the official documentation in case you need to look up the details: http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.2.pdf . This guide is for Solr server admins (installation, setup, etc), but starting on page 241 there is a section called Query Syntax and Parsing that has all the gory details for parameters and values you can use. We basically discussed the q= parameter from the Standard Query Parser discussed on page 247. Also, Dave Vieglais put together a query tool to help learn the syntax: https://examples.dataone.org/querycn.html - this one queries the CN, but maybe it’s helpful.

csjx commented 6 years ago

Hi @vlraymond : A quick note for your writeup: the Solr syntax uses a colon to delimit field names and values in the query string, and I think in the beginning of your notes you wrote , = "field" , "value" with an example of arcticdata.io/metacat/d1/mn/v2/query/solr/?q=*,* (which should be q=*:*). The rest of the notes use the colon though. Cheers.

kameyer commented 6 years ago

Note: In-house training/learning (for VR & KM)

NCEAS / arctic-data-outreach

training on how to query the ADC #8
