NCEAS / arctic-data-outreach

Outreach and engagement activities for the Arctic Data Center
Apache License 2.0

training on how to query the ADC #8

Closed vlraymond closed 4 years ago

vlraymond commented 6 years ago

I want to be able to batch search multiple NSF award numbers to see whether the data from existing awards is in the ADC.

From Jeannette:

Check out the intro to Solr chapter in our training manual: https://nceas.github.io/datateam-training/introduction-to-solr.html

From Chris

In browser:

Background:

Preliminary Queries:

Display the capabilities of the member node, including core, storage, and replication capabilities. The MNRead service includes the Query API.
arcticdata.io/metacat/d1/mn/v2/node

See what query capabilities are available:
arcticdata.io/metacat/d1/mn/v2/query

List all fields and their descriptions for the ADC:
arcticdata.io/metacat/d1/mn/v2/query/solr

Query strings, what do the bits and pieces mean

? = start of the parameters (name=value pairs, joined with &)
: = separates a field from its value, as in field:value
fl = fields to return
rows = number of results to return
wt = output format
+-obsoletedBy:* = exclude obsolete versions
beginDate:[YYYY-MM-DDTHH:MM:SSZ%20TO%20YYYY-MM-DDTHH:MM:SSZ] = date range
%20 = URL-encoded space
origin = creator of metadata / dataset
originator = investigator or organization name
"*" = all
"+" = space
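The pieces above can be assembled into a full query URL. Below is a minimal bash sketch. The function name `build_solr_url` and its argument order are made up for illustration, and the base URL is the ADC endpoint used throughout these notes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any ADC tooling): assemble a Solr query URL.
base_url="https://arcticdata.io/metacat/d1/mn/v2/query/solr/"

build_solr_url() {
  local q="$1"      # query, e.g. title:*soil*
  local fl="$2"     # comma-separated fields to return
  local rows="$3"   # maximum number of results
  local wt="$4"     # output format, e.g. csv or json
  # Spaces in the query must be URL-encoded; here they become %20.
  q="${q// /%20}"
  printf '%s?q=%s&fl=%s&rows=%s&wt=%s\n' "$base_url" "$q" "$fl" "$rows" "$wt"
}

# Titles containing "soil", excluding obsolete versions, as CSV
build_solr_url 'title:*soil* -obsoletedBy:*' 'id,title' 100 csv
```

The printed URL can be passed straight to curl, for example `curl -s "$(build_solr_url 'origin:*' origin 5000 csv)"`.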

Examples

Query all fields, all values
arcticdata.io/metacat/d1/mn/v2/query/solr/?q=*:*

Query title field that contains word "soil"
arcticdata.io/metacat/d1/mn/v2/query/solr/?q=title:*soil*

Query title field that contains word "soil" with only title and ID returned
arcticdata.io/metacat/d1/mn/v2/query/solr/?q=title:*soil*&fl=id,title

Query title field that contains word "soil" with only title returned, 100 rows, in .json format
arcticdata.io/metacat/d1/mn/v2/query/solr/?q=title:*soil*&fl=title&rows=100&wt=json

Command line:

curl to a web server; the query below returns only the origin field, as .csv
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&wt=csv"

The query below returns up to 5000 origin names as .csv
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv"

The query below returns up to 5000 origin names as .csv and keeps only the lines containing "Grebmeier"
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv" | grep Grebmeier

The query below counts the lines containing "Grebmeier"
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv" | grep Grebmeier | wc -l

The query below sorts the "origin" items containing Grebmeier, keeps the unique ones, and counts them
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv" | grep Grebmeier | sort | uniq | wc

The query below returns all unique "origin" items among the rows returned from arcticdata.io
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=origin:*&fl=origin&rows=5000&wt=csv" | sort | uniq

Query the id field and return the unique identifiers of the documents
curl "https://arcticdata.io/metacat/d1/mn/v2/query/solr/?q=id:*&fl=id&rows=5000&wt=csv" | sort | uniq
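To see what each stage of these pipelines contributes without hitting the server, the same commands can be run on a few lines of made-up CSV. The names below are placeholder data, not real query results.

```shell
#!/usr/bin/env bash
# Made-up stand-in for the CSV that the origin query returns (header line first).
sample_csv='origin
Jacqueline Grebmeier
Lee Cooper
Jacqueline Grebmeier
"Grebmeier, J."
Lee Cooper'

# Count the lines mentioning Grebmeier (duplicates included).
matches=$(printf '%s\n' "$sample_csv" | grep -c Grebmeier)
echo "matching lines: $matches"

# Count distinct spellings only: sort puts duplicates next to each other
# so that uniq can drop them.
distinct=$(printf '%s\n' "$sample_csv" | grep Grebmeier | sort | uniq | wc -l)
echo "distinct spellings: $((distinct))"
```

uniq only removes adjacent duplicates, which is why sort comes first; `sort -u` does both steps at once. Plain `wc` prints line, word, and byte counts, while `wc -l` prints only the line count.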

How to pull a list of award numbers in ADC:

Gameplan:

Getting set up

Test xmlstarlet with one record. First fetch the record's EML:
curl "https://arcticdata.io/metacat/d1/mn/v2/object/doi:10.18739/A2MS0P"

To tidy up the output, use the xmlstarlet "fo" (format) command:
curl "https://arcticdata.io/metacat/d1/mn/v2/object/doi:10.18739/A2MS0P" | xmlstarlet fo

To drill down to funding, tell xmlstarlet what part of EML to look in:
curl -s "https://arcticdata.io/metacat/d1/mn/v2/object/doi:10.18739/A2MS0P" | xmlstarlet sel -t -v "/eml:eml/dataset/project/funding/para" -n
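The funding paragraph is free text, so the award number usually has to be picked out of it. Here is a small sketch, assuming NSF award numbers appear as standalone 7-digit numbers. The sample strings are invented and the function name `extract_award_numbers` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each standalone 7-digit number found on stdin,
# one per line. Assumes award numbers are exactly 7 digits long.
extract_award_numbers() {
  # First pull out every run of digits, then keep only the runs of length 7.
  grep -oE '[0-9]+' | grep -xE '[0-9]{7}'
}

# Invented examples of what a funding paragraph might contain.
printf '%s\n' 'Funded in 2018 by NSF 1234567' 'PLR-7654321 and 1111111' \
  | extract_award_numbers
```

The xmlstarlet command above could be piped into this function by appending `| extract_award_numbers` to it. The year 2018 in the sample is dropped because it is not 7 digits long.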

But wait there's more

(NB: this is for publicly available items)

# Assumes mn_url, query_endpoint, solr_query, and object_endpoint are already set
# (their definitions are not shown here).

# Get the list of identifiers from the Solr query (CSV with an "id" header line)
identifiers=$(curl -s "${mn_url}${query_endpoint}/${solr_query}")

count=0

# For each identifier, download the EML and process it
for identifier in $identifiers; do

  # Skip the first line (the CSV header)
  if [[ "$identifier" == "id" ]]; then continue; fi
  count=$(( count + 1 ))
  echo "${count}) ${identifier}"

  # Call the DataONE MNRead.get() call to grab the EML
  xml=$(curl -s "${mn_url}${object_endpoint}/${identifier}")
  xml=$(xmlstarlet fo <<< "${xml}")

  # Uncomment to print the whole EML document for debugging
  # echo "${xml}"

  # Use xmlstarlet to find the elements in each EML document to get the award number
  award_numbers=$(xmlstarlet sel -t -v "/eml:eml/dataset/project/funding/para" -n <<< "${xml}")
  if [[ "$award_numbers" != "" ]]; then
    printf "%s\n" "$award_numbers"
  else
    printf "%s\n" "No funding element found"
  fi
done
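The loop relies on four variables whose definitions are not included in these notes. One plausible setup, pieced together from the URLs used earlier in this thread, is sketched below. The `solr_query` value in particular is an assumption and not the original script's setting: it uses the `formatType` field to limit results to metadata records, which is worth checking against the field list at arcticdata.io/metacat/d1/mn/v2/query/solr.

```shell
#!/usr/bin/env bash
# Assumed values, reconstructed from URLs elsewhere in these notes,
# not taken from the original script.
mn_url="https://arcticdata.io/metacat/d1/mn/v2"   # member node base URL
query_endpoint="/query/solr"                       # Solr query service
object_endpoint="/object"                          # MNRead.get() service
# Return only the id field as CSV. formatType:METADATA keeps data files out of
# the list, and -obsoletedBy:* drops obsolete versions.
solr_query="?q=formatType:METADATA+-obsoletedBy:*&fl=id&rows=5000&wt=csv"

# Print the two kinds of URL the loop will request, to check them before running it.
echo "${mn_url}${query_endpoint}/${solr_query}"
echo "${mn_url}${object_endpoint}/doi:10.18739/A2MS0P"
```

Placing these assignments at the top of the script, before the identifiers line, gives the loop everything it refers to.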


- Save this script as scriptname.sh, where "scriptname" is the name you've chosen
- In the terminal, navigate to the folder your script is saved in (for example /Documents/projects) and type the following command to make the script executable
```chmod +x scriptname.sh```
- Now you should be able to run your script from the command line by typing
```./scriptname.sh```
- If you want to run the script and have it output to a file, say a .csv for example, run this:  
```./scriptname.sh > file.csv```
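Once the script's output is in a file, the original question (which of my award numbers already have data in the ADC?) becomes a comparison of two lists. The sketch below uses invented award numbers and hypothetical file names: `my_awards.txt` for the awards of interest and `adc_output.txt` for the saved script output.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)

# Invented sample data standing in for the real files.
printf '%s\n' 1234567 7654321 2222222 > "$workdir/my_awards.txt"
printf '%s\n' '1) doi:10.xxxx/EXAMPLE1' 'NSF 1234567' \
              '2) doi:10.xxxx/EXAMPLE2' 'No funding element found' \
              '3) doi:10.xxxx/EXAMPLE3' 'PLR-2222222' > "$workdir/adc_output.txt"

# -F: treat each award number as a fixed string; -o: print only the matched text;
# -f: read the search strings from a file. sort -u removes repeats.
found=$(grep -oFf "$workdir/my_awards.txt" "$workdir/adc_output.txt" | sort -u)
echo "Awards that appear in the output:"
echo "$found"

# Awards from the list that never appear in the output
# (-v inverts the match, -x requires whole-line matches).
printf '%s\n' "$found" > "$workdir/found.txt"
missing=$(grep -vxFf "$workdir/found.txt" "$workdir/my_awards.txt")
echo "Awards not found:"
echo "$missing"

rm -r "$workdir"
```

Fixed-string matching can also hit a 7-digit sequence embedded in a longer number or identifier, so any surprising matches are worth checking by hand.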

et voila.

#### R process:
vlraymond commented 6 years ago

from Chris: After our quick walkthrough of Solr yesterday, I wanted to point out the official documentation in case you need to look up the details: http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.2.pdf . This guide is for Solr server admins (installation, setup, etc), but starting on page 241 there is a section called Query Syntax and Parsing that has all the gory details for parameters and values you can use. We basically discussed the q= parameter from the Standard Query Parser, which is described on page 247. Also, Dave Vieglais put together a query tool to help learn the syntax: https://examples.dataone.org/querycn.html - this one queries the CN, but maybe it’s helpful.

csjx commented 6 years ago

Hi @vlraymond : A quick note for your writeup: the Solr syntax uses a colon to delimit field names and values in the query string, and I think in the beginning of your notes you wrote , = "field" , "value" with an example of arcticdata.io/metacat/d1/mn/v2/query/solr/?q=*,* (which should be q=*:*). The rest of the notes use the colon though. Cheers.

kameyer commented 6 years ago

Note: In-house training/learning (for VR & KM)