training on how to query the ADC #8

Closed vlraymond closed 4 years ago

vlraymond commented 6 years ago

I want to be able to batch search multiple NSF award numbers to see whether the data from existing awards is in the ADC.

From Jeannette:

Check out the intro to Solr chapter in our training manual:

From Chris

In browser:


Preliminary Queries:

To display the capabilities of the member node, including core, storage, and replication capabilities. MNRead includes the Query API.

Use this to see what query capabilities there are

To list all fields and descriptions for ADC

Query strings, what do the bits and pieces mean

- `?` = start of the parameters (key=value pairs) sent with the request
- `:` = separates field and value, as in `field:value`
- `fl` = fields to return
- `rows` = number of results to return
- `wt` = output format
- `+-obsoletedBy:` = remove obsolete versions
- `beginDate:[YYYY-MM-DDT"HR:MN"Z%20TO%20YYYY-MM-DDT"HR:MN"Z]` = date range (start TO end)
- `%20` = URL-encoded space
- `origin` = creator of metadata / dataset
- `originator` = investigator or organization name
- `*` = all
- `+` = space
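The `%20` entry above is easiest to see by encoding a filter by hand. This is a minimal sketch, assuming a POSIX shell with `sed`; the two dates are made-up example values, not taken from ADC.

```shell
# Write the range filter with ordinary spaces first (the dates are example values).
range='beginDate:[2000-01-01T00:00:00Z TO 2010-12-31T23:59:59Z]'

# Replace every space with %20 so the filter can be pasted into a URL.
encoded=$(printf '%s' "${range}" | sed 's/ /%20/g')

echo "${encoded}"
```

Only the spaces change; the field name, brackets, and timestamps are passed through untouched.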


Query all fields, all values: `*:*`

Query the title field for values containing the word "soil": `*soil*`

Query the title field for values containing the word "soil", with only title and ID returned: `*soil*&fl=id,title`

Query the title field for values containing the word "soil", with only the title returned, 100 rows, in JSON format: `*soil*&fl=title&rows=100&wt=json`
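The browser queries above all follow one pattern: an endpoint, then `?q=`, then optional `fl`, `rows`, and `wt` parameters joined by `&`. The sketch below assembles such a URL in a shell function. The `base_url` value and the function name `build_query_url` are hypothetical placeholders, and `title:*soil*` assumes the `field:value` syntax described in the notes.

```shell
# Hypothetical placeholder; substitute the member node's real query endpoint.
base_url="https://example.org/query/solr/"

# Join a Solr query (q) with optional fl, rows, and wt parameters.
# Usage: build_query_url QUERY [FIELDS] [ROWS] [FORMAT]
build_query_url() {
  url="${base_url}?q=$1"
  if [ -n "${2:-}" ]; then url="${url}&fl=$2"; fi
  if [ -n "${3:-}" ]; then url="${url}&rows=$3"; fi
  if [ -n "${4:-}" ]; then url="${url}&wt=$4"; fi
  printf '%s\n' "${url}"
}

# Title contains "soil", return only the title, 100 rows, JSON output.
build_query_url 'title:*soil*' 'title' 100 json
```

Keeping the query in single quotes stops the shell from expanding the `*` characters before they reach the URL.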

Command line:

curl to a web server; the query below returns only origin as .csv:
curl "*&fl=origin&wt=csv"

curl to a web server; the query below returns up to 5000 origin names as .csv:
curl "*&fl=origin&rows=5000&wt=csv"

curl to a web server; the query below returns up to 5000 origin names as .csv and keeps only the lines matching "Grebmeier":
curl "*&fl=origin&rows=5000&wt=csv" | grep Grebmeier

curl to a web server; the query below counts the lines matching "Grebmeier":
curl "*&fl=origin&rows=5000&wt=csv" | grep Grebmeier | wc -l

curl to a web server; the query below sorts the "origin" items matching Grebmeier, keeps the unique ones, and counts them:
curl "*&fl=origin&rows=5000&wt=csv" | grep Grebmeier | sort | uniq | wc -l

The query below returns all unique "origin" items:
curl "*&fl=origin&rows=5000&wt=csv" | sort | uniq

Query for ID and return the unique identifiers of the documents:
curl "*&fl=id&rows=5000&wt=csv" | sort | uniq
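The `grep | sort | uniq | wc` steps can be tried offline before pointing them at the server. This sketch runs them on a made-up sample standing in for the one-column CSV the query returns; every name in `sample_csv` is invented for illustration.

```shell
# Made-up sample of a one-column CSV of origin values (header line first).
sample_csv='origin
J. Grebmeier
Example Person
J. Grebmeier
Grebmeier J.
Another Example'

# Count every line that mentions Grebmeier (grep -c is shorthand for grep | wc -l).
matches=$(printf '%s\n' "${sample_csv}" | grep -c 'Grebmeier')

# Count distinct spellings; sort -u is shorthand for sort | uniq.
# tr strips the padding some wc implementations add.
unique=$(printf '%s\n' "${sample_csv}" | grep 'Grebmeier' | sort -u | wc -l | tr -d ' ')

echo "matching lines: ${matches}"
echo "unique values: ${unique}"
```

The two counts differ whenever the same person is entered more than once, which is why the unique count is the more useful number for spotting spelling variants.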

How to pull a list of award numbers in ADC:


Getting set up

Test xmlstarlet with one record:
curl ""

To tidy up outputs use the "fo" command:
curl "" | xmlstarlet fo

To drill down to funding, tell xmlstarlet what part of EML to look in:
curl -s "" | xmlstarlet sel -t -v "/eml:eml/dataset/project/funding/para" -n
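The `eml:` prefix in that XPath has to be bound to the document's namespace. As far as I know, recent xmlstarlet versions pick up namespaces declared on the root element automatically; if that does not work, the prefix can be bound explicitly with `-N`. The sketch below tries this on a tiny local file. The sample document, the award number in it, and the EML 2.1.1 namespace URI are assumptions for illustration; check the root element of a real record for the URI it actually declares.

```shell
# Write a minimal, made-up EML-like document to a temporary file.
sample_eml=$(mktemp)
cat > "${sample_eml}" <<'EOF'
<?xml version="1.0"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
  <dataset>
    <project>
      <funding><para>NSF award 1234567</para></funding>
    </project>
  </dataset>
</eml:eml>
EOF

funding=""
if command -v xmlstarlet >/dev/null 2>&1; then
  # -N binds the eml prefix used in the XPath to the namespace URI.
  funding=$(xmlstarlet sel -N eml="eml://ecoinformatics.org/eml-2.1.1" \
    -t -v "/eml:eml/dataset/project/funding/para" -n "${sample_eml}")
  echo "${funding}"
else
  echo "xmlstarlet is not installed; skipping"
fi

rm -f "${sample_eml}"
```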

But wait there's more

(NB: this is for publicly available items)

# mn_url, query_endpoint, object_endpoint, and solr_query must already be set (see "Getting set up")
identifiers=$(curl -s "${mn_url}${query_endpoint}/${solr_query}")

count=0

# For each identifier, download the EML and process it
for identifier in $identifiers; do

  # Skip the first line (the "id" column header)
  if [[ "$identifier" == "id" ]]; then continue; fi
  count=$(( count + 1 ))
  echo "${count}) ${identifier}"

  # Call the DataONE MNRead.get() call to grab the EML
  xml=$(curl -s "${mn_url}${object_endpoint}/${identifier}")
  xml=$(xmlstarlet fo <<< "${xml}")

  # Print the formatted EML
  echo "${xml}"

  # Use xmlstarlet to find the elements in each EML document to get the award number
  award_numbers=$(xmlstarlet sel -t -v "/eml:eml/dataset/project/funding/para" -n <<< "${xml}")
  if [[ "$award_numbers" != "" ]]; then
    printf "%s\n" "$award_numbers"
  else
    printf "%s\n" "No funding element found"
  fi
done

- Save this script as scriptname, where "scriptname" is the name you've chosen
- In the terminal navigate to the folder your script is saved in (for example /Documents/projects) and type the following command to make the script executable
```chmod +x scriptname```
- Now you should be able to run your script from the command line by typing `./scriptname`
- If you want to run the command and have it output to a file, say a .csv for example, run this:  
```./scriptname > file.csv```

et voila.
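To get from the script's output back to the original question (which of my award numbers are already in ADC), the saved output can be checked against a list of awards. This is a rough sketch, assuming award numbers appear as standalone 7-digit runs in the funding text, which may not hold for every record. The sample output and both award lists are made up.

```shell
# Made-up example of what the script might have written to file.csv.
funding_text='1) example-id-1
NSF award 1234567 and 7654321
2) example-id-2
No funding element found'

# Award numbers to look for (invented values), one per line.
wanted='1234567
1111111'

# Pull out every standalone 7-digit number and de-duplicate the list.
found=$(printf '%s\n' "${funding_text}" | grep -owE '[0-9]{7}' | sort -u)

# Report, award by award, whether it shows up in the output.
for award in ${wanted}; do
  if printf '%s\n' "${found}" | grep -qx "${award}"; then
    echo "${award}: found"
  else
    echo "${award}: not found"
  fi
done
```

With real data, `funding_text` would be replaced by the contents of the output file, for example by piping `file.csv` into the same `grep`.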

#### R process:
vlraymond commented 6 years ago

from Chris: After our quick walkthrough of Solr yesterday, I wanted to point out the official documentation in case you need to look up the details: . This guide is for Solr server admins (installation, setup, etc), but starting on page 241 there is a section called Query Syntax and Parsing that has all the gory details for parameters and values you can use. We basically discussed the q= parameter from the Standard Query Parser discussed on page 247. Also, Dave Vieglais put together a query tool to help learn the syntax: - this one queries the CN, but maybe it’s helpful.

csjx commented 6 years ago

Hi @vlraymond: A quick note for your writeup: the Solr syntax uses a colon to delimit field names and values in the query string, and I think in the beginning of your notes you wrote `, = "field","value"` with an example of `*,*` (which should be `q=*:*`). The rest of the notes use the colon though. Cheers.

kameyer commented 6 years ago

Note: In-house training/learning (for VR & KM)