ess-acppo / ag-bie

This repository has the code to the agriculture's implementation of ALA BIE

verify generated/imported SOLR data against input CSV (missing records) #9

Closed mbohun closed 4 years ago

mbohun commented 6 years ago

@ess-acppo-djd identified 5 missing records between the input tblBiota_20180620.csv and the generated SOLR index.

mbohun commented 6 years ago

check_tblBiota.sh

#!/bin/bash

# Extract the first column values from the CSV file and remove the enclosing double quotes.
# NOTE: the header row is not skipped, so the literal "intBiotaID" is queried as well.
for intBiotaID in $(cut -d ',' -f1 tblBiota_20180620.csv | sed -e 's/"//g')
do
    # NOTE: curl -L is needed in order to follow HTTP 301 redirects to linked records
    #       (for example, intBiotaID=106779 redirects to another record).
    json=$(curl -s -L --header 'Accept: application/json' "https://ag-bie.oztaxa.com/ws/species/${intBiotaID}")
    if [ "$(echo "${json}" | jq '. | has("error")')" == "true" ]; then
        echo "TEST: ${intBiotaID} error => $(echo "${json}" | jq '.error')"
    fi
done
ubuntu@ip-172-31-2-29:/tmp$ ./check_tblBiota.sh
TEST: intBiotaID error => "Not Found"
TEST: 102340 error => "Not Found"
TEST: 103926 error => "Not Found"
TEST: 71079 error => "Not Found"
TEST: 112099 error => "Not Found"
TEST: 30 error => "Not Found"
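
The first line of this output comes from the CSV header row, which the loop treats as if it were an ID. A minimal sketch of a variant that skips the header is given here; the function names `list_biota_ids` and `check_biota_id` are hypothetical, and the URL is the one already used in check_tblBiota.sh.

```shell
#!/bin/bash

# Print the intBiotaID column of a fully quoted CSV, skipping the header row.
list_biota_ids() {
    tail -n +2 "$1" | cut -d ',' -f1 | sed -e 's/"//g'
}

# Query one ID and print a line only when the service reports an error.
check_biota_id() {
    local id="$1" json
    json=$(curl -s -L --header 'Accept: application/json' "https://ag-bie.oztaxa.com/ws/species/${id}")
    if [ "$(echo "${json}" | jq 'has("error")')" == "true" ]; then
        echo "TEST: ${id} error => $(echo "${json}" | jq '.error')"
    fi
}

# Usage:
#   list_biota_ids tblBiota_20180620.csv | while read -r id; do check_biota_id "$id"; done
```

Splitting the ID extraction from the HTTP check also makes the extraction step testable without network access.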

Details of the above 5 records are as follows (the CSV header row is shown first):

"intBiotaID","intParentID","vchrEpithet","vchrFullName","vchrYearOfPub","vchrAuthor","vchrNameQualifier","chrElemType","vchrRank","chrKingdomCode","intOrder","vchrParentage","bitChangedComb","bitShadowed","bitUnplaced","bitUnverified","bitAvailableName","bitLiteratureName","dtDateCreated","vchrWhoCreated","dtDateLastUpdated","vchrWhoLastUpdated","txtDistQual","GUID"
"102340","20","Phytobiota","","","","","KING ","","P ","0","\20\102340","False","False","False","False","True","False","2003-07-28 11:33:17.857000000","Clayton Winter","2003-07-28 11:33:24.997000000","Clayton Winter","","{9B626B79-DE67-4B58-849C-2B5429F9A83B}"
"103926","64792","Xyleutes eucalypti: Walker [misspelling!]","Xyleutes eucalypti: Walker [misspelling!]","","","","SP   ","","A ","0","\1\106786\6\100975\12\52112\101129\101130\101134\58791\74799\64792\103926","False","False","False","False","False","True","2004-09-27 12:48:37.270000000","graham brown","2004-09-27 12:48:40.630000000","graham brown","","{4F19BBB1-4097-4804-9B48-2F6E1394B4AF}"
"71079","66889","hirtus","Croton hirtus L’herit","","L’herit","","SP   ","","P ","0","\20\102341\102343\101427\21\22\102360\99968\66575\66889\71079","False","False","False","False","False","False","2003-03-25 12:54:09.450000000","Migration","2004-04-07 21:19:27.373000000","sa","","{51ABE293-3031-4310-894B-2353BF4C32E8}"
"112099","101848","Ornithogalum Mosaic Virus","Potyvirus (definitive_species) Ornithogalum Mosaic Virus Smith and Brierley, 1944a","1944a","Smith and Brierley","","SP   ","","V ","0","\101171\101661\104483\61073\61217\101848\112099","False","False","False","False","False","False","2016-09-05 10:37:37.967000000","NAQSTaxaTree","2016-09-05 13:47:30.587000000","AGDAFF\Teakle Graham","","{C0B11D33-42CD-4A55-A410-863A2A0CFD87}"
"30","106089","<No_Species_Entered>","<No_Species_Entered>","","","","     ","","A ","0","\24\106089\30","False","False","False","False","False","False","2003-03-25 12:54:09.450000000","Data Conversion","2007-06-12 12:29:30.250000000","Graham Brown","","{7EB978EA-7584-4285-9DA2-D66FAE5F1B3D}"
charvolant commented 6 years ago

Some of these are being rejected early by the Talend processing. They can be found in /data/work/taxxas/Processed/rejected.csv (there's also a vernacular_rejected.csv). The sanity checking rules may be overly strict.

ess-acppo-djd commented 6 years ago

I've already located these and am preparing to have the source data corrected. They appear to be rejected for using unexpected characters in one of FullName, Epithet, Author or YearOfPub. There is one other record being dropped somewhere (Phytobiota, a synonym for Plantae) and I've yet to hunt it down.
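
A rough way to pre-screen the input CSV for such records is sketched here. The actual Talend rules are not shown in this thread, so the allowlist (ASCII letters, digits, space, `.`, `,`, `&`, `-` and parentheses) is purely an assumption, as is the function name `flag_suspect_names`. The sketch relies on every field being double-quoted, so it splits on the three-character sequence `","` rather than on bare commas.

```shell
#!/bin/bash

# Print "<intBiotaID> <column>" for each name field that contains a character
# outside an ASSUMED allowlist. The real Talend validation rules may differ.
flag_suspect_names() {
    awk -F'","' '
        NR == 1 { next }                      # skip the header row
        {
            id = $1; sub(/^"/, "", id)        # first field keeps its opening quote
            name[3] = "vchrEpithet";   name[4] = "vchrFullName"
            name[5] = "vchrYearOfPub"; name[6] = "vchrAuthor"
            for (i = 3; i <= 6; i++)
                if ($i ~ /[^A-Za-z0-9 .,&()-]/)
                    print id, name[i]
        }
    ' "$1"
}

# Usage:
#   flag_suspect_names tblBiota_20180620.csv
```

Any ID it prints is only a candidate for review; the reject files remain the authoritative record of what was actually dropped.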

ess-acppo-djd commented 6 years ago

The Phytobiota record gets stripped out into 'invalid_synonyms.csv' by the process that creates the directory /data/work/taxxas/DwC.
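
To see which reject file a missing ID ended up in, something like the following sketch could be used. It assumes the reject files keep the double-quoted intBiotaID as their first column, which is not confirmed in this thread; the function name `find_rejected_id` is hypothetical.

```shell
#!/bin/bash

# Print the names of the given files that contain a row starting with the quoted ID.
# ASSUMPTION: the reject files keep "intBiotaID" as their first, double-quoted column.
find_rejected_id() {
    local id="$1"
    shift
    grep -l -- "^\"${id}\"," "$@" 2>/dev/null
}

# Usage (add further reject files as extra arguments):
#   find_rejected_id 103926 /data/work/taxxas/Processed/rejected.csv
```

The function prints nothing and returns a non-zero status when the ID is not found in any of the files.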

moziauddin commented 4 years ago

A test script has already been added. It can check which names are missing, using either ID or name.