mbohun closed this issue 4 years ago
#!/bin/bash
# Extract the first column (intBiotaID) values from the CSV file and strip
# the enclosing double-quotes. Note that the header row is iterated as well,
# which produces the spurious "TEST: intBiotaID" line in the output below.
for intBiotaID in $(cut -d ',' -f1 tblBiota_20180620.csv | sed -e 's/"//g')
do
    # NOTE: curl -L is needed in order to follow HTTP 301 redirects to the
    # linked records (for example intBiotaID=106779 redirects to another record)
    json=$(curl -s -L --header 'Accept: application/json' "https://ag-bie.oztaxa.com/ws/species/${intBiotaID}")
    if [ "$(echo "${json}" | jq '. | has("error")')" == "true" ]; then
        echo "TEST: ${intBiotaID} error => $(echo "${json}" | jq '.error')"
    fi
done
ubuntu@ip-172-31-2-29:/tmp$ ./check_tblBiota.sh
TEST: intBiotaID error => "Not Found"
TEST: 102340 error => "Not Found"
TEST: 103926 error => "Not Found"
TEST: 71079 error => "Not Found"
TEST: 112099 error => "Not Found"
TEST: 30 error => "Not Found"
Details of the above 5 records are as follows (the first `TEST: intBiotaID` line is the CSV header row being queried, not a record):
"intBiotaID","intParentID","vchrEpithet","vchrFullName","vchrYearOfPub","vchrAuthor","vchrNameQualifier","chrElemType","vchrRank","chrKingdomCode","intOrder","vchrParentage","bitChangedComb","bitShadowed","bitUnplaced","bitUnverified","bitAvailableName","bitLiteratureName","dtDateCreated","vchrWhoCreated","dtDateLastUpdated","vchrWhoLastUpdated","txtDistQual","GUID"
"102340","20","Phytobiota","","","","","KING ","","P ","0","\20\102340","False","False","False","False","True","False","2003-07-28 11:33:17.857000000","Clayton Winter","2003-07-28 11:33:24.997000000","Clayton Winter","","{9B626B79-DE67-4B58-849C-2B5429F9A83B}"
"103926","64792","Xyleutes eucalypti: Walker [misspelling!]","Xyleutes eucalypti: Walker [misspelling!]","","","","SP ","","A ","0","\1\106786\6\100975\12\52112\101129\101130\101134\58791\74799\64792\103926","False","False","False","False","False","True","2004-09-27 12:48:37.270000000","graham brown","2004-09-27 12:48:40.630000000","graham brown","","{4F19BBB1-4097-4804-9B48-2F6E1394B4AF}"
"71079","66889","hirtus","Croton hirtus L’herit","","L’herit","","SP ","","P ","0","\20\102341\102343\101427\21\22\102360\99968\66575\66889\71079","False","False","False","False","False","False","2003-03-25 12:54:09.450000000","Migration","2004-04-07 21:19:27.373000000","sa","","{51ABE293-3031-4310-894B-2353BF4C32E8}"
"112099","101848","Ornithogalum Mosaic Virus","Potyvirus (definitive_species) Ornithogalum Mosaic Virus Smith and Brierley, 1944a","1944a","Smith and Brierley","","SP ","","V ","0","\101171\101661\104483\61073\61217\101848\112099","False","False","False","False","False","False","2016-09-05 10:37:37.967000000","NAQSTaxaTree","2016-09-05 13:47:30.587000000","AGDAFF\Teakle Graham","","{C0B11D33-42CD-4A55-A410-863A2A0CFD87}"
"30","106089","<No_Species_Entered>","<No_Species_Entered>","","",""," ","","A ","0","\24\106089\30","False","False","False","False","False","False","2003-03-25 12:54:09.450000000","Data Conversion","2007-06-12 12:29:30.250000000","Graham Brown","","{7EB978EA-7584-4285-9DA2-D66FAE5F1B3D}"
Some of these are being rejected early by the Talend processing. They can be found in /data/work/taxxas/Processed/rejected.csv (there is also a vernacular_rejected.csv). The sanity-checking rules may be overly strict.
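To confirm which of the missing IDs landed in the reject files, a small helper like the following could be used. This is a sketch only: it assumes rejected.csv keeps the same quoted intBiotaID first column as the input CSV.

```shell
#!/bin/bash
# find_rejected: print which of the given IDs occur in the first (quoted)
# column of a rejects CSV.
# Usage: find_rejected <rejects.csv> <id> [<id> ...]
find_rejected() {
    local csv="$1"; shift
    local id
    for id in "$@"; do
        # strip the double-quotes, then look for an exact whole-line match
        if cut -d ',' -f1 "${csv}" | sed -e 's/"//g' | grep -qx "${id}"; then
            echo "${id}"
        fi
    done
}

# Example (paths and IDs from the comment above):
#   find_rejected /data/work/taxxas/Processed/rejected.csv 102340 103926 71079 112099 30
```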
I've already located these and am preparing to have the source data corrected. They appear to be rejected for using unexpected characters in one of FullName, Epithet, Author or YearOfPub. There is one other record being dropped somewhere (Phytobiota, a synonym for Plantae) and I've yet to hunt it down.
It gets stripped out into 'invalid_synonyms.csv' by the process that creates the directory /data/work/taxxas/DwC
The test script has already been added. It can check which names are missing, using either an ID or a name.
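Looking a record up by name (rather than by ID, as the script above does) could be sketched as follows. This assumes the ag-bie instance exposes the standard ALA BIE search endpoint `/ws/search.json?q=...`; if it does not, the helper below can still be pointed at whatever search response the instance returns.

```shell
#!/bin/bash
# has_results: read a BIE search JSON document on stdin and print "missing"
# if searchResults.totalRecords is 0 (or absent), "found" otherwise.
has_results() {
    local total
    # `// 0` makes jq fall back to 0 when the field is null or missing
    total=$(jq -r '.searchResults.totalRecords // 0')
    if [ "${total}" -eq 0 ]; then
        echo "missing"
    else
        echo "found"
    fi
}

# Example use (requires network access to the BIE instance):
#   curl -s -L --header 'Accept: application/json' \
#     "https://ag-bie.oztaxa.com/ws/search.json?q=Phytobiota" | has_results
```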
@ess-acppo-djd identified 5 missing records between the input tblBiota_20180620.csv and the generated SOLR index.
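The input-vs-index comparison could also be done offline on plain ID lists, given one list cut from tblBiota_20180620.csv and one dumped from the SOLR index (by whatever means is available); the filenames here are illustrative, not actual paths from this setup.

```shell
#!/bin/bash
# missing_ids: print the IDs present in the first file but absent from the
# second. comm requires sorted input, so both lists are sorted on the fly.
# Usage: missing_ids <input_ids.txt> <index_ids.txt>
missing_ids() {
    comm -23 <(sort "$1") <(sort "$2")
}

# Example (hypothetical file names):
#   cut -d ',' -f1 tblBiota_20180620.csv | sed -e 's/"//g' > input_ids.txt
#   missing_ids input_ids.txt solr_index_ids.txt
```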