Open mbohun opened 6 years ago
There are known issues with the data in tblBiota that result from functional deficiencies and bugs in the originating software (NAQS TopWatch database, Taxon explorer component, based on CSIRO BioLink database).
note multiple author names for references are usual. Authors with brackets e.g. (Muell) are "Base Authors" That stuff could be cleaned up later?
The load written by @ArnabRaxit separated author names, but I expect some of the edge cases would have had incorrect results. I'd prefer to have fairly punitive filtering on input as my experience is that bad data that goes into production stays there. Author names are particularly rough in the input data, due to the design of the 'new name' dialogue in the Taxon Explorer. They often contain unmatched parentheses, dates, and parts of subspecific names (subsp. and var. especially).
@ess-acppo-djd our experience is that focussing on fixing all the data before loading it into a system that is designed to help you fix all the data means it never happens. Our experience has been that the NSL (Binti) system shines a light on bad data, and people fix it much more quickly.
Having said that: obvious and relatively simple bulk fixes that don't get bogged down are always a good thing. but the person doing the fixing needs to understand Names or have ready access to someone who does.
verify_tblBiota_dates.sh
#!/bin/bash
cat $1 | while read line
do
year_of_pub=`echo $line | cut -d'"' -f10 | grep -v "^[1-2][0-9][0-9][0-9]$"`
[[ ! -z $year_of_pub ]] && echo "$line"
done
mbohun@linux-cr70:~/src/ANBG/issues/14_input_data_tblBiota_CLEANUP> time ./verify_tblBiota_dates.sh tblBiota_20180808.csv > tblBiota_20180808_invalid_year_of_pub.csv
real 1m26.365s
user 1m34.788s
sys 0m34.870s
mbohun@linux-cr70:~/src/ANBG/issues/14_input_data_tblBiota_CLEANUP> ls -lahF
total 17M
drwxr-xr-x 2 mbohun users 177 Aug 8 16:45 ./
drwxr-xr-x 4 mbohun users 163 Aug 8 12:48 ../
-rw-r--r-- 1 mbohun users 154 Aug 8 12:56 cleanup_tblBiota.py
-rw-r--r-- 1 mbohun users 8.4M Aug 8 15:54 tblBiota_20180808.csv
-rw-r--r-- 1 mbohun users 8.4M Aug 8 15:55 tblBiota_20180808.csv.orig
-rw-r--r-- 1 mbohun users 12K Aug 8 16:47 tblBiota_20180808_invalid_year_of_pub.csv
-rwxr-xr-x 1 mbohun users 171 Aug 8 16:41 verify_tblBiota_dates.sh*
mbohun@linux-cr70:~/src/ANBG/issues/14_input_data_tblBiota_CLEANUP> cat tblBiota_20180808_invalid_year_of_pub.csv | wc -l
34
tblBiota_20180808_invalid_year_of_pub.csv
NOTE:
echo $problem_date | sed -e 's/(//g' -e 's/)//g'
and re-test the date again, and if it passes now (after we stripped all (
, )
) then just write the sanitized/fixed line into the CSV
In order to load data (load the CSV tblBiota we are getting daily from NAQS) into the DB the CSV tblBiota data has to be cleaned up, as follows:
1906
; however the CSV includes year values like:"%Y-%m-%d %H:%m"
NOTES: