ess-acppo / nsl-infra

This repository contains code to assist with installing the application stack required by the NSL apps (editor, mapper and services) in an automated fashion. It also contains Groovy code to enable CI/CD using Jenkins...

input data CSV (tblBiota) cleanup #14

Open mbohun opened 6 years ago

mbohun commented 6 years ago

In order to load the tblBiota CSV data we receive daily from NAQS into the DB, the data has to be cleaned up as follows:

NOTES:

ess-acppo-djd commented 6 years ago

There are known issues with the data in tblBiota that result from functional deficiencies and bugs in the originating software (NAQS TopWatch database, Taxon explorer component, based on CSIRO BioLink database).

pmcneil commented 6 years ago

Note that multiple author names for references are usual. Authors in brackets, e.g. (Muell), are "Base Authors". That stuff could be cleaned up later?
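
For anyone not steeped in the nomenclature, here is a tiny, purely illustrative snippet of how a bracketed base author could be split from the combination author. The example string '(Muell.) Benth.' is made up and not taken from tblBiota.

#!/bin/bash
# Illustration only: split a "(Base Author) Combination Author" string.
author='(Muell.) Benth.'                               # hypothetical example
base=$(sed -n 's/^(\([^)]*\)).*$/\1/p' <<< "$author")  # -> Muell.
combination=$(sed 's/^([^)]*) *//' <<< "$author")      # -> Benth.
echo "base author:        $base"
echo "combination author: $combination"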

ess-acppo-djd commented 6 years ago

The load written by @ArnabRaxit separated author names, but I expect some of the edge cases would have had incorrect results. I'd prefer to have fairly punitive filtering on input as my experience is that bad data that goes into production stays there. Author names are particularly rough in the input data, due to the design of the 'new name' dialogue in the Taxon Explorer. They often contain unmatched parentheses, dates, and parts of subspecific names (subsp. and var. especially).
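
The kind of strict ("punitive") input filtering described above could look something like the sketch below. This is only an illustration, not the load @ArnabRaxit wrote: the author-name field index (12) is a placeholder, since the real tblBiota column layout isn't shown in this issue, and the checks (unmatched parentheses, digits, subsp./var. fragments) come straight from the symptoms listed above.

#!/bin/bash
# Hedged sketch: print rows whose author string looks suspicious so they
# can be rejected or reviewed before hitting the DB.
FIELD=12   # ASSUMPTION: the author name is double-quoted field 12

while IFS= read -r line
do
    author=$(echo "$line" | cut -d'"' -f"$FIELD")
    # count parentheses to detect unmatched pairs
    open=$(grep -o '(' <<< "$author" | wc -l)
    close=$(grep -o ')' <<< "$author" | wc -l)
    # flag unmatched parentheses, embedded digits (dates), or
    # subspecific-rank fragments (subsp., var.) leaking into the author field
    if [[ "$open" -ne "$close" ]] || grep -qE '[0-9]|subsp\.|var\.' <<< "$author"
    then
        echo "$line"
    fi
done < "$1"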

pmcneil commented 6 years ago

@ess-acppo-djd our experience is that focussing on fixing all the data before loading it into a system that is designed to help you fix all the data means it never happens. Our experience has been that the NSL (Binti) system shines a light on bad data, and people fix it much more quickly.

Having said that: obvious and relatively simple bulk fixes that don't get bogged down are always a good thing, but the person doing the fixing needs to understand Names, or have ready access to someone who does.

mbohun commented 6 years ago

verify_tblBiota_dates.sh

#!/bin/bash
# Print every line of the input CSV whose year-of-publication field
# (double-quoted field 10) is not a plausible four-digit year.
while IFS= read -r line
do
    year_of_pub=$(echo "$line" | cut -d'"' -f10 | grep -v "^[1-2][0-9][0-9][0-9]$")
    [[ -n "$year_of_pub" ]] && echo "$line"
done < "$1"
mbohun@linux-cr70:~/src/ANBG/issues/14_input_data_tblBiota_CLEANUP> time ./verify_tblBiota_dates.sh tblBiota_20180808.csv > tblBiota_20180808_invalid_year_of_pub.csv

real    1m26.365s
user    1m34.788s
sys     0m34.870s
mbohun@linux-cr70:~/src/ANBG/issues/14_input_data_tblBiota_CLEANUP> ls -lahF
total 17M
drwxr-xr-x 2 mbohun users  177 Aug  8 16:45 ./
drwxr-xr-x 4 mbohun users  163 Aug  8 12:48 ../
-rw-r--r-- 1 mbohun users  154 Aug  8 12:56 cleanup_tblBiota.py
-rw-r--r-- 1 mbohun users 8.4M Aug  8 15:54 tblBiota_20180808.csv
-rw-r--r-- 1 mbohun users 8.4M Aug  8 15:55 tblBiota_20180808.csv.orig
-rw-r--r-- 1 mbohun users  12K Aug  8 16:47 tblBiota_20180808_invalid_year_of_pub.csv
-rwxr-xr-x 1 mbohun users  171 Aug  8 16:41 verify_tblBiota_dates.sh*
mbohun@linux-cr70:~/src/ANBG/issues/14_input_data_tblBiota_CLEANUP> cat tblBiota_20180808_invalid_year_of_pub.csv | wc -l
34

tblBiota_20180808_invalid_year_of_pub.csv

NOTE:

  1. This is an extended version, produced on behalf of @ess-acppo-djd, that:
    • fixes/sanitizes the year string where possible
    • collects and stores all the invalid records in a separate CSV file (for human-driven review/fix)
  2. When you look at the data/CSV, most of those records can be cleaned in a script by passing any problematic entry detected through:
    echo $problem_date | sed -e 's/(//g' -e 's/)//g'

    and re-testing the date; if it now passes (after stripping all ( and ) characters), just write the sanitized/fixed line into the CSV (see the sketch after this list).

  3. The only REALLY problematic records (as in: records that have to be fixed by a human) are the "typos" ones:
    • 18857
    • 198
    • 17534
    • 19775
    • 194
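
For completeness, here is a rough sketch of the strip-and-retest step from point 2 combined with point 1's "collect the failures for human review". It reuses the field-10 assumption from verify_tblBiota_dates.sh; the output file names are made up, and this is not the contents of cleanup_tblBiota.py.

#!/bin/bash
# Sketch: sanitize the year-of-publication field (double-quoted field 10)
# by stripping parentheses; valid rows go to *_clean.csv, everything that
# still fails (e.g. typos like 18857) goes to *_needs_review.csv.
in_csv="$1"
out_clean="${in_csv%.csv}_clean.csv"
out_review="${in_csv%.csv}_needs_review.csv"
: > "$out_clean"
: > "$out_review"

is_year() { grep -q '^[1-2][0-9][0-9][0-9]$' <<< "$1"; }

while IFS= read -r line
do
    year=$(echo "$line" | cut -d'"' -f10)
    if is_year "$year"; then
        echo "$line" >> "$out_clean"
    else
        fixed=$(echo "$year" | sed -e 's/(//g' -e 's/)//g')
        if is_year "$fixed"; then
            # rewrite only the year field, keep the rest of the line intact
            echo "$line" | awk -F'"' -v OFS='"' -v y="$fixed" '{ $10 = y; print }' >> "$out_clean"
        else
            echo "$line" >> "$out_review"
        fi
    fi
done < "$in_csv"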