ess-acppo / nsl-infra

This repository contains code to assist with installing the application stack required by the NSL apps (editor, mapper and services) in an automated fashion. It also contains Groovy code to enable CI/CD using Jenkins...

input data CSV (tblBiota) cleanup #14

Open mbohun opened 6 years ago

mbohun commented 6 years ago

In order to load the tblBiota CSV data we receive daily from NAQS into the DB, the data has to be cleaned up as follows:

NOTES:

ess-acppo-djd commented 6 years ago

There are known issues with the data in tblBiota that result from functional deficiencies and bugs in the originating software (NAQS TopWatch database, Taxon explorer component, based on CSIRO BioLink database).

pmcneil commented 6 years ago

Note that multiple author names for references are usual. Authors in brackets, e.g. (Muell), are "Base Authors". That stuff could be cleaned up later?
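
For anyone not steeped in the nomenclature, here is a tiny, purely illustrative snippet of how a bracketed base author could be split from the combination author. The example string '(Muell.) Benth.' is made up and not taken from tblBiota.

#!/bin/bash
# Illustration only: split a "(Base Author) Combination Author" string.
author='(Muell.) Benth.'                               # hypothetical example
base=$(sed -n 's/^(\([^)]*\)).*$/\1/p' <<< "$author")  # -> Muell.
combination=$(sed 's/^([^)]*) *//' <<< "$author")      # -> Benth.
echo "base author:        $base"
echo "combination author: $combination"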

ess-acppo-djd commented 6 years ago

The load written by @ArnabRaxit separated author names, but I expect some of the edge cases would have had incorrect results. I'd prefer to have fairly punitive filtering on input as my experience is that bad data that goes into production stays there. Author names are particularly rough in the input data, due to the design of the 'new name' dialogue in the Taxon Explorer. They often contain unmatched parentheses, dates, and parts of subspecific names (subsp. and var. especially).
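
The kind of strict ("punitive") input filtering described above could look something like the sketch below. This is only an illustration, not the load @ArnabRaxit wrote: the author-name field index (12) is a placeholder, since the real tblBiota column layout isn't shown in this issue, and the checks (unmatched parentheses, digits, subsp./var. fragments) come straight from the symptoms listed above.

#!/bin/bash
# Hedged sketch: print rows whose author string looks suspicious so they
# can be rejected or reviewed before hitting the DB.
FIELD=12   # ASSUMPTION: the author name is double-quoted field 12

while IFS= read -r line
do
    author=$(echo "$line" | cut -d'"' -f"$FIELD")
    # count parentheses to detect unmatched pairs
    open=$(grep -o '(' <<< "$author" | wc -l)
    close=$(grep -o ')' <<< "$author" | wc -l)
    # flag unmatched parentheses, embedded digits (dates), or
    # subspecific-rank fragments (subsp., var.) leaking into the author field
    if [[ "$open" -ne "$close" ]] || grep -qE '[0-9]|subsp\.|var\.' <<< "$author"
    then
        echo "$line"
    fi
done < "$1"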

pmcneil commented 6 years ago

@ess-acppo-djd our experience is that focussing on fixing all the data before loading it into a system that is designed to help you fix all the data means it never happens. Our experience has been that the NSL (Binti) system shines a light on bad data, and people fix it much more quickly.

Having said that: obvious and relatively simple bulk fixes that don't get bogged down are always a good thing, but the person doing the fixing needs to understand Names, or have ready access to someone who does.

mbohun commented 6 years ago

verify_tblBiota_dates.sh

#!/bin/bash
# Print every line of the input CSV whose year-of-publication field
# (double-quoted field 10) is not a plausible four-digit year.
while IFS= read -r line
do
    year_of_pub=$(echo "$line" | cut -d'"' -f10 | grep -v "^[1-2][0-9][0-9][0-9]$")
    [[ -n "$year_of_pub" ]] && echo "$line"
done < "$1"
mbohun@linux-cr70:~/src/ANBG/issues/14_input_data_tblBiota_CLEANUP> time ./verify_tblBiota_dates.sh tblBiota_20180808.csv > tblBiota_20180808_invalid_year_of_pub.csv

real    1m26.365s
user    1m34.788s
sys     0m34.870s
mbohun@linux-cr70:~/src/ANBG/issues/14_input_data_tblBiota_CLEANUP> ls -lahF
total 17M
drwxr-xr-x 2 mbohun users  177 Aug  8 16:45 ./
drwxr-xr-x 4 mbohun users  163 Aug  8 12:48 ../
-rw-r--r-- 1 mbohun users  154 Aug  8 12:56 cleanup_tblBiota.py
-rw-r--r-- 1 mbohun users 8.4M Aug  8 15:54 tblBiota_20180808.csv
-rw-r--r-- 1 mbohun users 8.4M Aug  8 15:55 tblBiota_20180808.csv.orig
-rw-r--r-- 1 mbohun users  12K Aug  8 16:47 tblBiota_20180808_invalid_year_of_pub.csv
-rwxr-xr-x 1 mbohun users  171 Aug  8 16:41 verify_tblBiota_dates.sh*
mbohun@linux-cr70:~/src/ANBG/issues/14_input_data_tblBiota_CLEANUP> cat tblBiota_20180808_invalid_year_of_pub.csv | wc -l
34

tblBiota_20180808_invalid_year_of_pub.csv

NOTE:

  1. This is an extended version, produced on behalf of @ess-acppo-djd, that:
    • fixes/sanitizes the year string where possible
    • collects and stores all the invalid records in a separate CSV file (for human-driven review/fix)
  2. When you look at the data/CSV, most of those records can be cleaned in a script by passing any problematic entry detected through:
    echo $problem_date | sed -e 's/(//g' -e 's/)//g'

    and re-testing the date; if it now passes (after stripping all ( and ) characters), just write the sanitized/fixed line into the CSV (see the sketch after this list).

  3. The only REALLY problematic records (as in: records that have to be fixed by a human) are the "typos" ones:
    • 18857
    • 198
    • 17534
    • 19775
    • 194
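
For completeness, here is a rough sketch of the strip-and-retest step from point 2 combined with point 1's "collect the failures for human review". It reuses the field-10 assumption from verify_tblBiota_dates.sh; the output file names are made up, and this is not the contents of cleanup_tblBiota.py.

#!/bin/bash
# Sketch: sanitize the year-of-publication field (double-quoted field 10)
# by stripping parentheses; valid rows go to *_clean.csv, everything that
# still fails (e.g. typos like 18857) goes to *_needs_review.csv.
in_csv="$1"
out_clean="${in_csv%.csv}_clean.csv"
out_review="${in_csv%.csv}_needs_review.csv"
: > "$out_clean"
: > "$out_review"

is_year() { grep -q '^[1-2][0-9][0-9][0-9]$' <<< "$1"; }

while IFS= read -r line
do
    year=$(echo "$line" | cut -d'"' -f10)
    if is_year "$year"; then
        echo "$line" >> "$out_clean"
    else
        fixed=$(echo "$year" | sed -e 's/(//g' -e 's/)//g')
        if is_year "$fixed"; then
            # rewrite only the year field, keep the rest of the line intact
            echo "$line" | awk -F'"' -v OFS='"' -v y="$fixed" '{ $10 = y; print }' >> "$out_clean"
        else
            echo "$line" >> "$out_review"
        fi
    fi
done < "$in_csv"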