AtlasOfLivingAustralia / data-management

Data management issue tracking

data quality assertions #279

Open M-Nicholls opened 8 years ago

M-Nicholls commented 8 years ago
  1. FIRST_OF_MONTH. This seems rather lame by comparison with first of year and century? I think we wouldn't lose much by dropping this one. We have 6,350,269 records with this flag!
  2. DECIMAL_LAT_LONG_CONVERSION_FAILED. We need to go international, so we can't assume the WGS84 standard. We certainly need a test saying that a transformation to decimal degrees was required and that it failed. Here I would assume this covers any coordinates other than decimal degrees, e.g., UTMs, decimal minutes, a different datum... Currently 0 records as far as I can see.
  3. RECORDED_BY_UNPARSABLE. As this could be anything (a form of verbatim), we can't see how you could test anything meaningful. Currently 0 records.
  4. INFERRED_DUPLICATE_RECORD. I haven't seen anything specific about how you are detecting dupes but assume a combination of taxon, location, date/time, catalogueNumber...? As our code is exported, knowing the formula would be handy. 5,411,649 records.
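
For illustration only, one way such a test could work is to group records on a composite key. The sketch below assumes the fields guessed at in item 4 (taxon, location, date, catalogue number), hypothetical record dictionaries with Darwin Core-style field names, and an arbitrary coordinate rounding. It is not the formula ALA actually uses, which is the open question here.

```python
from collections import defaultdict

# Assumed key fields; the real rule set is not documented in this thread.
KEY_FIELDS = (
    "scientificName",
    "decimalLatitude",
    "decimalLongitude",
    "eventDate",
    "catalogNumber",
)


def duplicate_key(record: dict) -> tuple:
    """Build a normalised composite key; missing or empty fields become None."""
    key = []
    for field in KEY_FIELDS:
        value = record.get(field)
        if isinstance(value, str):
            # Ignore case and surrounding whitespace when comparing text.
            value = value.strip().lower() or None
        elif isinstance(value, float):
            # Assumed tolerance: coordinates equal to 4 decimal places match.
            value = round(value, 4)
        key.append(value)
    return tuple(key)


def find_inferred_duplicates(records: list) -> list:
    """Return groups of two or more records that share the same key."""
    groups = defaultdict(list)
    for record in records:
        groups[duplicate_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]
```

Which fields go into the key, and how loosely each is compared, decides how many records get flagged, which is why knowing the actual formula matters.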
nielsklazenga commented 8 years ago

I would like to make the case for keeping FIRST_OF_MONTH. If more than 10 per cent of records have this flag, this means that there are more than three times as many records from the first of the month as from any other day of the month, since an even spread would put only about 3 per cent of records on each day. This is much more likely to be caused by the fact that, if the day of the month is unknown, it is often stored as the first of the month (as you can't have 0 days and months in most RDBMSs), than by more collections or observations actually being made on the first of the month.
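
To make the arithmetic explicit, here is a rough sketch in Python. It assumes records would otherwise be spread evenly over an average month of about 30.44 days (365.25 / 12). The 10 per cent figure is the hypothetical threshold from the argument above, not a measured statistic, and the function name is made up for this example.

```python
# Assumption: an average month length of 365.25 / 12 days (about 30.44).
AVG_DAYS_PER_MONTH = 365.25 / 12


def first_of_month_excess(flagged_fraction: float) -> float:
    """Ratio of first-of-month records to the average count on any other day."""
    other_days = AVG_DAYS_PER_MONTH - 1
    per_other_day = (1 - flagged_fraction) / other_days
    return flagged_fraction / per_other_day


# Hypothetical case: 10% of all records carry the FIRST_OF_MONTH flag.
print(f"{first_of_month_excess(0.10):.2f}")  # prints 3.27
```

So at 10 per cent, the first of the month holds roughly 3.3 times as many records as an average other day, which is hard to explain by collecting behaviour alone.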

FIRST_OF_MONTH is an important flag; in fact more important than FIRST_OF_YEAR, as records for which only the event day is not known are much more frequent than those for which also the month is unknown. There is not really a test for the first day of the century, is there?

ansell commented 8 years ago

First day of the century might be useful for identifying dates whose years had only two digits before conversion to ISO8601 and may need investigation. I.e., if they were "00" or "0" in the original file they may turn out to be the first day of either 1900 or 2000, depending on how the ISO8601 conversion occurs.
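
A minimal sketch of that ambiguity, in Python. It assumes a converter that adds a fixed century to the two-digit year and defaults a missing month and day to 1 January, and it takes "first day of the century" to mean 1 January of a year ending in 00, as in the example above. Both function names are hypothetical.

```python
from datetime import date


def expand_two_digit_year(yy: str, century: int) -> int:
    """Expand a two-digit year string using an assumed century (1900 or 2000)."""
    return century + int(yy)


def is_first_of_century(d: date) -> bool:
    """True when the date is 1 January of a year ending in 00."""
    return d.month == 1 and d.day == 1 and d.year % 100 == 0


# A verbatim year of "00" with no month or day, under two conversion choices.
for century in (1900, 2000):
    converted = date(expand_two_digit_year("00", century), 1, 1)
    print(converted.isoformat(), is_first_of_century(converted))
```

Both conversions produce a valid ISO8601 date, so only a flag like this would point a reviewer back to the original two-digit value.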

nielsklazenga commented 8 years ago

Surely the invalidCollectionDate test takes care of these situations? This is one that I accidentally delivered: http://avh.ala.org.au/occurrences/119a1287-f0ed-4154-b3cf-7e6ce1ee5834.