AtlasOfLivingAustralia / la-pipelines

Living Atlas Pipelines extensions

Create the list of fields needed to be exported (DwC) #60

Closed sadeghim closed 4 years ago

sadeghim commented 4 years ago

The final list of fields that can have DwC properties is:

(
      "acceptedNameUsage",
      "acceptedNameUsageID",
      "accessRights",
      "associatedMedia",
      "associatedOccurrences",
      "associatedReferences",
      "associatedSequences",
      "associatedTaxa",
      "basisOfRecord",
      "behavior",
      "bibliographicCitation",
      "catalogNumber",
      "class",
      "classs",
      "collectionCode",
      "collectionID",
      "continent",
      "coordinatePrecision",
      "coordinateUncertaintyInMeters",
      "country",
      "countryCode",
      "county",
      "dataGeneralizations",
      "dataResourceUid", // this is read from DB but will be ignored during export
      "datasetID",
      "datasetName",
      "dateIdentified",
      "day",
      "decimalLatitude",
      "decimalLongitude",
      "disposition",
      "dynamicProperties",
      "endDayOfYear",
      "establishmentMeans",
      "eventAttributes",
      "eventDate",
      "eventID",
      "eventRemarks",
      "eventTime",
      "family",
      "fieldNotes",
      "fieldNumber",
      "footprintSpatialFit",
      "footprintSRS",
      "footprintWKT",
      "genus",
      "geodeticDatum",
      "georeferencedBy",
      "georeferencedDate",
      "georeferenceProtocol",
      "georeferenceRemarks",
      "georeferenceSources",
      "georeferenceVerificationStatus",
      "habitat",
      "higherClassification",
      "higherGeography",
      "higherGeographyID",
      "identificationID",
      "identificationQualifier",
      "identificationReferences",
      "identificationRemarks",
      "identificationVerificationStatus",
      "identifiedBy",
      "individualCount",
      "individualID",
      "informationWithheld",
      "infraspecificEpithet",
      "institutionCode",
      "institutionID",
      "island",
      "islandGroup",
      "kingdom",
      "language",
      "license",
      "lifeStage",
      "locality",
      "locationAccordingTo",
      "locationAttributes",
      "locationID",
      "locationRemarks",
      "maximumDepthInMeters",
      "maximumDistanceAboveSurfaceInMeters",
      "maximumElevationInMeters",
      "measurementAccuracy",
      "measurementDeterminedBy",
      "measurementDeterminedDate",
      "measurementID",
      "measurementMethod",
      "measurementRemarks",
      "measurementType",
      "measurementUnit",
      "measurementValue",
      "minimumDepthInMeters",
      "minimumDistanceAboveSurfaceInMeters",
      "minimumElevationInMeters",
      "modified",
      "month",
      "municipality",
      "nameAccordingTo",
      "nameAccordingToID",
      "namePublishedIn",
      "namePublishedInID",
      "namePublishedInYear",
      "nomenclaturalCode",
      "nomenclaturalStatus",
      "occurrenceAttributes",
      "occurrenceDetails",
      "occurrenceID",
      "occurrenceRemarks",
      "occurrenceStatus",
      "order",
      "organismQuantity",
      "organismQuantityType",
      "originalNameUsage",
      "originalNameUsageID",
      "otherCatalogNumbers",
      "ownerInstitutionCode",
      "parentNameUsage",
      "parentNameUsageID",
      "phylum",
      "pointRadiusSpatialFit",
      "preparations",
      "previousIdentifications",
      "recordedBy",
      "recordNumber",
      "relatedResourceID",
      "relationshipAccordingTo",
      "relationshipEstablishedDate",
      "relationshipOfResource",
      "relationshipRemarks",
      "reproductiveCondition",
      "resourceID",
      "resourceRelationshipID",
      "rightsHolder",
      "samplingEffort",
      "samplingProtocol",
      "scientificName",
      "scientificNameAuthorship",
      "scientificNameID",
      "sex",
      "specificEpithet",
      "startDayOfYear",
      "stateProvince",
      "subgenus",
      "taxonConceptID",
      "taxonID",
      "taxonomicStatus",
      "taxonRank",
      "taxonRemarks",
      "type",
      "typeStatus",
      "verbatimCoordinates",
      "verbatimCoordinateSystem",
      "verbatimDepth",
      "verbatimElevation",
      "verbatimEventDate",
      "verbatimLatitude",
      "verbatimLocality",
      "verbatimLongitude",
      "verbatimSRS",
      "verbatimTaxonRank",
      "vernacularName",
      "waterbody",
      "year"
    )
djtfmartin commented 4 years ago

Apologies that this is a bit late, but I remembered there was a "Complete Scan" job in Jenkins that retrieves a list of the fields populated in Cassandra. This might be useful for cross-checking outputs. I ran it yesterday to see whether it works, and it completed in 1 hr 16 min. Here's a chart and the output from the Jenkins job (each number is the count of records in which the field is populated).

[Chart: screenshot taken 2020-05-08 at 11:35:44 am]
Field Number of records with populated value
abcdIdentificationQualifier 3
abcdIdentificationQualifierInsertionPoint 3
abcdTypeStatus 3674
acceptedNameUsage 1743692
accessRights 16227
associatedMedia 1916338
associatedOccurrences 118197
associatedReferences 1839047
associatedSequences 429359
associatedTaxa 254071
australianHerbariumRegion 34
basisOfRecord 52360815
behavior 144816
bibliographicCitation 1276628
catalogNumber 62876582
class 27221929
collectionCode 28551398
collectionID 1356223
continent 993388
coordinatePrecision 19436948
coordinateUncertaintyInMeters 49332101
country 48443448
countryCode 27742780
countryConservation 6232467
county 12077636
cultivarName 1797
cultivated 31
dataGeneralizations 11067368
datasetID 15817765
datasetName 15758930
dateIdentified 7725969
day 2490064
decimalLatitude 82317511
decimalLatitudelatitude 1429
decimalLongitude 82317884
defaultValuesUsed 87197596
disposition 67880
distanceOutsideExpertRange 16850
duplicates 13
duplicatesOriginalInstitutionID 7
duplicatesOriginalUnitID 7
dynamicProperties 80346
easting 10948064
endDayOfYear 204399
establishmentMeans 18538077
eventDate 82978035
eventID 57328233
eventRemarks 13708159
eventTime 18632342
family 31405029
fieldNotes 243342
fieldNumber 729374
firstLoaded 60611903
footprintSRS 1121938
footprintWKT 3821973
generalisationToApplyInMetres 827043
generalisedLocality 89
genus 28613834
geodeticDatum 50622422
georeferencedBy 3323738
georeferencedDate 167582
georeferenceProtocol 19559121
georeferenceRemarks 60118
georeferenceSources 2137690
georeferenceVerificationStatus 24267319
habitat 6148397
higherClassification 1098960
higherGeography 1544549
identificationID 2609726
identificationQualifier 607372
identificationReferences 98844
identificationRemarks 9503405
identificationVerificationStatus 24100582
identifiedBy 8310807
identifierBy 14370
identifierRole 270986
individualCount 30147972
individualID 35582
informationWithheld 11215747
infraspecificEpithet 1979723
institutionCode 31007560
institutionID 2143416
institutionName 9944
island 332591
islandGroup 299980
kingdom 24957731
language 2720836
license 1906419
lifeStage 1653304
loanDate 1
loanDestination 2
loanForBotanist 1
loanIdentifier 2
loanSequenceNumber 1
locality 71102228
locationAccordingTo 372941
locationDetermined 87271244
locationID 43136286
locationRemarks 10818221
maximumDepthInMeters 3764897
maximumElevationInMeters 5543904
measurementAccuracy 85506
measurementDeterminedBy 104227
measurementDeterminedDate 223961
measurementID 223961
measurementMethod 223961
measurementRemarks 223961
measurementType 129274
measurementUnit 304638
measurementValue 219962
minimumDepthInMeters 4053540
minimumElevationInMeters 6310101
miscProperties 61677257
modified 16210736
month 9581921
municipality 258663
nameAccordingTo 4931254
namePublishedIn 725742
naturalOccurrence 52
nearNamedPlaceRelationTo 17
nomenclaturalCode 12619127
nomenclaturalStatus 41896
northing 10948060
occurrenceDetails 1126948
occurrenceID 29973652
occurrenceRemarks 19417393
occurrenceStatus 45918466
order 23940058
organismQuantity 1982358
organismQuantityType 2914176
originalNameUsage 145552
originalSensitiveValues 828548
otherCatalogNumbers 4463842
ownerInstitutionCode 14310451
parentNameUsage 218264
phenology 32
photographer 33137
photoPageUrl 6789
phylum 13642866
preparations 7009343
previousIdentifications 1408333
provenance 1
recordedBy 59172070
recordNumber 10125718
relatedResourceID 684
relationshipOfResource 141195
relationshipRemarks 4353
reproductiveCondition 4795186
rights 2407238
rightsholder 2527906
samplingEffort 16457311
samplingProtocol 35403628
scientificName 85763847
scientificNameAuthorship 21362672
scientificNameID 14913235
scientificNameWithoutAuthor 54
secondaryCollectors 177695
sex 6300144
source 223961
species 1223809
specificEpithet 24199077
startDayOfYear 213337
stateProvince 50756535
subfamily 134102
subgenus 164319
subspecies 2254
superfamily 124027
taxonConceptID 250330
taxonID 56597
taxonomicStatus 47583
taxonRank 25861795
taxonRemarks 3916684
type 2
typeStatus 405360
typifiedName 3700
userAssertionStatus 11968008
userId 1163369
verbatimCoordinates 7499432
verbatimCoordinateSystem 18283385
verbatimDateIdentified 239425
verbatimDepth 28283
verbatimElevation 3948103
verbatimEventDate 17049995
verbatimLatitude 19911119
verbatimLocality 17895022
verbatimLongitude 19911101
verbatimSRS 14617062
verbatimTaxonRank 516822
verificationDate 1
verifier 1
vernacularName 66248766
waterBody 754324
year 9752558
zone 11044802
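
For the cross-check suggested above, a minimal sketch might parse the scan output and compare it with the export list. The function names `parse_scan_output` and `cross_check` are hypothetical, and the inputs are short excerpts of the values quoted in this thread, not the full lists.

```python
def parse_scan_output(text):
    """Parse 'fieldName count' lines into a dict of field -> populated record count."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and anything that is not exactly 'name integer',
        # which also drops the multi-word header row.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        counts[parts[0]] = int(parts[1])
    return counts


def cross_check(export_fields, counts):
    """Compare the export list with the set of fields the scan found populated."""
    export = set(export_fields)
    populated = set(counts)
    return {
        "exported_and_populated": sorted(export & populated),
        "exported_but_unpopulated": sorted(export - populated),
        "populated_but_not_exported": sorted(populated - export),
    }


# Excerpts only: a few rows of the scan output and a few entries of the export list.
scan_excerpt = """Field Number of records with populated value
acceptedNameUsage 1743692
abcdTypeStatus 3674
basisOfRecord 52360815
waterBody 754324
"""
export_excerpt = ["acceptedNameUsage", "acceptedNameUsageID", "basisOfRecord", "waterbody"]

result = cross_check(export_excerpt, parse_scan_output(scan_excerpt))
for bucket, fields in result.items():
    print(bucket, fields)
```

The comparison is an exact, case-sensitive string match, so `waterBody` in the scan and `waterbody` in the export list land in different buckets. That is deliberate here: it surfaces spelling differences rather than hiding them.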
Tasilee commented 4 years ago

This is handy to compare with what @nickdos ran. The populated fields are a superset of Darwin Core terms, with some oddities such as "decimalLatitudelatitude".
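
One way to surface oddities like this is to look for scanned field names that are not in the export list but closely resemble a name that is. The sketch below uses Python's standard `difflib.get_close_matches`. The function name `suggest_dwc_terms` and the 0.75 cutoff are assumptions chosen for illustration, and the inputs are excerpts from the lists in this thread.

```python
import difflib


def suggest_dwc_terms(scanned_fields, export_fields, cutoff=0.75):
    """Map each scanned field missing from the export list to similar export names."""
    export = list(export_fields)
    known = set(export)
    suggestions = {}
    for name in scanned_fields:
        if name in known:
            continue  # exact match, nothing odd about it
        close = difflib.get_close_matches(name, export, n=3, cutoff=cutoff)
        if close:
            suggestions[name] = close
    return suggestions


# Excerpts only, taken from the two lists in this thread.
export_excerpt = ["decimalLatitude", "decimalLongitude", "rightsHolder", "waterbody"]
scanned_excerpt = ["decimalLatitude", "decimalLatitudelatitude", "rightsholder", "waterBody", "zone"]

suggestions = suggest_dwc_terms(scanned_excerpt, export_excerpt)
for odd_name, candidates in suggestions.items():
    print(odd_name, "->", candidates)
```

Similarity matching only produces candidates for a person to review. It also flags names that differ from a Darwin Core term by letter case alone, such as `rightsholder` and `waterBody` in the scan output.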

sadeghim commented 4 years ago

Here is the spreadsheet to get the field list common between our Cassandra schema and TDWG/DwC standard: https://docs.google.com/spreadsheets/d/1DkYFLzt9377fXFBROvZbb4JK5rPinvYhcfisMU4K3jE/edit?usp=sharing

sadeghim commented 4 years ago

@M-Nicholls could you please have a look at the spreadsheet and let me know if there is an issue with it? It shows all Cassandra columns, then raw columns (excluding processed and qa) and then the matched DwC term for them.
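
The three-step layout of the spreadsheet (all columns, then raw columns, then matched DwC terms) could be sketched roughly as follows. Everything specific here is an assumption: the `_p` and `_qa` suffixes for processed and QA columns, the alias from the `classs` column to the DwC term `class`, and the function names are illustrative only and are not taken from the actual Cassandra schema or the spreadsheet.

```python
# Hypothetical suffixes marking processed and QA columns; the real schema may differ.
PROCESSED_SUFFIX = "_p"
QA_SUFFIX = "_qa"


def raw_columns(all_columns):
    """Keep only the columns that are neither processed nor QA columns."""
    return [c for c in all_columns if not c.endswith((PROCESSED_SUFFIX, QA_SUFFIX))]


def match_dwc(raw, dwc_terms, aliases=None):
    """Return raw column -> DwC term for every raw column with a matching term."""
    aliases = aliases or {}
    terms = set(dwc_terms)
    matched = {}
    for column in raw:
        term = aliases.get(column, column)  # fall back to the column name itself
        if term in terms:
            matched[column] = term
    return matched


# Made-up example input, not the real column list.
all_columns = ["scientificName", "scientificName_p", "location_qa", "classs", "firstLoaded"]
dwc_terms = ["scientificName", "class"]

raw = raw_columns(all_columns)
matched = match_dwc(raw, dwc_terms, aliases={"classs": "class"})
print(raw)
print(matched)
```

Columns such as `firstLoaded` that have no matching term simply drop out of the result, which mirrors the spreadsheet's final column being empty for them.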

RobinaSanderson commented 4 years ago

From sprint 6 planning session: fields are documented and we can refer to them at a later point if necessary. This can be marked done.