ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Do we have too many attributes? #1623

Closed dustymc closed 3 years ago

dustymc commented 6 years ago

From https://github.com/ArctosDB/arctos/issues/1597

"good guidelines for such additions written in the documentation"

Ref: https://github.com/ArctosDB/arctos/issues/1620

My reservation here is that we now have ~200 attributes, half-ish undocumented (https://github.com/ArctosDB/arctos/issues/1450). Some (most!?!) of the "documentation" we do have is not useful for any purpose: "measured how?" or "Standard beak measurement for birds". These things have become numerous enough to start causing problems merely by existing. (Operators and researchers may not find what they're looking for, turning them all on causes serious performance problems, etc.)

This request is clearly data which can fit in Attributes. There's a good definition. Given enough of it, we could ask Arctos cool questions about horse teeth.

We could also push it to Media or structured data or similar, which would support the same questions but with a lot more work. ("unformatted measurements" is NOT a suitable target; these kinds of data are formatted.)

A "these things are Attributes" definition from the AWG would be very useful. (I'd probably vote to continue adding anything that fits and looking for solutions to the problems that causes, but I don't think this is my call.)

Here are existing Attributes by frequency of usage. Can anything that's not used much be removed or merged or ??


UAM@ARCTOS> select ATTRIBUTE_TYPE || ' @ ' || count(*) from attributes group by ATTRIBUTE_TYPE order by count(*) desc;

ATTRIBUTE_TYPE||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
sex @ 1869853
age class @ 702644
materials @ 544231
weight @ 383072
square @ 333484
total length @ 287081
stratigraphic position @ 286415
description @ 260740
hind foot with claw @ 259809
tail length @ 257973
reproductive data @ 237416
age @ 221849
verbatim collector @ 198110
ear from notch @ 197547
culture of origin @ 196360
provenience @ 148637
number of labels @ 113028
depth @ 110243
provenience north @ 103976
provenience east @ 92836
quadrant @ 92242
unformatted measurements @ 89757
maximum standard length @ 64538
minimum standard length @ 64538
archaeological feature @ 58124
verbatim host ID @ 50249
fat deposition @ 49126
molt condition @ 45498
location in host @ 34512
soft parts @ 30045
examined for parasites @ 24774
verbatim host sex @ 24171
value @ 21836
clutch size @ 21830
body condition @ 21298
stomach contents @ 20639
standard length @ 19901
ectoparasite examination @ 19864
snout-vent length @ 19117
parasites found @ 18628
endoparasite examination @ 18338
verbatim preservation date @ 18240
incubation stage @ 16923
skull ossification @ 16850
numeric age @ 16575
historical @ 16202
culture of use @ 14400
bursa @ 14339
endoparasites detected @ 11945
ectoparasites detected @ 11503
SNV results @ 8222
wing chord @ 5357
height @ 5039
forearm length @ 4834
framed @ 3741
object title @ 3740
dimensions @ 3374
credit line @ 3242
tragus length @ 3220
colors @ 3172
nest description @ 2760
verbatim host age @ 2528
image confirmed @ 2493
soft part color @ 2482
experimental @ 2372
inscriptions and marks @ 2340
tail condition @ 2174
crown-rump length @ 1858
curvilinear length @ 1753
axillary girth @ 1618
extension @ 1494
ear from crown @ 909
tarsus length @ 906
parts examined @ 770
trap identifier @ 667
gonad @ 608
culmen length @ 584
hind foot without claw @ 509
diploid number @ 469
abundance @ 440
breadth @ 371
tested for presence @ 334
egg content weight @ 314
bill width @ 269
right gonad length @ 216
head length @ 196
water temperature @ 183
air temperature @ 183
electrical conductivity @ 179
head width @ 169
left gonad length @ 150
wing span @ 140
radiometric date @ 133
trap type @ 125
bill depth @ 111
crop contents @ 97
hind limb length @ 83
clutch size of nest parasite @ 79
right gonad width @ 74
eggshell thickness @ 73
left gonad width @ 65
maximum total length @ 64
minimum total length @ 63
ovum @ 62
nest phenology @ 60
NAGPRA category @ 53
bill length @ 50
exhibit caption @ 45
caste @ 35
isotope value @ 30
carapace length @ 25
body width @ 25
neck width @ 19
width @ 17
tail base width @ 17
plastron length @ 15
straight carapace length @ 13
curved carapace length @ 5
year class @ 4
bursa length @ 2
brood patch @ 2

121 rows selected.
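
One way to turn the frequency listing above into a shortlist of removal or merge candidates is to add a `HAVING` filter on the count. This is only a minimal sketch: it uses an in-memory SQLite database from Python rather than the Arctos Oracle schema, the sample rows are invented, and the `rarely_used` helper and its `threshold` parameter are hypothetical names, not anything that exists in Arctos.

```python
import sqlite3

# Toy stand-in for the "attributes" table; these rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attributes (attribute_type TEXT)")
sample = [("sex",)] * 5 + [("weight",)] * 3 + [("year class",)] + [("brood patch",)] * 2
conn.executemany("INSERT INTO attributes VALUES (?)", sample)


def rarely_used(conn, threshold):
    """Return (attribute_type, count) pairs used fewer than `threshold` times, rarest first."""
    return conn.execute(
        """
        SELECT attribute_type, COUNT(*) AS cnt
        FROM attributes
        GROUP BY attribute_type
        HAVING COUNT(*) < ?
        ORDER BY cnt, attribute_type
        """,
        (threshold,),
    ).fetchall()


print(rarely_used(conn, 3))  # [('year class', 1), ('brood patch', 2)]
```

A low count alone says nothing about whether an attribute is worth keeping, so a list like this is a starting point for review, not a deletion list.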

dustymc commented 5 years ago

This is still a problem.

These are not used:

UAM@ARCTOS> select distinct attribute_type from ctattribute_type where attribute_type not in (select attribute_type from attributes);

ATTRIBUTE_TYPE
------------------------------------------------------------------------------------------------------------------------
brood parasite present
bursa width
embryo weight
forewing length
keywords
middle toe length
nottitle
numeric abundance
tooth length
tooth width

10 rows selected.

I will plan to delete them if nobody objects in the next few days.
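
A general SQL caveat applies to the `not in` query above: if the subquery ever returns a NULL, `NOT IN` matches no rows at all, while `NOT EXISTS` is unaffected. Whether `attributes.attribute_type` can actually be NULL in Arctos is not stated here, so the NULL row below is an invented assumption. The sketch uses an in-memory SQLite database from Python with toy data, not the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ctattribute_type (attribute_type TEXT);
    CREATE TABLE attributes (attribute_type TEXT);
    INSERT INTO ctattribute_type VALUES ('sex'), ('tooth length'), ('keywords');
    -- One NULL row (invented) to show how NOT IN reacts to it.
    INSERT INTO attributes VALUES ('sex'), ('sex'), (NULL);
""")

# Same shape as the query in the comment above.
NOT_IN = """
    SELECT DISTINCT attribute_type FROM ctattribute_type
    WHERE attribute_type NOT IN (SELECT attribute_type FROM attributes)
    ORDER BY attribute_type
"""

# Anti-join written with NOT EXISTS; NULLs in attributes cannot hide results.
NOT_EXISTS = """
    SELECT DISTINCT ct.attribute_type FROM ctattribute_type ct
    WHERE NOT EXISTS (
        SELECT 1 FROM attributes a WHERE a.attribute_type = ct.attribute_type
    )
    ORDER BY ct.attribute_type
"""

unused_not_in = [row[0] for row in conn.execute(NOT_IN)]
unused_not_exists = [row[0] for row in conn.execute(NOT_EXISTS)]
print(unused_not_in)      # [] because the NULL makes every NOT IN test unknown
print(unused_not_exists)  # ['keywords', 'tooth length']
```

If the column is declared NOT NULL, both forms return the same rows and the original query is fine as written.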

Can we clean up or document the rest of the list? I think about half of it's misplaced "reproductive data" and the other half is something about parasites...

Jegelewicz commented 5 years ago

"water temperature @ 183" and "air temperature @ 183"

HMMMM. These "environmental attributes" seem like they belong with the collecting event... This is probably going to become a larger issue when the UTEP ants start going in. There is A LOT of environmental data with them (soil moisture, soil type, soil temperature).

I am overwhelmed with taxonomy, geology, and locality right now and not sure who else has time to take this on. If no one is worried about it right now, I say we put it on the back burner until some of the problems people are actively working on are resolved. @dustymc do you need it resolved for something pressing?

dustymc commented 5 years ago

Yes, some clearly need to be moved elsewhere if we ever get a better home for them. I am absolutely fine with using Attributes as a staging area; most anything can be denormalized there. I'm fine with obscure attributes - if there's a real chance someone's going to ask hard questions of tail base width then we should absolutely keep it, even if it's only used 17 times per decade. (But it needs to be documented - maybe those 17 determinations also represent 17 techniques and these data are therefore completely useless.) I think the temperature attributes are just new - not a problem. I am not OK with having many ways of doing something - of presenting confounded and unusable data to the world - and from here that's what a bunch of this looks like.

Not pressing; I was just answering an email regarding attribution and became overwhelmingly re-appalled at the attribute row of https://docs.google.com/spreadsheets/d/1ElIuKfljO48gaosR7Ml1irSzPdjxiOmhId81ybsrfMQ/edit#gid=432374024
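
The "17 determinations, 17 techniques" worry could be checked by counting distinct determination methods per attribute type. The sketch below assumes the attributes table has a `determination_method` column, which this thread does not confirm. It runs against an in-memory SQLite database from Python, and the rows and the `method_spread` helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# determination_method is an assumed column name; the rows are invented.
conn.execute("CREATE TABLE attributes (attribute_type TEXT, determination_method TEXT)")
conn.executemany(
    "INSERT INTO attributes VALUES (?, ?)",
    [
        ("tail base width", "calipers"),
        ("tail base width", "ruler"),
        ("tail base width", None),
        ("weight", "scale"),
        ("weight", "scale"),
    ],
)


def method_spread(conn):
    """Per attribute type: total uses, distinct methods, and rows with no method recorded."""
    return conn.execute(
        """
        SELECT attribute_type,
               COUNT(*) AS uses,
               COUNT(DISTINCT determination_method) AS methods,
               SUM(determination_method IS NULL) AS undocumented
        FROM attributes
        GROUP BY attribute_type
        ORDER BY attribute_type
        """
    ).fetchall()


for row in method_spread(conn):
    print(row)
```

An attribute type whose method count approaches its use count, or whose rows mostly lack a method, would be a candidate for the "confounded and unusable" pile.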

Jegelewicz commented 5 years ago

Agree with all of the above! Any chance we can ping the person who created each attribute without a definition to get them to supply one?

dustymc commented 5 years ago

"ping the person who created each attribute without a definition"

Not really - we started tracking who's creating authorities and requiring definitions at about the same time. I can do this though:


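-- For each attribute type with no description, list the collections using it and how often.
-- (In SQL*Plus, run "set serveroutput on" first so the dbms_output lines are displayed.)
begin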
  for r in (select distinct attribute_type from ctattribute_type where description is null order by attribute_type) loop
    dbms_output.put_line(r.attribute_type);
    for c in (
      select 
        guid_prefix, 
        count(*) cnt 
      from 
        attributes, 
        cataloged_item,
        collection 
      where 
        attributes.collection_object_id=cataloged_item.collection_object_id and 
        cataloged_item.collection_id=collection.collection_id and 
        attributes.attribute_type=r.attribute_type 
      group by 
        guid_prefix 
    ) loop
      dbms_output.put_line('    ' ||  c.guid_prefix || ' @ ' || c.cnt);
    end loop;
  end loop;
end;
/

axillary girth
    UAM:Mamm @ 1617
    MSB:Mamm @ 1
bill length
    CHAS:Bird @ 60
    UWYMV:Bird @ 2
    UCM:Bird @ 52
    UTEP:Bird @ 1
brood patch
    MLZ:Bird @ 1
    DMNS:Bird @ 1
bursa width
carapace length
    UTEP:Herp @ 1
    UWBM:Herp @ 16
    MVZ:Herp @ 10
caste
    CHAS:Ento @ 8
    UAM:Ento @ 26
    KNWR:Ento @ 2
clutch size
    MVZ:Egg @ 14839
    UCM:Egg @ 175
    MVZ:Bird @ 1
    MLZ:Egg @ 26
    DMNS:Egg @ 6793
crown-rump length
    UAM:Mamm @ 341
    DMNS:Mamm @ 111
    NMU:Mamm @ 30
    MSB:Mamm @ 1362
    UCM:Mamm @ 16
    MVZ:Mamm @ 3
curvilinear length
    UAM:Mamm @ 1754
diploid number
    UAM:Herb @ 456
    UTEP:Herb @ 18
ear from crown
    UNR:Mamm @ 2
    UAM:Mamm @ 31
    DMNS:Mamm @ 100
    UCM:Mamm @ 514
    MVZ:Mamm @ 270
    MSB:Mamm @ 2
egg content weight
    NBSB:Bird @ 314
eggshell thickness
    NBSB:Bird @ 73
embryo weight
forewing length
gonad
    DMNS:Bird @ 1
head width
    UCM:Herp @ 18
    MVZ:Herp @ 149
    UWBM:Herp @ 2
hind foot without claw
    UTEP:Mamm @ 1
    UAM:Mamm @ 10
    DMNS:Mamm @ 94
    MSB:Mamm @ 413
hind limb length
    MVZ:Herp @ 83
incubation stage
    CHAS:Egg @ 1838
    MVZ:Egg @ 14808
    UCM:Egg @ 245
    MLZ:Egg @ 24
    DMNS:Egg @ 10
middle toe length
number of labels
    UAM:Herb @ 113021
    UAMb:Herb @ 7
ovum
    UWYMV:Bird @ 1
    DMNS:Bird @ 6
parts examined
    MSB:Bird @ 1
    MSB:Host @ 770
plastron length
    UWBM:Herp @ 3
    MVZ:Herp @ 6
    MSB:Herp @ 6
snout-vent length
    UTEPObs:Herp @ 2
    DMNS:Herp @ 1
    UAM:Herp @ 4
    UTEP:Herp @ 3998
    UCM:Obs @ 6
    UTEP:HerpOS @ 580
    UWYMV:Herp @ 1
    UCM:Herp @ 92
    MVZObs:Herp @ 2
    MVZ:Herp @ 17749
    MSB:Herp @ 760
    UWBM:Herp @ 4398
soft part color
    UWYMV:Bird @ 24
    CHAS:Bird @ 264
    MSB:Bird @ 1826
    UCM:Bird @ 48
    MVZ:Bird @ 1
    DMNS:Bird @ 85
    MLZ:Bird @ 253
soft parts
    UWYMV:Bird @ 155
    CHAS:Bird @ 66
    MSB:Bird @ 10430
    MVZ:Bird @ 20107
standard length
    MSB:Fish @ 19893
    UAM:Fish @ 1
    UCM:Fish @ 6
    UAMObs:Fish @ 1
tail base width
    MVZ:Herp @ 17
tail condition
    UTEP:Herp @ 2
    UNR:Herp @ 3
    MVZ:Herp @ 2141
    UWBM:Herp @ 28
wing span
    UAM:Mamm @ 2
    UCM:Mamm @ 138
year class
    UCM:Fish @ 4

PL/SQL procedure successfully completed.

See also https://github.com/ArctosDB/arctos/issues/1450

dustymc commented 5 years ago

ref: https://github.com/ArctosDB/data-migration/issues/71

Consider a "random structured non-core data" attribute to hold things like

[
   {
      "modified_by":"whoever",
      "modified_date":"whenever"
   }
]

which can do ~anything two new attributes can do.
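
A rough sketch of what that could look like in practice: the JSON is stored as the text value of a single attribute and parsed on the way out. Everything here is an assumption made for illustration, including the attribute name "structured non-core data", the `attribute_value` column, and the `structured_values` helper. It uses an in-memory SQLite database from Python, not the Arctos schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attributes ("
    "collection_object_id INTEGER, attribute_type TEXT, attribute_value TEXT)"
)

# Hypothetical attribute name; the payload mirrors the JSON example above.
ATTRIBUTE_NAME = "structured non-core data"
payload = [{"modified_by": "whoever", "modified_date": "whenever"}]
conn.execute(
    "INSERT INTO attributes VALUES (?, ?, ?)",
    (1, ATTRIBUTE_NAME, json.dumps(payload)),
)


def structured_values(conn, collection_object_id):
    """Return the parsed JSON payloads stored for one cataloged item."""
    rows = conn.execute(
        "SELECT attribute_value FROM attributes "
        "WHERE collection_object_id = ? AND attribute_type = ?",
        (collection_object_id, ATTRIBUTE_NAME),
    )
    return [json.loads(value) for (value,) in rows]


print(structured_values(conn, 1)[0][0]["modified_by"])  # whoever
```

The trade-off is that keys inside the JSON are invisible to ordinary attribute searches unless something parses them, so this moves the "too many attributes" problem rather than removing it.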

dustymc commented 3 years ago

removed some unused, closing for lack of interest