gbif / portal16

GBIF.org website
https://www.gbif.org
Apache License 2.0

Simple/frequent filters #1905

Open MortenHofft opened 5 months ago

MortenHofft commented 5 months ago

Removing the simple/advanced toggle proved very unpopular within the secretariat. So I will add that back. It also prompted a discussion about what filters to include in the simple type.

Which simple filters should we show on occurrence search?

@jhnwllr you know which filters are used most frequently on downloads. It seems reasonable to assume that this reflects which filters are most frequently used in the UI as well.

Do others have ideas about which filters should be the simple ones?

dnoesgaard commented 5 months ago

Can we learn something from web analytics?

Frequency of /search? parameters?

(point being that far from all searches lead to downloads)
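A minimal sketch of what "frequency of /search? parameters" could look like, assuming the search URLs are available as plain strings. The function name `count_parameters` and the second sample URL are hypothetical; a real analysis would read URLs from the web analytics or log data instead.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def count_parameters(urls):
    """Count in how many URLs each query-parameter name appears."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # keep_blank_values=True so that an empty "q=" is still counted.
        # A set is used so a repeated parameter counts once per URL.
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


# Illustrative input only; the second URL is made up for the example.
sample = [
    "https://www.gbif.org/occurrence/search?occurrence_status=present&q=",
    "https://www.gbif.org/occurrence/search?taxon_key=212&country=DK&country=SE",
]

counts = count_parameters(sample)
for name, n in counts.most_common():
    print(name, n)
```

Counting each parameter once per URL measures how many searches use a filter, rather than how many values were selected for it.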

jhnwllr commented 5 months ago

@MortenHofft I haven't looked at which filters are used most frequently in a while.

I did this way back in 2018. We could re-cook something similar if it is important. https://gbif.blogspot.com/

MortenHofft commented 5 months ago

Thanks @jhnwllr - it is probably the same as back then. The quick check below suggests that it is.

Based on simply looking at the last few hundred downloads, it looks to be roughly:
taxonKey, hasCoordinate, hasGeospatialIssue
country, continent, gadm (very rarely a geom filter)
year, month
basisOfRecord

We could then add license, occurrenceStatus and issues simply because we believe they are important. Dataset and publisher are very rarely used: I didn't see a single case in the first few hundred downloads. Secretariat use might be very different, though.

MortenHofft commented 5 months ago

The current list of simple filters is long, and most of them aren't popular filters: https://www.gbif.org/occurrence/search?occurrence_status=present&q=

But they are there because someone at some point decided that they were very important. Sometimes this was prompted by "real" user feedback (like type), other times it came from the secretariat (e.g. license, iucn, occurrenceStatus).

MortenHofft commented 5 months ago

Personally, my gut feeling (somewhat backed by data) is to reduce simple to:
occurrenceStatus (educational)
license (educational)
taxonKey
year
month
Location: country, continent, gadm
dataset, publisher (mainly for publishers and secretariat I assume, but also teach users about where data comes from)
basisOfRecord (educational)
issues (educational)

MattBlissett commented 5 months ago

See https://techdocs.gbif.org/en/informatics/web-logs to query the logs.

Very quickly:

buckets 536371
stateProvince   547530
eventKey        632725
publishing_country      641444
TYPE_STATUS     679099
TAXON_KEY       679154
publishingCountry       681141
year.facetLimit 698770
verbose 701853
gadmGid 721490
coordinate_uncertainty_in_meters        734653
coordinateUncertaintyInMeters   826350
license 854881
event_date      903064
/occurrence/search      918362
establishmentMeans      936325
publishingOrg   942410
hosting_organization_key        957755
SpeciesKey      967576
name_type.facetLimit    1044787
geom    1205445
species_key     1279380
recordedBy      1661021
country.facetLimit      1711170
orderKey        1904599
gadm_gid        1965072
publishing_org  2299395
month   2700218
occurrenceID    2953486
collectionCode  3420397
mediatype       4561405
continent       4702938
secondDimension 4724621
isGeoreferenced 4811250
lastInterpreted 4942254
type    5759946
depth   6497426
basis_of_record 6537020
type_status.facetLimit  7625427
coordinatestatus        8235651
month.facetLimit        8524763
occurrenceId    8912346
scientificname  9490116
geoDistance     9652000
issue.facetLimit        9849302
institutionCode 10668968
basisofrecord   10804202
facetMultiselect        11608639
cachebust       11917492
kingdomKey      12566814
basisOfRecord   15630483
mediaType       15840944
typeStatus      17233716
issue   18447789
dwca_extension.facetLimit       20673033
facetOffset     21904364
catalogNumber   22396800
q       23072408
locale  27204390
decimalLatitude 27623668
decimalLongitude        28074966
eventId 29015063
advanced        30323195
event_id        32839495
speciesKey      35983120
modified        40400665
occurrence_status       45942714
year    58286363
geometry        59074028
eventDate       59617662
scientificName  64556531
hasCoordinate   72938097
hasGeospatialIssue      81299124
facetLimit      90144576
facet   93546298
occurrenceStatus        100666456
country 139312832
datasetKey      152112174
has_geospatial_issue    171145584
dataset_key     176426102
has_coordinate  184099638
offset  235623329
media_type      433701395
taxonKey        444973495
taxon_key       448807411
limit   1167973788

I don't know how useful this is. I'm querying the API; maybe querying the portal would be better, but does that have a hit in Varnish for every search, or is there JavaScript magic?
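The list above counts spelling variants of the same parameter separately (for example taxonKey, taxon_key and TAXON_KEY). A rough sketch of folding them together, assuming that case and underscores are the only differences between variants; the helper names `normalise` and `merge_variants` are hypothetical, and the numbers are copied from the list above.

```python
from collections import Counter


def normalise(name):
    """Fold case and drop underscores so taxonKey, taxon_key and TAXON_KEY match."""
    return name.replace("_", "").lower()


def merge_variants(raw_counts):
    """Sum the counts of all spelling variants of each parameter."""
    merged = Counter()
    for name, n in raw_counts.items():
        merged[normalise(name)] += n
    return merged


# A few rows taken from the list above.
raw = {
    "TAXON_KEY": 679154,
    "taxonKey": 444973495,
    "taxon_key": 448807411,
    "hasCoordinate": 72938097,
    "has_coordinate": 184099638,
}

merged = merge_variants(raw)
for name, n in merged.most_common():
    print(name, n)
```

Technical parameters such as limit, offset and facet would still dominate the totals and would need to be excluded separately before ranking filters.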

MortenHofft commented 5 months ago

Thanks Matt. That includes all the requests the portal does to generate the pages, charts etc. I could imagine that will skew the results. I suppose that ideally we only look at parameters for https://www.gbif.org/occurrence/[search/map/gallery]
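A sketch of that restriction, assuming each log entry can be reduced to a full URL. The function name `is_portal_search` is hypothetical, the test URLs other than the occurrence search link quoted earlier are made up, and any path prefix (such as a locale segment) is not handled here.

```python
import re
from urllib.parse import urlsplit

# Only the three occurrence views named above; everything else is ignored.
PORTAL_PATH = re.compile(r"^/occurrence/(search|map|gallery)/?$")


def is_portal_search(url):
    """Return True for www.gbif.org occurrence search, map or gallery pages."""
    parts = urlsplit(url)
    return parts.netloc == "www.gbif.org" and bool(PORTAL_PATH.match(parts.path))


urls = [
    "https://www.gbif.org/occurrence/search?occurrence_status=present&q=",
    "https://www.gbif.org/occurrence/gallery?taxon_key=1",    # illustrative
    "https://www.gbif.org/occurrence/download",               # illustrative
    "https://api.gbif.org/v1/occurrence/search?taxonKey=1",   # illustrative
]

portal_only = [u for u in urls if is_portal_search(u)]
print(portal_only)
```

Filtering on host and path before counting parameters keeps requests made for charts, facets and other page components out of the totals.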

MattBlissett commented 5 months ago

I think some parameters are truncated as there's a limit to the length of the query string that's logged.

depth   108
face    109
eventdate       112
taxo    113
protocol        116
display 116
origin  122
tax     122
networkkey      123
rank    126
ampoccurrencestatus     127
elevation       129
lifestage       135
collectionkey   139
taxonk  141
all     145
ampadvanced     152
occurrenc       157
hostingorganizationkey  162
ampq    164
organismid      165
occurrencestatu 166
occurrence      172
coun    180
occurrencesta   181
basisof 183
programmeid     184
type    185
hascoordinat    199
locality        201
hascoordina     217
seconddimension 221
spatialissues   226
cachebust       237
gbifid  241
yea     251
basi    273
typestatus.facetlimit   276
mdrv    284
recordedbyid    319
hasgeospatialissu       335
month.facetlimit        338
taxon   343
hascoordi       366
utmcampaign     392
utmmedium       397
amp     402
scientificname  433
license 463
fbclid  503
issue.facetlimit        524
ref     556
occurrenceid    557
dwcaextension.facetlimit        562
projectid       576
        586
status  589
highertaxonkey  597
dimension       601
stateprovince   606
amptaxonkey     621
boundingbox     629
h       647
lang    820
coordinateuncertaintyinmeters   833
institutionkey  880
repatriated     911
utmsource       958
iucnredlistcategory     997
nonse   1012
verbatimscientificname  1018
facetmultiselect        1033
taxonke 1058
isincluster     1114
recordnumber    1124
hasgeospatialis 1227
facet   1238
hasgeospatialiss        1244
hasgeospatiali  1391
dwcaextension   1450
gbifdatasetkey  1550
typestatus      1573
hasge   1698
month   1784
eventid 2278
gbiftaxonkey    2608
institutioncode 2967
t       3099
path    3248
hasgeosp        3303
hasgeospat      3373
recordedby      3530
catalognumber   4468
collectioncode  4470
contenttype     4763
publishingcountry       5290
mediatype       5972
ha      6368
continent       6504
has     7215
hasgeospati     7271
hasgeospatia    7389
issue   7841
occ     8779
v       9760
hasgeospatial   11159
hasgeos 11427
hasgeo  12148
limit   13207
gadmgid 14095
hasg    23598
basisofrecord   25141
hasgeospa       26193
year    30686
offset  52660
advanced        62530
country 75962
datasetkey      120749
publishingorg   129824
hash    132148
locale  153199
q       257436
hasgeospatialissue      299560
geometry        360734
occurrencestatus        442444
hascoordinate   446916
taxonkey        584007
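One way to deal with the truncated names is to attribute a fragment to a full parameter name only when it is a prefix of exactly one known name. This is a sketch under that assumption: the `known` set and the helper names are hypothetical, and the counts are copied from the list above.

```python
def resolve_truncated(name, known):
    """Return the full parameter name for a fragment, or None if ambiguous."""
    if name in known:
        return name
    matches = [k for k in known if k.startswith(name)]
    return matches[0] if len(matches) == 1 else None


def merge_truncated(counts, known):
    """Add fragment counts to their full name; keep ambiguous ones apart."""
    merged, unresolved = {}, {}
    for name, n in counts.items():
        full = resolve_truncated(name, known)
        target = merged if full else unresolved
        key = full or name
        target[key] = target.get(key, 0) + n
    return merged, unresolved


# Assumed set of complete names; a real run would use the full filter list.
known = {"hasgeospatialissue", "hascoordinate", "taxonkey"}

# A few rows taken from the list above.
counts = {
    "hasgeospatialissue": 299560,
    "hasgeospatial": 11159,
    "hasgeos": 11427,
    "hasg": 23598,
    "has": 7215,
    "hascoordinate": 446916,
    "hascoordinat": 199,
}

merged, unresolved = merge_truncated(counts, known)
print(merged)
print(unresolved)
```

Fragments such as has match more than one name and stay unresolved, so the merged totals are lower bounds. A short complete name that is also a prefix of a longer one (type and typestatus, for instance) cannot be separated this way.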

MortenHofft commented 5 months ago

Thanks! That looks pretty much as expected from the guessing above, I would say. The biggest surprise is that repatriated is being used at all.