gbif / content-crawler

Crawls CMS and articles from Mendeley into ElasticSearch indexes
Apache License 2.0

Suggestion: index taxonKey tag from Mendeley #35

Closed dnoesgaard closed 2 years ago

dnoesgaard commented 3 years ago

In an attempt to categorize literature by taxon, I've started tagging papers using gbifTaxon:<taxonKey>, e.g., b688f91b-8f9f-39e4-a378-6d9375247da8:

"tags": [
"2021",
"Agronomy and Crop Science",
"GBIF_cited",
"HU",
"Horticulture",
"Plant Science",
"Species_distributions",
"citation_type:generic",
"gbifTaxon:7202218",
"open_access:true",
"peer_review:true"
]

If we could make the crawler add this as a field to the ES index, we could start featuring literature on species pages, etc.

"gbifTaxonKey": [
"7202218"
]
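
As a rough illustration of the idea above, here is a minimal Python sketch of pulling the keys out of a tag list. The helper name `extract_tag_values` is hypothetical and is not taken from the crawler's code; the only thing assumed from the thread is the `gbifTaxon:<taxonKey>` tag format.

```python
def extract_tag_values(tags, prefix):
    """Return the values of all tags shaped like '<prefix>:<value>', in order."""
    marker = prefix + ":"
    return [tag[len(marker):] for tag in tags if tag.startswith(marker)]


# A shortened version of the tag list from the example document above.
tags = [
    "2021",
    "GBIF_cited",
    "citation_type:generic",
    "gbifTaxon:7202218",
    "open_access:true",
]

# Hypothetical fragment of the document that would be sent to the index.
doc_fields = {"gbifTaxonKey": extract_tag_values(tags, "gbifTaxon")}
print(doc_fields)  # {'gbifTaxonKey': ['7202218']}
```

Keys are kept as strings here, matching the proposed field above; whether the index should store them as numbers is left open.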

(@MortenHofft, you might also have thoughts on this)

MortenHofft commented 3 years ago

It makes sense to me. It would also make it more clear what to do with those citations of species pages.

I would like to add gbifOccurrenceKey: [] as well, for those cases where someone cites a few individual occurrences (for example, in a taxonomic treatment). If I understand correctly, we currently only count those at dataset level. Counting at dataset level makes perfect sense for large downloads, but when a paper cites a few individual occurrences, it would be nice to capture them, as each occurrence probably plays a larger role.

// in mendeley
[
  "gbifOccurrence:2247859888"
]

would add

// in literature index
"gbifOccurrenceKey": [
  "2247859888"
]

dnoesgaard commented 3 years ago

I've added gbifOccurrence tags for a few papers now:

91710ee8-d590-3953-a6e9-4cfdc608e5da
51974777-846f-335a-8d6a-687d85a5714e (Edit by morten: nice example)
24412bdf-599e-3c60-ae9d-1d72d557772c
f9ef5a36-cbd8-3a76-a0f9-3d3262070969

gbifTaxon has already been applied to ~1,500 papers

dnoesgaard commented 3 years ago

(for my own sake, here's how to easily pull these from ES using wildcards)

% curl --location --request GET 'cms-search.gbif.org:9200/_search' \
--header 'Content-type: application/json' \
--data-raw '{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "gbifOccurrence*",
            "fields": [
              "tags"
            ]
          }
        }
      ]
    }
  }
}'

MortenHofft commented 3 years ago

When indexing the taxonKeys I suggest we also resolve the higher ranks and add those. That would make it possible to search for all papers about e.g. a family (and not just papers about specific species).

taxonKey: [456,789],
allTaxonKeys: [456,789,1,2,3,4,5,6,10,11,12,13,14,15,16] // including the leaves for convenience, I suppose (similar to the occurrence index). I'm not sure what a good name for this field is
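
A minimal Python sketch of that expansion, assuming some lookup that returns the higher-rank keys for a given taxon key. The function `expand_taxon_keys`, the `PARENTS` table and its numbers are all made up for illustration; a real crawler would presumably resolve parents against the GBIF backbone instead.

```python
def expand_taxon_keys(taxon_keys, get_parent_keys):
    """Return the given keys plus all their higher-rank keys, without duplicates.

    Order of first appearance is preserved so the result is deterministic.
    """
    seen = set()
    all_keys = []
    for key in taxon_keys:
        for candidate in [key, *get_parent_keys(key)]:
            if candidate not in seen:
                seen.add(candidate)
                all_keys.append(candidate)
    return all_keys


# Hypothetical stand-in for a backbone lookup: taxon key -> higher-rank keys.
PARENTS = {
    456: [1, 2, 3, 4, 5, 6],
    789: [1, 2, 3, 10, 11],
}


def lookup_parents(key):
    # Unknown keys simply have no parents in this toy table.
    return PARENTS.get(key, [])


fields = {
    "taxonKey": [456, 789],
    "allTaxonKeys": expand_taxon_keys([456, 789], lookup_parents),
}
print(fields["allTaxonKeys"])  # [456, 1, 2, 3, 4, 5, 6, 789, 10, 11]
```

Passing the lookup in as a function keeps the expansion logic independent of where the classification comes from.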

dnoesgaard commented 3 years ago

To summarize, we are suggesting adding (at least) three new fields to the index, based on tags in Mendeley:

gbifTaxon -> gbifTaxonKey: [] (+field with higher taxa resolved)
gbifOccurrence -> gbifOccurrenceKey: []
gbifFeature -> gbifFeatureId: []
citation_type -> citationType

(perhaps some considerations around nomenclature for these fields is necessary)
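
The mapping summarized in this comment could be sketched as a small lookup table in Python. This is only an illustration: `TAG_TO_FIELD` and `tags_to_fields` are hypothetical names, the field names are the provisional ones from this thread, the feature ID in the example is a placeholder, and storing every field as a list (including citationType) is an assumption made for simplicity.

```python
# Mendeley tag prefix -> proposed index field (names as suggested in this thread).
TAG_TO_FIELD = {
    "gbifTaxon": "gbifTaxonKey",
    "gbifOccurrence": "gbifOccurrenceKey",
    "gbifFeature": "gbifFeatureId",
    "citation_type": "citationType",
}


def tags_to_fields(tags):
    """Collect the values of recognized '<prefix>:<value>' tags per index field."""
    fields = {}
    for tag in tags:
        prefix, sep, value = tag.partition(":")
        field = TAG_TO_FIELD.get(prefix)
        # Skip plain tags, unknown prefixes, and tags with an empty value.
        if sep and field and value:
            fields.setdefault(field, []).append(value)
    return fields


example_tags = [
    "GBIF_cited",
    "citation_type:generic",
    "gbifTaxon:7202218",
    "gbifOccurrence:2247859888",
    "gbifFeature:PLACEHOLDER_ID",  # placeholder, not a real Contentful ID
    "open_access:true",
]
print(tags_to_fields(example_tags))
```

Tags with other prefixes, such as open_access, are left alone and would stay available through the existing tags field.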

dnoesgaard commented 3 years ago

Oh, and while we're at it, can we add this one too?

citation_type -> citationType

fmendezh commented 3 years ago

@dnoesgaard does gbifFeatureId hold IDs of Contentful content? Is citation_type a controlled enum/vocabulary that you use?

dnoesgaard commented 3 years ago

For clarity, the "gbifFeature" tag contains a Contentful identifier of a related data use case, allowing the linkage of literature to a GBIF feature of that paper. The "citation_type" tag is used to indicate how a literature item cites GBIF data (e.g. DOI, generic), but it's not entirely controlled (but it probably could be).

MortenHofft commented 3 years ago

(perhaps some considerations around nomenclature for these fields is necessary)

citation_type -> citationType - why is this snake_case and the others camelCase?

gbifFeatureId: is it called feature because it applies to more than just dataUse stories? Could it be any contentful item?

dnoesgaard commented 3 years ago

I appear to have used mostly snake_case in Mendeley tags, but obviously the ES index should use whatever we prefer.

"gbifFeatureId" is just a name but the intention is to link to dataUse items only.

dnoesgaard commented 3 years ago

For citation_type, I believe it could be controlled using

(the latter being when a paper doesn't cite a DOI but provides one when contacted)