Add total occurrence count to literature index

gbif / content-crawler

Crawls CMS and articles from Mendeley into ElasticSearch indexes

Apache License 2.0


Add total occurrence count to literature index #59

Open MortenHofft opened 4 months ago

MortenHofft commented 4 months ago

Suggestion: Add the sum of occurrences in the various downloads associated with this paper to the index. This could be a useful indicator of relevance.

Reason: Take, for example, the paper "Gentiana kurroo Royle (Gentianaceae), a highly medicinal, critically endangered and endemic species of the Western Himalayas with restricted distribution in India and Pakistan", whose authors downloaded 2.6 billion records.

The paper seemingly deals with a fairly narrow subject, but the authors haven't added any filters aside from a presence-only filter.
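A minimal sketch of the suggested field, assuming each literature entry carries a list of its associated downloads and that each download exposes a record count. The names `downloads`, `record_count`, and `occurrence_count` are hypothetical and not taken from the actual index schema.

```python
from typing import Any, Iterable, Mapping


def total_occurrence_count(downloads: Iterable[Mapping[str, Any]]) -> int:
    """Sum the record counts of every download associated with one paper.

    Downloads with a missing or null count contribute 0.
    """
    return sum(d.get("record_count") or 0 for d in downloads)


# Hypothetical literature entry; field names are assumptions for illustration.
paper = {
    "title": "Example paper",
    "downloads": [
        {"key": "download-a", "record_count": 1200},
        {"key": "download-b", "record_count": 350},
        {"key": "download-c", "record_count": None},  # count unknown
    ],
}

# The value that would be stored alongside the paper in the index.
paper["occurrence_count"] = total_occurrence_count(paper["downloads"])
```

Treating an unknown count as 0 is a design choice of this sketch; it makes the sum a lower bound when some download counts are missing.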

dnoesgaard commented 4 months ago

I'm not sure I follow the logic.

You would like literature index entries to have a field that is the sum of number of occurrences in all downloads cited?

So for the paper mentioned, this would be 2.6 B, but as you also point out, it's clear that they didn't actually use that many records?

MortenHofft commented 4 months ago

Yes, which makes it a far less interesting paper to look at for publishers who want to know how their data is used. It could perhaps provide an easier way to look at a title and the occurrence count and evaluate "is this paper really using my data in any interesting way?"

The idea is that it would be a very simple thing to add, and it would help me as a data publisher in evaluating relevance.

It is related to this user question btw: https://github.com/gbif/content-crawler/issues/58

dnoesgaard commented 4 months ago

But how would showing 2.6 B help in making this distinction when the actual number of used records is clearly much lower?

MortenHofft commented 4 months ago

I'm making this up but:

It is just an indicator that it probably isn't at all relevant for my collection. Just a lower probability. And even if they did use all 2.6 billion, then my data is less essential. I'm interested in those papers that couldn't have been written without my data. I care most about the small downloads.
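One way a publisher might use such a field, sketched under the assumption that each literature entry already has a hypothetical `occurrence_count` value: sort the papers so that the smallest downloads come first, and push entries without a count to the end.

```python
from typing import Any, Dict, List


def rank_small_downloads_first(papers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Order papers by ascending occurrence count; unknown counts go last."""
    return sorted(
        papers,
        key=lambda p: (
            p.get("occurrence_count") is None,  # False sorts before True
            p.get("occurrence_count") or 0,
        ),
    )


# Hypothetical entries; titles and counts are made up for illustration.
papers = [
    {"title": "Broad download", "occurrence_count": 2_600_000_000},
    {"title": "No count yet", "occurrence_count": None},
    {"title": "Targeted download", "occurrence_count": 4_200},
]

ranked_titles = [p["title"] for p in rank_small_downloads_first(papers)]
```

This only reorders by download size. As noted in the thread, a small count is an indicator of likely relevance, not proof of it.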