gbif / content-crawler

Crawls CMS and articles from Mendeley into ElasticSearch indexes
Apache License 2.0

Sorting by number of records for a given dataset or publisher #58

Open MortenHofft opened 2 months ago

MortenHofft commented 2 months ago

I'm not sure how this could be done in a way that performs well, but there has been a request to sort results by relevance for a given publisher or dataset, e.g. by how many records from a given publisher were downloaded for the data used by that paper.

dnoesgaard commented 2 months ago

I'm sure this would be insanely expensive, but each literature entry could have "derived dataset"-like metadata, adding counts and perhaps fractions to the metadata, e.g.

from

"gbifDatasetKey": [
    "4fa7b334-ce0d-4e88-aaae-2e0c138d049e",
    "38b4c89f-584c-41bb-bd8f-cd1def33e92f",
    "8a863029-f435-446a-821e-275f4f641165",
    etc.

to

{
    "gbifDatasetKey": {
        "4fa7b334-ce0d-4e88-aaae-2e0c138d049e": {
            "count": 67045764,
            "fraction": 0.693
        },
        "3b894fe4-c13c-4a04-b372-4e749ce102e1": {
            "count": 5753111,
            "fraction": 0.0594
        },
        "8a863029-f435-446a-821e-275f4f641165": {
            "count": 3107077,
            "fraction": 0.0321
        }
    }
}
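
For illustration only, here is a rough Python sketch of how that structure could be derived and then used for sorting. It assumes the per-dataset record counts for a literature entry are already available as a plain dict; where those counts would come from is not covered here. The function names `derive_dataset_metadata` and `sort_by_dataset_count` are made up for this example and are not part of the crawler.

```python
from typing import Dict, List


def derive_dataset_metadata(counts: Dict[str, int], ndigits: int = 4) -> dict:
    """Turn {datasetKey: recordCount} into the count/fraction structure above.

    Hypothetical helper: the input dict is assumed to be computed elsewhere.
    """
    total = sum(counts.values())
    derived = {}
    for key, count in counts.items():
        # Fraction of all records used by this entry that came from this dataset.
        fraction = round(count / total, ndigits) if total else 0.0
        derived[key] = {"count": count, "fraction": fraction}
    return {"gbifDatasetKey": derived}


def sort_by_dataset_count(entries: List[dict], dataset_key: str) -> List[dict]:
    """Sort literature entries by record count for one dataset, largest first.

    Entries that do not use the dataset are treated as count 0 and sort last.
    """
    def count_for(entry: dict) -> int:
        return entry.get("gbifDatasetKey", {}).get(dataset_key, {}).get("count", 0)

    return sorted(entries, key=count_for, reverse=True)


if __name__ == "__main__":
    # Toy counts with placeholder keys, not real GBIF data.
    paper_a = derive_dataset_metadata({"dataset-1": 3, "dataset-2": 1})
    paper_b = derive_dataset_metadata({"dataset-1": 10})
    ranked = sort_by_dataset_count([paper_a, paper_b], "dataset-1")
    print([e["gbifDatasetKey"]["dataset-1"]["count"] for e in ranked])
```

Sorting in memory like this only shows the idea; in an index the count would presumably have to be stored in a form the search engine can sort on, which is where the cost concern comes in.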

This would then also have to be done per publisher... 🤯
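
A per-publisher version could, in principle, be rolled up from the per-dataset counts. The sketch below assumes a lookup from dataset key to publisher key is available as a dict; that lookup, the function name `derive_publisher_metadata`, and the field name `publishingOrganizationKey` are all assumptions made for this example, not something the index is known to contain.

```python
from collections import defaultdict
from typing import Dict


def derive_publisher_metadata(
    dataset_counts: Dict[str, int],
    dataset_to_publisher: Dict[str, str],
    ndigits: int = 4,
) -> dict:
    """Roll per-dataset record counts up to per-publisher counts and fractions.

    Hypothetical helper: both input dicts are assumed to be provided by the caller.
    """
    # The denominator is every record used by the entry, mapped or not.
    total = sum(dataset_counts.values())

    per_publisher = defaultdict(int)
    for dataset_key, count in dataset_counts.items():
        publisher_key = dataset_to_publisher.get(dataset_key)
        if publisher_key is None:
            continue  # dataset with no known publisher: left out of the result
        per_publisher[publisher_key] += count

    return {
        # Assumed field name, chosen only for this sketch.
        "publishingOrganizationKey": {
            key: {
                "count": count,
                "fraction": round(count / total, ndigits) if total else 0.0,
            }
            for key, count in per_publisher.items()
        }
    }


if __name__ == "__main__":
    # Toy data with placeholder keys.
    counts = {"dataset-1": 6, "dataset-2": 2, "dataset-3": 2}
    publishers = {
        "dataset-1": "publisher-A",
        "dataset-2": "publisher-A",
        "dataset-3": "publisher-B",
    }
    print(derive_publisher_metadata(counts, publishers))
```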