gbif / hp-living-norway

Living Norway Ecological Data Network is facilitating FAIR management of ecological data to the benefit of society and science; https://livingnorway.no/

Get stats from gbif graphql api #23

Closed: siwelisabeth closed this issue 2 years ago

siwelisabeth commented 2 years ago

Hi @MortenHofft

@frafra and I have been looking into GraphQL and have some questions about it. I am adding the issue here in our repo, as I was not sure whether it belongs under the hosted portal issues.

We want to use the stats module on the front page of the Living Norway portal (as exemplified here: https://github.com/gbif/jekyll-hp-base-theme/commit/82348400e256648292533c8977eb07dfee3a5771) and would like to show stats about dataset citations, as is done here: https://www.gbif.org/publisher/46fec380-8e1d-11dd-8679-b8a03c50a862. But instead of showing the number of citations for NINA, we would like to show the number of citations of datasets that belong to the Living Norway network.

We have experimented with the GraphQL API and found that datasetSearch can take networkKey as input and that a literature count is returned. We have also computed the number of unique IDs for literature documents. We are not sure whether the literature count is the same as the number of citations. Maybe you can point us in the right direction here: what is the literature count, and is there a way to get the number of citations from the GraphQL API (or in some other way)?

Below are some examples that illustrate what we have tried to do in GraphQL.

--

Adding up all the literatureCount values from the LivingNorway network gives 1288 (a runnable sketch of this summation follows these examples):

{
  datasetSearch(networkKey: "379a0de5-f377-4661-9a30-33dd844e7b9a") {
    results {
      literatureCount
    }
  }
}

Requesting the list of datasets from the LivingNorway network and then counting all the documents returned by literatureSearch for each dataset gives 366:

{
  datasetSearch(networkKey: "379a0de5-f377-4661-9a30-33dd844e7b9a") {
    results {
      key
    }
  }
}

{
  literatureSearch(gbifDatasetKey: "${datasets[index].key}") {
    documents {
      results {
        id
      }
    }
  }
}

Same query, but counting unique IDs for literature documents: 94.

Same query, but with the publishing organisation set to NINA instead of using LivingNorway as the network: 348.

{
  datasetSearch(publishingOrg: "46fec380-8e1d-11dd-8679-b8a03c50a862") {
    results {
      key
    }
  }
}

NINA citations on gbif.org: 389. https://www.gbif.org/publisher/46fec380-8e1d-11dd-8679-b8a03c50a862
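
For reference, here is a minimal Node sketch of the first experiment above, i.e. summing literatureCount over the network's datasets. It assumes Node 18+ run as an ES module (built-in fetch, top-level await) and uses the staging GraphQL endpoint we have been testing against, which is not an official API:

```js
// Sum literatureCount over all datasets in the LivingNorway network.
// Assumptions: Node 18+ as an ES module; the staging GraphQL endpoint is unofficial and may change.
const query = `{
  datasetSearch(networkKey: "379a0de5-f377-4661-9a30-33dd844e7b9a", limit: 1000) {
    results { literatureCount }
  }
}`; // limit raised above the default page size

const response = await fetch('https://graphql.gbif-staging.org/graphql', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query }),
});
const { data } = await response.json();
const total = data.datasetSearch.results
  .reduce((sum, d) => sum + (d.literatureCount || 0), 0);
console.log('sum of literatureCount:', total);
```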

MortenHofft commented 2 years ago

Hi, the literature API is documented here: https://www.gbif.org/developer/literature. An example query: https://api.gbif.org/v1/literature/search?gbifDatasetKey=ca0d8107-a2bd-47a1-91a1-250179b534ec
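
If you only need the count for a single dataset, you can read it straight off that endpoint by asking for zero results. A minimal sketch, assuming a runtime with fetch and top-level await (e.g. Node 18+ as an ES module):

```js
// Read the number of literature items citing one dataset from the documented
// literature API; limit=0 returns only the total count, no documents.
const datasetKey = 'ca0d8107-a2bd-47a1-91a1-250179b534ec';
const url = `https://api.gbif.org/v1/literature/search?gbifDatasetKey=${datasetKey}&limit=0`;
const { count } = await (await fetch(url)).json();
console.log(`literature items citing dataset ${datasetKey}: ${count}`);
```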

But the API does not allow searching by network (the information is not in the index either).

I think the best option is to open an issue at https://github.com/gbif/content-crawler/issues/new with a request to roll up literature by networks (just as we do for publishers).

PS: adding up citation counts from the individual datasets will be misleading as the same literature/paper will likely be counted more than once.

frafra commented 2 years ago

> PS: adding up citation counts from the individual datasets will be misleading as the same literature/paper will likely be counted more than once.

@MortenHofft I had the same doubt: that is why we fetched all the datasets, then the literature documents for each dataset, and then counted the unique IDs. But that number is much lower than the one reported on the web page, so I am a bit puzzled.

Thanks for the suggestion. I opened a new issue as suggested: https://github.com/gbif/content-crawler/issues/46

MortenHofft commented 2 years ago

> Same query, but counting unique IDs for literature documents: 94.

I'm not sure how you got to 94, but I think that must be due to a typo in your query/counting. I get 418 at least (might contain mistakes as well 😄 )

frafra commented 2 years ago

> I'm not sure how you got to 94, but I think that must be due to a typo in your query/counting. I get 418 at least (might contain mistakes as well 😄 )

94 was an underestimate: the GBIF GraphQL API paginates results by default, which I worked around by passing a very high limit (2**31-1), whereas for the regular API the script simply sets limit to 0 to get the count. Even with that change, the number I get is still very low.

I wrote a simple script to make it easier to reproduce the behaviour:

docker run --rm -i node:17 --experimental-fetch --no-warnings <<'EOF'

const { URL, URLSearchParams } = require('url');

// Count literature items for a publisher via the website resource search endpoint.
async function literature(publishingOrganizationKey) {
  const url = new URL('https://www.gbif.org/api/resource/search');
  const params = {
    contentType: 'literature',
    publishingOrganizationKey: publishingOrganizationKey,
    limit: 0, // only the total count is needed, not the results
  };
  url.search = new URLSearchParams(params).toString();
  const response = await fetch(url);
  return (await response.json()).count;
}

// POST a query to the GBIF staging GraphQL endpoint.
async function graphql(query) {
  return fetch('https://graphql.gbif-staging.org/graphql', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Accept': 'application/json',
    },
    body: JSON.stringify({
      query: query
    })
  });
}

// Count unique literature IDs across all datasets of a publisher, via GraphQL.
async function literatureGraphql(publishingOrg) {
  const MAX_LIMIT = 2**31-1; // high limit, to work around the default pagination
  const response = await graphql(`{
    datasetSearch(publishingOrg: "${publishingOrg}", limit: ${MAX_LIMIT}) {
      results {
        key
      }
    }
  }`);
  const ids = [];
  const datasets = (await response.json()).data.datasetSearch.results;
  for (const index in datasets) {
    const documents = await graphql(`{
      literatureSearch(gbifDatasetKey: "${datasets[index].key}", limit: ${MAX_LIMIT}) {
        documents { results { id } }
      }
    }`);
    const results = (await documents.json()).data.literatureSearch.documents.results;
    results.forEach(r => ids.push(r.id));
  }
  return [...new Set(ids)].length; // deduplicate: the same paper may cite several datasets
}

// Compare the two counts for a given publisher.
async function compare(publishingOrg) {
  const count1 = await literature(publishingOrg);
  const count2 = await literatureGraphql(publishingOrg);
  if (count1 != count2) {
    console.log(`literature count differs: ${count1} (GBIF API) vs ${count2} (GBIF GraphQL API)`);
  }
}

// NINA
compare('46fec380-8e1d-11dd-8679-b8a03c50a862');
EOF

I used NINA as the organisation so I could compare the results. This is what I get: literature count differs: 390 (GBIF API) vs 130 (GBIF GraphQL API).

390 is the number shown at https://www.gbif.org/publisher/46fec380-8e1d-11dd-8679-b8a03c50a862; 130 is computed by fetching all the datasets (60 at the moment for NINA), getting the literature documents for each one of them, and skipping the duplicates. It is not clear to me why there is such a difference. I guess it could be due to a misunderstanding on my side, but it is not easy to spot, given that the GraphQL API is not documented.

@MortenHofft do you have any idea?

MortenHofft commented 2 years ago

I haven't looked at the above in detail, but you are using the wrong endpoint; it is internal to the website (you probably got it from looking at the network activity).

The API is documented at https://www.gbif.org/developer/literature. Example query: https://api.gbif.org/v1/literature/search?publishingOrganizationKey=46fec380-8e1d-11dd-8679-b8a03c50a862&limit=0

GraphQL is not an official API and is under constant change, so I wouldn't recommend using it for anything but play, but you can inspect the schema in the playground. It is likely to change and might have bugs as well. Here is how to get the count:

query{
  literatureSearch(publishingOrganizationKey: ["46fec380-8e1d-11dd-8679-b8a03c50a862"]) {
    documents {
      count
    }
  }
}

Both results say 390.

I wouldn't recommend iterating through keys, but if you do, I would use the production APIs. Here is an example: https://codepen.io/hofft/pen/vYWqXEP?editors=1010. For some reason I only get 388, but at least it is close to 390; I probably did something wrong.

Hope that helps. Thank you for creating that issue.

UPDATE: I see that GraphQL for literature does not respect the limit, so you have only been looking at the first 20 results. I'll get that fixed. But as said, use the production APIs for anything but play.

frafra commented 2 years ago

I am looping through the datasets because datasetSearch can be filtered by networkKey, which gives us a workaround for getting citation statistics for the LivingNorway network. To verify the procedure, I tried filtering by publishingOrg instead, just to check whether my assumptions about the GraphQL API are correct.

We used GraphQL for this workaround because it allows grouping several queries into a single request.
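
For example, reusing the graphql() helper from the script above, two counts can be grouped into one request with aliases (a sketch; the field names are the ones used elsewhere in this thread):

```js
// One GraphQL request, two aliased literatureSearch counts.
// Assumes the graphql() helper defined in the earlier script, called inside an async function.
const response = await graphql(`{
  nina: literatureSearch(publishingOrganizationKey: ["46fec380-8e1d-11dd-8679-b8a03c50a862"]) {
    documents { count }
  }
  oneDataset: literatureSearch(gbifDatasetKey: "ca0d8107-a2bd-47a1-91a1-250179b534ec") {
    documents { count }
  }
}`);
const { data } = await response.json();
console.log(data.nina.documents.count, data.oneDataset.documents.count);
```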

I tried applying the same looping to the standard API, and I get 388 too, which is consistent. I can confirm that the limit parameter on the GraphQL literature query has no effect.

MortenHofft commented 2 years ago

Yeah, I figured that was the reasoning. Let us hope the issue gets picked up instead. This isn't ideal.

Instead of looping, which is sloooow, may I suggest the below instead:

// query
query($keys: [ID]){
  literatureSearch(gbifDatasetKey: $keys) {
    documents{
      count
    }
  }
}

//variables - the keys you get from your dataset search
{
  "keys": [
    "aea17af8-5578-4b04-b5d3-7adf0c5a1e60",
    "19fe96b0-0cf3-4a2e-90a5-7c1c19ac94ee",
    "346a9b13-5c96-4793-bcd7-d6614950e726",
    "4a00502d-6342-4294-aad1-9727e5c24041",
    "a639542a-654a-427b-9cf1-bde1953bbb52",
    "2c28a663-db16-48d3-9cd0-3c7ec8d8d873",
    "b49a2978-0e30-4748-a99f-9301d17ae119",
    "23217c21-b3e3-4a39-b699-14924cdc1ad3",
    "d0e09c39-ec7a-4821-8232-027f8e56e302",
    "b2824629-9acc-4c49-827e-e560ab438758",
    "08a41903-7411-479a-929d-4a0c9bc40b31",
    "84b9a51f-ec2e-41dc-9d7a-1e3aa411b939",
    "bdaa0157-b8bd-4106-b943-61f1dfcc9792",
    "9801530f-ab2f-4913-9050-d7239d12aed0",
    "6a948a1c-7e23-4d99-b1c1-ec578d0d3159",
    "9ea87732-b88e-488d-a02b-3dc6e9b885e0",
    "ae77cf87-0f7f-4a08-91cc-5d55230fb421",
    "3cf0df27-6416-46a2-ad5e-7d970e5d1a19",
    "c47f13c1-7427-45a0-9f12-237aad351040",
    "594e0bf8-d08c-4a69-9c76-1c620554e719",
    "a60ba16c-b79d-4de5-b697-7bc1df464529",
    "1571f0b4-efac-4e12-86ea-eb57ba6b5b43",
    "edaf95d8-2c34-4792-8170-04d7c79a5a89",
    "a8a9eb9b-ce61-421b-98f0-d36cf06dbfbb",
    "520909f1-dbca-411e-996c-448d87c6a700",
    "39df870d-a03d-434a-a30f-c21f82f2bcba",
    "a12cba4c-c620-4e76-aa22-0ec988824b6d",
    "f1af1d2b-2d22-48e5-857c-0048710e0c16",
    "4a832966-48ad-4d83-858c-044705f74cac",
    "d062e651-c3e6-4e6c-9544-8d573df5af30",
    "23264cb3-b606-475d-81fb-5f22de3c2368",
    "e81deebe-fff1-44d7-b15e-207a859f0e2f",
    "9d1faf2a-1383-43a3-9862-2d5028eca053",
    "b9b2019d-fd2d-4a17-931a-8b5e050a38bd",
    "ca0d8107-a2bd-47a1-91a1-250179b534ec",
    "ced15900-3164-48a3-8b62-12f818cd9fac",
    "36526979-9f6a-47c3-8ca8-b286b8899729",
    "4dac6899-124a-40fc-90d3-aa5a872c99c0",
    "d27fd37e-a619-4326-b294-b3661bee57fe",
    "ea0bb0f7-84cc-4541-96a2-837003f31d99",
    "354aeeea-31ba-4806-a3c4-1a0ec4ed4b51",
    "212e4844-e458-4f2a-a645-c3231364b202",
    "b848f1f3-3955-4725-8ad8-e711e4a9e0ac",
    "ef2ce669-2030-40a9-95a8-5dc0b6e991ad",
    "377b0deb-3dba-472b-97ca-965f7231ee38",
    "cf645849-cbf8-48b7-82ea-a9f132612d62",
    "6f09416a-7540-420a-ab66-d3223ff3af48",
    "7d1136a6-2dbe-48ec-ae09-efcc83a0550b",
    "5af4f4e2-9e95-4374-b4f6-b1f2adf357ff",
    "6db17e9d-47d8-459e-8173-2d32ffa99470",
    "1c8154f9-a6a9-4302-b09a-78ab079dccc8",
    "73e2db01-1a15-4b66-95b9-a909ab0f69aa",
    "c487db8c-c128-437f-9110-ff745b8b00a9",
    "bb94d12c-8a31-4ed1-b554-92fe929ad6c4",
    "126340b0-b00e-44df-9702-e6b6bcdabf27",
    "a8f9f7cf-9c81-4013-8de0-8038b848115f",
    "dd3f037e-f15f-47ef-9bc9-97a645d020dc",
    "819b724f-d32a-48ac-a62a-9a3425b9b0a0",
    "b7d7acc7-bf3e-4ccb-a96f-05c638828915",
    "943fd513-dbe9-4b0e-b7b1-c5520e402b4f"
    ]
}
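
Sent with fetch, the query and the variables go in the same POST body. A minimal sketch (assuming Node 18+ as an ES module, and using the staging endpoint from the script above):

```js
// Send the bulk query above with variables in a single POST.
// The keys array would come from your datasetSearch; only two keys are shown here.
const query = `query($keys: [ID]) {
  literatureSearch(gbifDatasetKey: $keys) {
    documents { count }
  }
}`;
const variables = {
  keys: [
    'aea17af8-5578-4b04-b5d3-7adf0c5a1e60',
    '19fe96b0-0cf3-4a2e-90a5-7c1c19ac94ee',
    // ...remaining dataset keys
  ],
};
const response = await fetch('https://graphql.gbif-staging.org/graphql', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query, variables }),
});
console.log((await response.json()).data.literatureSearch.documents.count);
```
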
frafra commented 2 years ago

Thank you for the suggestion. We had already moved from looping to bulk requests, but your code is much better. I see that it counts unique identifiers only, and without hitting the limit issue. It works just fine, thank you!

frafra commented 2 years ago

We stored the code we are using with CloudFlare Workers here: https://github.com/gbif/hp-living-norway/blob/a2bdae30e2bf7e04216c8b84d996f89159dedbf9/scripts/cloudflare-workers.js.

I think this issue can be closed. Thanks for the help!