datacite / pidgraph-notebooks-python

MIT License

Description notebook for user story 5 #3

Open mfenner opened 4 years ago

mfenner commented 4 years ago

As a student starting my own PhD work, I want to be able to find all dissertations on a given topic.

NB. This requires a fix for pagination in GraphQL, which is underway (https://github.com/datacite/lupo/pull/511) and should be ready next week.

datasome commented 4 years ago

@mfenner, I have now prototyped the above (see mybinder). The prototype notebook actually contains three different queries: "Machine learning", "COVID" and "Shakespeare", in order to contrast different trends.

I'm not yet sure how to paginate through results when issuing multiple GraphQL queries, but at worst I can always move back to a single "Machine learning" query once the above pagination fix in GraphQL has been deployed.

datasome commented 4 years ago

@mfenner, I have now documented the notebook for user story 5 in Markdown. Please note that the top Markdown table appears borderless in JupyterLab and on mybinder.org, but somehow not on GitHub.

I have also replaced the "COVID" query with "Ebola", to satisfy the requirement of showing a more interesting growth trend in the number of dissertations in recent years.

Finally, I have added pie charts showing the number of dissertations per repository. These revealed the source of the German words in the Shakespeare word cloud: the Universities of Heidelberg and (prominently) Vienna feature among the repositories.

datasome commented 4 years ago

@mfenner, to allow Frances to work on feedback on this user story, I have switched off the pagination functionality (which causes the 'Invalid AST Node' error) until we work out how to make gql pagination through results work. In addition, I am only fetching the first 100 results, because fetching 200 or more causes a 'Cannot return null for non-nullable field Creator.name' exception.
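For reference, a capped first-page request of the kind described above might be built roughly as follows. This is only a sketch: the `dissertations` field, its `query`/`first`/`after` arguments, the selected node fields, and the helper name `build_payload` are assumptions for illustration, not copied from the notebook.

```python
from typing import Any, Dict, Optional

# Assumed query shape; the field and argument names are illustrative guesses.
DISSERTATIONS_QUERY = """
query ($query: String!, $first: Int!, $after: String) {
  dissertations(query: $query, first: $first, after: $after) {
    totalCount
    pageInfo { hasNextPage endCursor }
    nodes { id publicationYear titles { title } }
  }
}
"""


def build_payload(search_term: str, first: int = 100,
                  after: Optional[str] = None) -> Dict[str, Any]:
    """Build a GraphQL request body (hypothetical helper).

    `first` caps the page size; `after` is the cursor of the previous page,
    or None to request the first page only.
    """
    return {
        "query": DISSERTATIONS_QUERY,
        "variables": {"query": search_term, "first": first, "after": after},
    }


# Only the first 100 results, with no cursor, i.e. pagination switched off.
payload = build_payload("Shakespeare")
```

Passing the search term and page size as variables keeps the query text fixed, so the same document can be reused for the "Machine learning" and "Ebola" queries.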

FrancesMadden commented 4 years ago

Feedback on documentation aspects:

Cell 1 - Inserting some sort of visual representation of what the results of the notebook will be, as there are many outputs for this one, perhaps just one visual.

I think the introductory sentences might be rephrased to be a bit clearer: 'This notebook uses the DataCite GraphQL API to retrieve all dissertations for three different queries: Shakespeare, Machine learning and Ebola. These queries illustrate trends in the number of dissertations created over time.'

Beneath 'Define and run GraphQL query'

Cell 115 - Is the comment at the beginning of the cell correct? 'Find all outputs FREYA project…'

datasome commented 4 years ago

@FrancesMadden, thank you for the comments. I have just pushed a change to address them - please let me know if anything is still outstanding.

datasome commented 4 years ago

@mfenner, I've just pushed the change to page through results, now retrieving e.g. all ~1700 records for the 'Machine Learning' query. The cursors on the GraphQL side work as expected, though for the 'Shakespeare' query (114 results) I observed the following (retrieving 100 results per page):

"pageInfo": { "hasNextPage": true, "endCursor": "MTU3MzMyMDEwMjAwMCwxMC4yNTM2NS90aGVzaXMuNjcwMg" }, "pageInfo": { "hasNextPage": true, "endCursor": "MTU5MTgwMzI5ODAwMCwxMC4yNTYwMi9nb2xkLjAwMDI4NzUw" }, "pageInfo": { "hasNextPage": true, "endCursor": null },

whereas after the second page I would have expected:

"pageInfo": { "hasNextPage": false, "endCursor": null },

I've coded around the above, but for the future imho it would be more intuitive to set "hasNextPage" to false when there's no "endCursor".
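A workaround along these lines could look like the sketch below: keep following "endCursor", but stop as soon as either "hasNextPage" is false or "endCursor" is null. This is not the notebook's actual code. The `paginate` and `fake_fetch` names are hypothetical, the cursor strings are placeholders, and the fake pages only mimic the three "pageInfo" responses quoted above (100 results, 14 results, then an empty page).

```python
from typing import Callable, Dict, Iterator, Optional


def paginate(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield nodes from every page, following endCursor values.

    `fetch_page(cursor)` is assumed to return a dict with "nodes" and
    "pageInfo" keys, as in a GraphQL connection.
    """
    cursor: Optional[str] = None
    while True:
        connection = fetch_page(cursor)
        yield from connection.get("nodes", [])
        page_info = connection.get("pageInfo", {})
        cursor = page_info.get("endCursor")
        # Stop when the server reports no next page, or when there is no
        # cursor to follow even though hasNextPage is still true.
        if not page_info.get("hasNextPage") or cursor is None:
            break


def fake_fetch(cursor: Optional[str]) -> Dict:
    """Stand-in for a real API call; simulates 114 results, 100 per page."""
    pages = {
        None: {
            "nodes": [{"id": i} for i in range(100)],
            "pageInfo": {"hasNextPage": True, "endCursor": "cursor-1"},
        },
        "cursor-1": {
            "nodes": [{"id": i} for i in range(100, 114)],
            "pageInfo": {"hasNextPage": True, "endCursor": "cursor-2"},
        },
        "cursor-2": {
            "nodes": [],
            "pageInfo": {"hasNextPage": True, "endCursor": None},
        },
    }
    return pages[cursor]


records = list(paginate(fake_fetch))
print(len(records))
```

Checking the cursor as well as the flag costs one extra, empty request for the 'Shakespeare' case, but it guarantees the loop terminates.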