Closed: romanchyla closed this issue 2 years ago
Thanks, lots of stuff here to comment on, so we may want to break things up.
(Regarding facet.sort=index+desc): I keep finding examples which document this as an implemented behavior, so the only explanation I can come up with is this: is it possible that we are not using the JSON facet API in montysolr (https://solr.apache.org/guide/7_3/json-facet-api.html)? It looks like this was introduced in Solr 7.2, and I don't remember where montysolr aligned itself. There may be advantages to enabling or upgrading to this. So if we could accomplish what we want using the sort mechanism, I think this would be a win-win
good point, the old faceting component cannot sort descending, but the new JSON API is there. We can't use it in our /query
API yet because the JSON parameters need to be passed via POST. Until now there wasn't anything we weren't able to do with GET, but it is time to reconsider...
so here is an example against solr:
rchyla@tp1:/dvt/workspace/projects/it$ curl 'http://localhost:9987/solr/collection1/query' -d 'q=star&stats=true&stats.field=citation_count&fl=id&json.facet={citation_count:{type:terms,field:citation_count,sort:{index: desc}}}'
{
  "responseHeader": {
    "status": 0,
    "QTime": 95,
    "params": {
      "q": "star",
      "json.facet": "{citation_count:{type:terms,field:citation_count,sort:{index: desc}}}",
      "stats": "true",
      "fl": "id",
      "stats.field": "citation_count"
    }
  },
  "response": {
    "numFound": 491509,
    "start": 0,
    "docs": [
      {"id": "2562931"}, {"id": "4024033"}, {"id": "1615336"}, {"id": "2179614"}, {"id": "1865065"},
      {"id": "1942400"}, {"id": "2017455"}, {"id": "1615335"}, {"id": "1941174"}, {"id": "1154082"}
    ]
  },
  "facets": {
    "count": 491509,
    "citation_count": {
      "buckets": [
        {"val": 12408, "count": 1},
        {"val": 10387, "count": 1},
        {"val": 9405, "count": 1},
        {"val": 7622, "count": 1},
        {"val": 7331, "count": 1},
        {"val": 6936, "count": 1},
        {"val": 6562, "count": 1},
        {"val": 5541, "count": 1},
        {"val": 5521, "count": 1},
        {"val": 5164, "count": 1}
      ]
    }
  },
  "stats": {
    "stats_fields": {
      "citation_count": {
        "min": 0.0,
        "max": 12408.0,
        "count": 491458,
        "missing": 51,
        "sum": 9369611.0,
        "sumOfSquares": 3.152401475E9,
        "mean": 19.064927216567845,
        "stddev": 77.78770589505883
      }
    }
  }
}
This is also by far the most efficient execution, so it is definitely the best choice.
As an aside to the aside about non/refereed (which becomes moot if we remove the pivots): I'm not following - let's say the refereed and non-refereed values were exact duplicates; with the existing logic in place, the graph right now contains
x: 1 , 1, 1, 1, 1, 1
y: 10, 10, 9, 9, 8, 8
whereas when not pivoted
x: 2, 2, 2
y: 10, 9, 8
so if we disable pivots, we'll do the latter (which is "more" correct imo)
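For illustration only, here is a minimal sketch of that non-pivoted ("more correct") aggregation; the helper name mergeBuckets and the {val, count} bucket shape (borrowed from the json.facet output above) are assumptions, not code that exists in graph_tabs.js:

// Hypothetical helper: combine two bucket lists (e.g. refereed and
// non-refereed) into a single list, summing counts for equal values.
function mergeBuckets(refereed, nonRefereed) {
  var totals = {};
  refereed.concat(nonRefereed).forEach(function (b) {
    totals[b.val] = (totals[b.val] || 0) + b.count;
  });
  // return buckets sorted by value, descending (matching sort:{index: desc})
  return Object.keys(totals)
    .map(Number)
    .sort(function (a, b) { return b - a; })
    .map(function (val) { return { val: val, count: totals[val] }; });
}

// With the duplicate refereed/non-refereed series from the example above,
// this yields one point per value with the counts added together:
// mergeBuckets([{val:10,count:1},{val:9,count:1},{val:8,count:1}],
//              [{val:10,count:1},{val:9,count:1},{val:8,count:1}])
// => [{val:10,count:2},{val:9,count:2},{val:8,count:2}]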
Hurray! Actually this GET request seems to work just fine:
https://api.adsabs.harvard.edu/v1/search/query?q=star&stats=true&stats.field=citation_count&fl=id&json.facet=%7Bcitation_count%3A%7Btype%3Aterms%2Cfield%3Acitation_count%2Csort%3A%7Bindex%3A%20desc%7D%7D%7D
Regarding the pivot topic, I think I misunderstood what you were saying, so yes, we are probably providing incorrect data in the plot.
So here is a summary of changes that should happen to the graphs (this is leaving out any changes to the display; if necessary they can be discussed later) -- this is really only about fixing the crucial issue of speed
If refData
(https://github.com/adsabs/bumblebee/blob/97a01ec3f9c35165ab80d01b0c8e980ef4b3f604/src/js/wraps/graph_tabs.js#L36) (and nonRefData
) are turned into maps (objects), the internal loop at https://github.com/adsabs/bumblebee/blob/97a01ec3f9c35165ab80d01b0c8e980ef4b3f604/src/js/wraps/graph_tabs.js#L85 can be avoided - this will turn the algorithm from quadratic to linear; a sketch follows below
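A minimal sketch of that change, assuming refData/nonRefData can be represented as arrays of {year, count} entries (the helper name toYearMap is made up here; it is not existing bumblebee code):

// Build a year -> count map once (O(n)), instead of scanning the
// refData array for every year in the inner loop (O(n^2)).
function toYearMap(entries) {
  var byYear = {};
  entries.forEach(function (e) {
    // a missing key yields undefined (not 0), hence the "|| 0"
    byYear[e.year] = (byYear[e.year] || 0) + e.count;
  });
  return byYear;
}

// Later, lookups are constant time:
// var refByYear = toYearMap(refData);
// var refCountForYear = refByYear[year] || 0;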
I tried to find out if we can limit pivots (there is an option called subfacets -- but it requires a separate query to be executed, which seems overkill). So for now, the query should stay as it is; the only change we should make is to change facet.limit=-1
to facet.limit=2000
. We don't have that many years, but it's still good to be consistent
please change to:
defaultQueryArguments: {
  'json.facet': '{citation_count:{type:terms,field:citation_count,sort:{index: desc},limit:2000}}',
  stats: 'true',
  'stats.field': 'citation_count',
},
We don't need any pivots and, best of all, the data is already in the order in which it can be passed to the graph, so no manipulation is needed (besides just pulling the data out; see the sketch below) -- the whole block of code at lines 139:182 can be removed
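For illustration, a sketch of what "pulling the data out" could look like against the json.facet response shape shown in the Solr example above; the function name and the {x, y, count} series shape are assumptions, not the format the graph widget actually expects:

// Buckets arrive already sorted by citation count descending
// (sort:{index: desc}), so they map straight onto a graph series.
function bucketsToSeries(solrResponse) {
  var facet = solrResponse.facets && solrResponse.facets.citation_count;
  var buckets = (facet && facet.buckets) || [];
  return buckets.map(function (b, i) {
    // x: position in the ranked list, y: the citation count itself
    return { x: i + 1, y: b.val, count: b.count };
  });
}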
The read graph should receive similar treatment to the citation graph; these are the parameters to use in the query:
defaultQueryArguments: {
  'json.facet': '{read_count:{type:terms,field:read_count,sort:{index: desc},limit:2000}}',
  stats: 'true',
  'stats.field': 'read_count',
},
Expected Behavior
should be fast
Actual Behavior
is slow
Steps to Reproduce
search
*:*
and display citations or any other graph
Discussion:
year graph is O(n^2) -- see this place:
https://github.com/adsabs/bumblebee/blob/97a01ec3f9c35165ab80d01b0c8e980ef4b3f604/src/js/wraps/graph_tabs.js#L85
it should be changed to linear; simply turn that refData into a map (object) and do refData[year] = (refData[year] || 0) + ... (JS returns undefined, not 0, for missing keys, hence the || 0)
citation graph:
also this algorithm is quadratic: https://github.com/adsabs/bumblebee/blob/97a01ec3f9c35165ab80d01b0c8e980ef4b3f604/src/js/wraps/graph_tabs.js#L152
example input:
the function is doing something potentially weird:
it cuts off the first 2000
so it is really reordering the displayed values, starting from the highest citation counts (which have very small numbers in the count field) and ending with the values that have the highest frequency; then it cuts off the list to retain only the first 2000 items
the weirdness is in this: refereed and non-refereed are lumped next to each other; so effectively, if a paper was cited 10K times by refereed literature and 9K times by non-refereed literature, there would be two data points in the graph, instead of just one point with 19K on the Y axis
and since the values are ordered by Y (descending), it just smooths out the line; IMO this is a very intentional manipulation of data (and kind of weird)
read graph:
uses the same algorithm as the citation graph
The problem:
the very first obvious thing to fix is the quadratic nature of the algorithm; this also affects the year graph displayed next to every search result (fortunately, we only have data for the past 50-100 years, so ~10,000 operations is not terrible, but it is still completely unnecessary and it will slow down clients)
next, the more controversial question is whether we can/should limit the number of facets (facet.limit=-1 vs facet.limit=2000)
so here is one better option (ask for values in a range; it would actually be more correct than what we are doing right now), but it is a two-step process
this option is by far the most correct and still very efficient
another option is to simply ask for facet.limit=2000; this is a one-step operation but potentially suffers from a drawback (see below)
From stats.field we know the highest possible Y value
So we know there is at minimum 1 paper with 13109 citations; we would also know the lowest citation_count (and how many papers have that number) in our set of 2000 - let's call it X. What we won't know is the region between 1..X
But if you look at the distribution of values inside citation_count you will notice that the maximum is 56K; yet 99% of papers have less than 2000 citations; and 99.9% will have less than 5000
So it is IMHO totally reasonable to cut off at facet.limit=2000; the range of the missing region 1..X will be about 5 in 99.9% of cases -- people won't even be able to notice that on a graph 100x150px wide
The range option is the most correct (a sketch of its two-step version follows below); the second option (facet.limit=2000) is the simplest; we should pick one of them -- the current implementation should not be allowed to stay -- it is quadratic, and if it receives thousands of facets, it is going to end up doing millions of operations (twice)
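To make the two-step range option concrete, here is a rough sketch in the style of the defaultQueryArguments snippets above; start/end/gap are the standard JSON range-facet parameters, but the gap heuristic, the +1 on end, and the variable names are assumptions, not agreed-upon values:

// Step 1: ask only for stats to learn the maximum citation_count.
var statsQuery = {
  rows: 0,
  stats: 'true',
  'stats.field': 'citation_count',
};

// Step 2: once stats.stats_fields.citation_count.max is known,
// request a range facet covering 0..max; gap controls the resolution.
function buildRangeFacetQuery(maxCitationCount) {
  // aim for roughly 2000 buckets regardless of the maximum value
  var gap = Math.max(1, Math.ceil(maxCitationCount / 2000));
  return {
    rows: 0,
    'json.facet':
      '{citation_count:{type:range,field:citation_count,start:0,end:' +
      (maxCitationCount + 1) + // +1 so the maximum itself falls inside the last bucket
      ',gap:' + gap + '}}',
  };
}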
TODO: we should use facet.method=fc for citation_count and read_counts
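If the classic (non-JSON) facet parameters remain in use anywhere, one way to express that TODO is Solr's per-field override syntax; this is only a sketch, and where (or whether) it gets wired in is not decided here:

// Classic facet component: force the 'fc' method per field.
var legacyFacetMethodParams = {
  'f.citation_count.facet.method': 'fc',
  'f.read_count.facet.method': 'fc',
};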