Closed: romanchyla closed this issue 2 years ago
Thanks, lots of stuff here to comment on, so we may want to break things up.
(Regarding facet.sort=index+desc): I keep finding examples which document this as an implemented behavior, so the only explanation I can come up with is this: is it possible that we are not using the JSON facet API in montysolr (https://solr.apache.org/guide/7_3/json-facet-api.html)? It looks like this was introduced in Solr 7.2, and I don't remember where montysolr aligned itself. There may be advantages to enabling or upgrading to this. So if we could accomplish what we want using the sort mechanism, I think this would be a win-win
good point, the old faceting component cannot sort descending, but the new JSON API is there. We can't use it in our /query
API yet because the JSON parameters need to be passed via POST. Until now there wasn't anything we weren't able to do with GET, but it is time to reconsider...
so here is an example against solr:
rchyla@tp1:/dvt/workspace/projects/it$ curl 'http://localhost:9987/solr/collection1/query' -d 'q=star&stats=true&stats.field=citation_count&fl=id&json.facet={citation_count:{type:terms,field:citation_count,sort:{index: desc}}}'
{
  "responseHeader": {
    "status": 0,
    "QTime": 95,
    "params": {
      "q": "star",
      "json.facet": "{citation_count:{type:terms,field:citation_count,sort:{index: desc}}}",
      "stats": "true",
      "fl": "id",
      "stats.field": "citation_count"
    }
  },
  "response": {
    "numFound": 491509,
    "start": 0,
    "docs": [
      {"id": "2562931"}, {"id": "4024033"}, {"id": "1615336"}, {"id": "2179614"}, {"id": "1865065"},
      {"id": "1942400"}, {"id": "2017455"}, {"id": "1615335"}, {"id": "1941174"}, {"id": "1154082"}
    ]
  },
  "facets": {
    "count": 491509,
    "citation_count": {
      "buckets": [
        {"val": 12408, "count": 1},
        {"val": 10387, "count": 1},
        {"val": 9405, "count": 1},
        {"val": 7622, "count": 1},
        {"val": 7331, "count": 1},
        {"val": 6936, "count": 1},
        {"val": 6562, "count": 1},
        {"val": 5541, "count": 1},
        {"val": 5521, "count": 1},
        {"val": 5164, "count": 1}
      ]
    }
  },
  "stats": {
    "stats_fields": {
      "citation_count": {
        "min": 0.0,
        "max": 12408.0,
        "count": 491458,
        "missing": 51,
        "sum": 9369611.0,
        "sumOfSquares": 3.152401475E9,
        "mean": 19.064927216567845,
        "stddev": 77.78770589505883
      }
    }
  }
}
This is also by far the most efficient execution, so it is definitely the best choice.
As an aside to the aside about non/refereed (which becomes moot if we remove the pivots): I'm not following - let's say the refereed and non-refereed values were exact duplicates; with the existing logic in place, the graph right now contains
x: 1 , 1, 1, 1, 1, 1
y: 10, 10, 9, 9, 8, 8
whereas when not pivoted
x: 2, 2, 2
y: 10, 9, 8
so if we disable pivots, we'll do the latter (which is "more" correct imo)
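For illustration only, here is a minimal sketch of that non-pivoted ("more correct") aggregation; the helper name mergeBuckets and the {val, count} bucket shape (borrowed from the json.facet output above) are assumptions, not code that exists in graph_tabs.js:

// Hypothetical helper: combine two bucket lists (e.g. refereed and
// non-refereed) into a single list, summing counts for equal values.
function mergeBuckets(refereed, nonRefereed) {
  var totals = {};
  refereed.concat(nonRefereed).forEach(function (b) {
    totals[b.val] = (totals[b.val] || 0) + b.count;
  });
  // return buckets sorted by value, descending (matching sort:{index: desc})
  return Object.keys(totals)
    .map(Number)
    .sort(function (a, b) { return b - a; })
    .map(function (val) { return { val: val, count: totals[val] }; });
}

// With the duplicate refereed/non-refereed series from the example above,
// this yields one point per value with the counts added together:
// mergeBuckets([{val:10,count:1},{val:9,count:1},{val:8,count:1}],
//              [{val:10,count:1},{val:9,count:1},{val:8,count:1}])
// => [{val:10,count:2},{val:9,count:2},{val:8,count:2}]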
Hurray! Actually this GET request seems to work just fine:
https://api.adsabs.harvard.edu/v1/search/query?q=star&stats=true&stats.field=citation_count&fl=id&json.facet=%7Bcitation_count%3A%7Btype%3Aterms%2Cfield%3Acitation_count%2Csort%3A%7Bindex%3A%20desc%7D%7D%7D
Regarding the pivot topic, I think I misunderstood what you were saying, so yes, we are probably providing incorrect data in the plot.
So here is a summary of changes that should happen to the graphs (this is leaving out any changes to the display; if necessary they can be discussed later) -- this is really only about fixing the crucial issue of speed
If refData
(https://github.com/adsabs/bumblebee/blob/97a01ec3f9c35165ab80d01b0c8e980ef4b3f604/src/js/wraps/graph_tabs.js#L36) (and nonRefData
) are turned into maps (objects), the internal loop at https://github.com/adsabs/bumblebee/blob/97a01ec3f9c35165ab80d01b0c8e980ef4b3f604/src/js/wraps/graph_tabs.js#L85 can be avoided - this will turn the algorithm from quadratic to linear; a sketch follows below
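A minimal sketch of that change, assuming refData/nonRefData can be represented as arrays of {year, count} entries (the helper name toYearMap is made up here; it is not existing bumblebee code):

// Build a year -> count map once (O(n)), instead of scanning the
// refData array for every year in the inner loop (O(n^2)).
function toYearMap(entries) {
  var byYear = {};
  entries.forEach(function (e) {
    // a missing key yields undefined (not 0), hence the "|| 0"
    byYear[e.year] = (byYear[e.year] || 0) + e.count;
  });
  return byYear;
}

// Later, lookups are constant time:
// var refByYear = toYearMap(refData);
// var refCountForYear = refByYear[year] || 0;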
I tried to find out if we can limit pivots (there is an option called subfacets -- but it requires a separate query to be executed, which seems overkill). So for now, the query should stay as it is; the only change we should make is to change facet.limit=-1
to facet.limit=2000
. We don't have that many years, but it's still good to be consistent
please change to:
defaultQueryArguments: {
  'json.facet': '{citation_count:{type:terms,field:citation_count,sort:{index: desc},limit:2000}}',
  stats: 'true',
  'stats.field': 'citation_count',
},
We don't need any pivots and, best of all, the data is already in the order in which it can be passed to the graph, so no manipulation is needed (besides just pulling the data out; see the sketch below) -- the whole block of code at lines 139:182 can be removed
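For illustration, a sketch of what "pulling the data out" could look like against the json.facet response shape shown in the Solr example above; the function name and the {x, y, count} series shape are assumptions, not the format the graph widget actually expects:

// Buckets arrive already sorted by citation count descending
// (sort:{index: desc}), so they map straight onto a graph series.
function bucketsToSeries(solrResponse) {
  var facet = solrResponse.facets && solrResponse.facets.citation_count;
  var buckets = (facet && facet.buckets) || [];
  return buckets.map(function (b, i) {
    // x: position in the ranked list, y: the citation count itself
    return { x: i + 1, y: b.val, count: b.count };
  });
}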
The read graph should receive similar treatment to the citation graph; these are the parameters to use in the query:
defaultQueryArguments: {
  'json.facet': '{read_count:{type:terms,field:read_count,sort:{index: desc},limit:2000}}',
  stats: 'true',
  'stats.field': 'read_count',
},
Expected Behavior
should be fast
Actual Behavior
is slow
Steps to Reproduce
search
*:*
and display citations or any other graph
Discussion:
year graph is O(n^2) -- see this place:
https://github.com/adsabs/bumblebee/blob/97a01ec3f9c35165ab80d01b0c8e980ef4b3f604/src/js/wraps/graph_tabs.js#L85
it should be changed to linear; simply turn that refData into a map (object) and do refData[year] = (refData[year] || 0) + ... (JS returns undefined, not 0, for missing keys, hence the || 0)
citation graph:
also this algorithm is quadratic: https://github.com/adsabs/bumblebee/blob/97a01ec3f9c35165ab80d01b0c8e980ef4b3f604/src/js/wraps/graph_tabs.js#L152
example input:
the function is doing something potentially weird:
it cuts off the first 2000
so it is really reordering the displayed values, starting from the highest citation counts (which have very small numbers in the count field) and ending with the values that have the highest frequency; then it cuts off the list to retain only the first 2000 items
the weirdness is in this: refereed and non-refereed are lumped next to each other; so effectively, if a paper was cited 10K times by refereed literature and 9K times by non-refereed literature, there would be two data points in the graph, instead of just one point with 19K on the Y axis
and since the values are ordered by Y (descending), it just smooths out the line; IMO this is a very intentional manipulation of data (and kind of weird)
read graph:
uses the same algorithm as the citation graph
The problem:
the very first obvious thing to fix is the quadratic nature of the algorithm; this also affects the year graph displayed next to every search result (fortunately, we only have data for the past 50-100 years, so ~10,000 operations is not terrible, but it is still completely unnecessary and it will slow down clients)
next, the more controversial question is whether we can/should limit the number of facets (facet.limit=-1 vs facet.limit=2000)
so here is one better option (ask for values in a range; it would actually be more correct than what we are doing right now), but it is a two-step process
this option is by far the most correct and still very efficient
another option is to simply ask for facet.limit=2000; this is a one-step operation but potentially suffers from a drawback (see below)
From stats.field we know the highest possible Y value
So we know there is at minimum 1 paper with 13109 citations; we would also know the lowest citation_count (and how many papers have that number) in our set of 2000 - let's call it X. What we won't know is the region between 1..X
But if you look at the distribution of values inside citation_count you will notice that the maximum is 56K; yet 99% of papers have less than 2000 citations; and 99.9% will have less than 5000
So it is IMHO totally reasonable to cut off at facet.limit=2000; the range of the missing region 1..X will be about 5 in 99.9% of cases -- people won't even be able to notice that on a graph 100x150px wide
The range option is the most correct (a sketch of its two-step version follows below); the second option (facet.limit=2000) is the simplest; we should pick one of them -- the current implementation should not be allowed to stay -- it is quadratic, and if it receives thousands of facets, it is going to end up doing millions of operations (twice)
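To make the two-step range option concrete, here is a rough sketch in the style of the defaultQueryArguments snippets above; start/end/gap are the standard JSON range-facet parameters, but the gap heuristic, the +1 on end, and the variable names are assumptions, not agreed-upon values:

// Step 1: ask only for stats to learn the maximum citation_count.
var statsQuery = {
  rows: 0,
  stats: 'true',
  'stats.field': 'citation_count',
};

// Step 2: once stats.stats_fields.citation_count.max is known,
// request a range facet covering 0..max; gap controls the resolution.
function buildRangeFacetQuery(maxCitationCount) {
  // aim for roughly 2000 buckets regardless of the maximum value
  var gap = Math.max(1, Math.ceil(maxCitationCount / 2000));
  return {
    rows: 0,
    'json.facet':
      '{citation_count:{type:range,field:citation_count,start:0,end:' +
      (maxCitationCount + 1) + // +1 so the maximum itself falls inside the last bucket
      ',gap:' + gap + '}}',
  };
}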
TODO: we should use facet.method=fc for citation_count and read_counts
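If the classic (non-JSON) facet parameters remain in use anywhere, one way to express that TODO is Solr's per-field override syntax; this is only a sketch, and where (or whether) it gets wired in is not decided here:

// Classic facet component: force the 'fc' method per field.
var legacyFacetMethodParams = {
  'f.citation_count.facet.method': 'fc',
  'f.read_count.facet.method': 'fc',
};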