The docker-compose configuration now starts ElasticSearch; all references to Solr and MongoDB have been removed.
A new "corpus test" has been created; see cayenne.corpus-test. The test can run against a corpus of varying size and verifies that citation matching works within a known threshold. An almost identical version of this test is included in a PR against the Solr version, so a direct comparison can be made. A scoring comparison of citation matching is attached below. A more complete comparison can be found here
| Original DOI | Matched DOI | Elastic | Solr |
| --- | --- | --- | --- |
| 10.1002/erv.2485 | 10.1002/erv.2485 | 100.273125 | 90.04283 |
| 10.1002/jnr.23820 | 10.1002/jnr.23820 | 81.95269 | 61.502754 |
| 10.1002/jnr.23992 | 10.1002/jnr.23992 | 104.20477 | 88.88707 |
| 10.1002/nur.21773 | 10.1002/nur.21773 | 95.3643 | 85.251755 |
| 10.1007/s00125-016-4154-6 | 10.1007/s00125-016-4154-6 | 92.42219 | 83.01552 |
| 10.1007/s00213-016-4480-x | 10.1007/s00213-016-4480-x | 93.45497 | 76.72064 |
| 10.1007/s10964-016-0591-2 | 10.1007/s10964-016-0591-2 | 92.13747 | 82.10659 |
| 10.1007/s11302-016-9551-2 | 10.1007/s11302-016-9551-2 | 96.21273 | 75.00612 |
| 10.1007/s11682-016-9638-y | 10.1007/s11682-016-9638-y | 100.7581 | 86.417206 |
| 10.1007/s13318-016-0388-4 | 10.1007/s13318-016-0388-4 | 120.53948 | 105.28446 |
| 10.1016/j.alcohol.2016.08.008 | 10.1016/j.alcohol.2016.08.008 | 90.90022 | 78.548706 |
| 10.1016/j.bbi.2016.10.007 | 10.1016/j.bbi.2016.10.007 | 91.129654 | 84.56702 |
| 10.1016/j.bbr.2016.10.035 | 10.1016/j.bbr.2016.10.035 | 101.12494 | 90.45204 |
| 10.1016/j.biopsycho.2016.12.010 | 10.1016/j.biopsycho.2016.12.010 | 88.34703 | 75.23803 |
| 10.1016/j.bmc.2016.10.035 | 10.1016/j.bmc.2016.10.035 | 110.07812 | 94.04622 |
| 10.1016/j.explore.2016.10.009 | 10.1016/j.explore.2016.10.009 | 85.96247 | 69.95195 |
| 10.1016/j.infbeh.2016.09.006 | 10.1016/j.infbeh.2016.09.006 | 100.5378 | 86.74484 |
| 10.1016/j.jad.2016.10.035 | 10.1016/j.jad.2016.10.035 | 61.279423 | 53.282127 |
| 10.1016/j.jad.2016.11.036 | 10.1016/j.jad.2016.11.036 | 90.15741 | 81.84434 |
| 10.1016/j.jad.2016.11.046 | 10.1016/j.jad.2016.11.046 | 123.41971 | 103.81766 |
| 10.1016/j.joms.2016.10.033 | 10.1016/j.joms.2016.10.033 | 85.54165 | 76.20327 |
| 10.1016/j.neubiorev.2016.12.003 | 10.1016/j.neubiorev.2016.12.003 | 75.98824 | 72.51518 |
| 10.1016/j.neubiorev.2016.12.006 | 10.1016/j.neubiorev.2016.12.006 | 117.57448 | 91.75167 |
| 10.1016/j.neubiorev.2016.12.013 | 10.1016/j.neubiorev.2016.12.013 | 97.39974 | 87.65282 |
| 10.1016/j.neulet.2016.11.064 | 10.1016/j.neulet.2016.11.064 | 108.19604 | 93.43173 |
| 10.1016/j.neuro.2016.11.006 | 10.1016/j.neuro.2016.11.006 | 97.35805 | 82.715225 |
| 10.1016/j.neurobiolaging.2016.11.014 | 10.1016/j.neurobiolaging.2016.11.014 | 100.97907 | 89.42957 |
| 10.1016/j.neuroimage.2016.12.046 | 10.1016/j.neuroimage.2016.12.046 | 85.69545 | 74.715836 |
| 10.1016/j.neuron.2016.09.039 | 10.1016/j.neuron.2016.09.039 | 67.21429 | 61.09139 |
| 10.1016/j.nicl.2016.11.014 | 10.1016/j.nicl.2016.11.014 | 86.846924 | 76.55675 |
| 10.1016/j.nlm.2016.10.006 | 10.1016/j.nlm.2016.10.006 | 106.93617 | 95.132774 |
| 10.1016/j.nlm.2016.11.008 | 10.1016/j.nlm.2016.11.008 | 65.49764 | 58.382977 |
| 10.1016/j.peptides.2016.11.001 | 10.1016/j.peptides.2016.11.001 | 97.45492 | 82.56204 |
| 10.1016/j.physbeh.2016.10.010 | 10.1016/j.physbeh.2016.10.010 | 87.95236 | 70.85148 |
| 10.1016/j.physbeh.2016.11.030 | 10.1016/j.physbeh.2016.11.030 | 99.73418 | 85.511086 |
| 10.1016/j.physbeh.2016.12.004 | 10.1016/j.physbeh.2016.12.004 | 116.00987 | 93.99558 |
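For readers who want to reproduce this kind of comparison, here is a minimal Python sketch of how the rows above could be checked. It is not the code in cayenne.corpus-test. The function names and the `MIN_MATCH_RATE` value are assumptions for illustration, and "within a known threshold" is read here as a minimum fraction of citations matched to the expected DOI, which may differ from what the real test does. Only three rows are copied from the table.

```python
from statistics import mean

# (original DOI, matched DOI, Elastic score, Solr score)
# Three rows copied from the comparison table; the rest are omitted for brevity.
ROWS = [
    ("10.1002/erv.2485", "10.1002/erv.2485", 100.273125, 90.04283),
    ("10.1002/jnr.23820", "10.1002/jnr.23820", 81.95269, 61.502754),
    ("10.1016/j.jad.2016.10.035", "10.1016/j.jad.2016.10.035", 61.279423, 53.282127),
]

# Hypothetical threshold: the real value is defined by the corpus test, not here.
MIN_MATCH_RATE = 0.95


def match_rate(rows):
    """Fraction of citations whose matched DOI equals the original DOI."""
    hits = sum(1 for original, matched, _, _ in rows if original == matched)
    return hits / len(rows)


def mean_score_ratio(rows):
    """Average Elastic/Solr score ratio; above 1 means Elastic scored higher."""
    return mean(elastic / solr for _, _, elastic, solr in rows)


if __name__ == "__main__":
    rate = match_rate(ROWS)
    print(f"match rate: {rate:.2%}")
    print(f"mean Elastic/Solr score ratio: {mean_score_ratio(ROWS):.3f}")
    assert rate >= MIN_MATCH_RATE, "citation matching fell below the threshold"
```

The raw scores come from different engines and are not directly comparable in absolute terms; the useful signal is that each citation resolves to the same DOI in both.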
There has been a lot of "code clean up", most of it done in the early phase of the elastic branch. A few removals worth mentioning:
OAI harvester
Datomic-backed graph API
HTML landing page interrogation
Datacite XML parser
DOI metadata quality checker
Web of Knowledge parser
Resolution URL checker
Citation analysis
DOAJ code
Old patent deposit code (now handled by event data)
Deposits API
/licenses route (in favour of license facet)
Old code for citation checking
Index settings are configured to closely match Solr; in particular, the number of shards used by the work index matches the Solr production deployment.
There is scope to change this in the future, but it is worth keeping in mind that scoring is shard-local, so the number of shards directly affects scoring. In theory this should even out over a large enough corpus.
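To illustrate why shard-local scoring matters, the sketch below computes the inverse document frequency of one term per shard and over the whole corpus. It assumes BM25 similarity with the Lucene IDF formula; the shard sizes and term counts are invented for the example and do not describe the work index.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """Lucene BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical shards: (documents in shard, documents in shard containing the term)
SHARDS = [(1000, 10), (1000, 40)]

# Each shard scores with its own local statistics.
per_shard_idf = [bm25_idf(n, df) for n, df in SHARDS]

# What the IDF would be if statistics were pooled across all shards.
global_idf = bm25_idf(
    sum(n for n, _ in SHARDS),
    sum(df for _, df in SHARDS),
)

if __name__ == "__main__":
    for i, idf in enumerate(per_shard_idf):
        print(f"shard {i}: idf = {idf:.3f}")
    print(f"global:  idf = {global_idf:.3f}")
```

The same term is weighted more heavily on the shard where it is rare than on the shard where it is common, so identical documents can score differently depending on where they land. With a large corpus and even document routing, the per-shard statistics converge toward the global ones, which is the "evening out" described above.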
Index Structures
Much of the underlying index structure was already in place in the elastic branch. I have only changed it where doing so fixed an issue.
Changed year to be non-numeric here. The reasons are explained in the commit message.
I also ported the mappings required for new features on master, e.g. peer reviews and ISBN types.
Concerns
The changes in this PR range somewhat wider than just swapping in ElasticSearch: as the highlights above show, there has been a general clean-up and removal of "old code". A large portion of the functionality is covered by the existing high-level automated tests, which pass. However, there may be untested areas that will require testing after deployment.
WIP PR
Purpose
This pull request migrates away from Solr and MongoDB to ElasticSearch.
Highlights
Solr and MongoDB have been removed in favour of ElasticSearch for all data storage. ElasticSearch indexes exist for all of the core data types: