The current stats calculation slows down terribly as the page count increases. After some tinkering, it appears that there are a few things that can be improved.
Branch: wip_issue_484_improve_stats_performance
Some debugging data
From the DebugTimer class, for 100 page stats calculation of Pedro Paramo. Potential big items are marked with <<<
Possible to-do items

Try using ahocorapy
ref: https://github.com/abusix/ahocorapy

I have no idea how to use this at the moment, and also no idea how it would actually perform: that's work for another branch. It looks interesting, though. Basically, the code would have to load an ahocorapy automaton with all of the Terms, and then use that automaton to check each text. Snippet demoing the idea:
```python
# Load all terms from the DB, mapping them to IDs so that they can be re-fetched.
# Note: add a zero-width space to the start and end of all terms.
terms_to_ids = {
    'malaga': 42,
    'lacrosse': 55,
    'mallorca': 66,
    'orca': 77,
    'mallorca bella': 88
}

# Build the tree.
from ahocorapy.keywordtree import KeywordTree

kwtree = KeywordTree(case_insensitive=True)
# Load-test the build: add the terms over and over to see how tree
# construction scales, printing a progress count every 1000 iterations.
for i in range(0, 100000):
    if i % 1000 == 0:
        print(i)
    for k in terms_to_ids.keys():
        kwtree.add(k)
kwtree.finalize()

# Find all the terms. Add zero-width space as needed.
results = kwtree.search_all('malheur on mallorca bellacrosse mallorca')
for result in results:
    print(result)
    print(f"Term ID: {terms_to_ids[result[0]]}")
    # todo: calculate the real position of the token using the zero-width-space delim.
```
outputs:
('mallorca', 11)
Term ID: 66
('orca', 15)
Term ID: 77
('mallorca bella', 11)
Term ID: 88
('lacrosse', 23)
Term ID: 55
('mallorca', 32)
Term ID: 66
('orca', 36)
Term ID: 77
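The position todo in the snippet could work roughly like this, assuming the searched text keeps a zero-width space between tokens: the character offset returned by search_all converts to a token index by counting the delimiters before the match. A sketch only; char_pos_to_token_index is a made-up helper, not existing Lute code:

```python
ZWS = "\u200b"  # zero-width space, assumed to be the token delimiter

def char_pos_to_token_index(zws_text, char_pos):
    # The token index is just the number of delimiters before the match,
    # assuming tokens are joined with a single ZWS each.
    return zws_text.count(ZWS, 0, char_pos)

# Usage: search the delimited text, then recover the token position.
tokens = ["malheur", "on", "mallorca"]
text = ZWS.join(tokens)
match_pos = text.find("mallorca")
print(char_pos_to_token_index(text, match_pos))  # 2, i.e. tokens[2]
```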
Changing the stats calculation to be an ajax call
Currently the stats are calculated before the homepage is rendered. That's slow, and maybe silly ... it might be better to load the book stats graphs via an ajax call, because we could quickly return the current cached data and then update the graphs once the cache calculations are done.
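A rough illustration of that flow (all names here are made up, not existing Lute code): the endpoint returns whatever is cached right away and kicks off the recalculation in the background, so only the very first request for a book has to wait.

```python
import threading

# Hypothetical in-memory cache standing in for Lute's cache table.
_stats_cache = {}
_cache_lock = threading.Lock()

def get_book_stats(book_id, calculate_stats):
    """Return cached stats immediately and refresh them in the background.

    calculate_stats is a stand-in for the real (slow) stats calculation;
    the ajax endpoint would JSON-serialize whatever this returns.
    """
    with _cache_lock:
        cached = _stats_cache.get(book_id)

    def refresh():
        fresh = calculate_stats(book_id)
        with _cache_lock:
            _stats_cache[book_id] = fresh

    worker = threading.Thread(target=refresh, daemon=True)
    worker.start()

    if cached is None:
        # First request for this book: nothing cached yet, so wait.
        worker.join()
        with _cache_lock:
            cached = _stats_cache[book_id]
    return cached
```

Subsequent requests get the possibly stale cached data instantly, and the page could re-poll to pick up the refreshed numbers.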
Caching the Term positions for each Text page to be rendered
Currently Lute reparses the Text.text, finds all Terms in that text, and finds all their locations. We might be able to cache some results:

- if the parsing results haven't changed from a prior parse, all previously positioned Terms would still have the same positions
- the current parse will differ from a prior parse:
  - if the language parsing settings have changed
  - if the page has been edited
  - etc.
- the parse results could be cheaply compared with something like an MD5 of all tokens, storing that hash in the cache table
- any new Terms created since the last render check would still have to be positioned
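The checksum check could be as simple as hashing the token stream; the helper names here are illustrative, not existing Lute code:

```python
import hashlib

def tokens_checksum(tokens):
    # MD5 over the parsed tokens, with a separator so that e.g.
    # ['ab', 'c'] and ['a', 'bc'] hash differently.
    joined = "\x1e".join(tokens)  # ASCII record separator
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def can_reuse_positions(cached_checksum, current_tokens):
    # If the checksum matches the one stored in the cache table, the
    # previously computed Term positions are still valid.
    return cached_checksum == tokens_checksum(current_tokens)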
Current stats calculation slows down terribly if the page count is increased. After some tinkering, it appears that there are a few things that can be improved.
Branch:
wip_issue_484_improve_stats_performance
Some debugging data
From the DebugTimer class, for 100 page stats calculation of Pedro Paramo. Potential big items are marked with
<<<
and for a 650 page stats calc:
Possible to-do items
Try using
ahocorapy
ref https://github.com/abusix/ahocorapy
I have no idea how to use this at the moment, and also have no idea how it would actually perform: that's work for another branch. Looks interesting though. Basically, the code would have to load an ahocorapy automaton with all of the terms, and then use that to check each text. Snippet demoing the idea:
outputs:
Changing the stats calculation to be an ajax call
Currently the stats are calculated before the homepage is rendered. That's slow and maybe silly ... better might be to ajax the book stats graphs, b/c we could quickly return the current cached data, and then update it afterwards when the cache calculations are done.
Caching the Term positions for each Text page to be rendered
Currently Lute reparses the Text.text, finds all Terms in that text, and finds all their locations. We might be able to cache some results: