LuteOrg / lute-v3

LUTE = Learning Using Texts: learn languages through reading.
https://luteorg.github.io/lute-manual/
MIT License

Improve stats calculation performance #484

Closed: jzohrab closed this issue 1 month ago

jzohrab commented 1 month ago

The current stats calculation slows down badly as the page count increases. After some tinkering, it appears there are a few things that can be improved.

Branch: wip_issue_484_improve_stats_performance

Some debugging data

From the DebugTimer class, for a 100-page stats calculation of Pedro Paramo. Potential big items are marked with <<<

global summary ------------------
  get_paragraphs get_parsed_tokens: 0.087874
  get_paragraphs token.sort: 0.003998
  _find_all_terms_in_tokens single, query prep: 0.026687
  _find_all_terms_in_tokens mwords, loaded 3319 records: 0.009825
  _find_all_terms_in_tokens mwords, filtered ids: 0.415707      <<<<
  _find_all_terms_in_tokens union, exec query: 0.367185        <<<<
  get_paragraphs _find_all_terms_in_tokens: 0.820232
  get_paragraphs _split_tokens_by_paragraph: 0.003325
  TokenLocator setup: 0.048006
  TokenLocator matches: 0.355557   <<<<<
  TokenLocator build term matches: 0.091620
  get_renderable 1 create candidates: 0.673155      < This has the TokenLocator matches
  get_renderable 2 sort: 0.011532
  get_renderable 3 add originals: 0.034943
  get_renderable 4 ids: 0.016583
  get_renderable 5 ids: 0.007799
  get_renderable 6 final build: 0.007703
  _make_renderable_sentence get_renderable: 0.786776    < Mostly from " get_renderable 1 create candidates:"
  _make_renderable_sentence textitems: 0.086136
  get_paragraphs renderable_paragraphs load: 0.882575
  get_paragraphs done add status 0 terms: 0.002534
  get_status_distribution get_paragraphs: 1.780910

And for a 650-page stats calculation:

global summary ------------------
  get_paragraphs get_parsed_tokens: 0.697446
  get_paragraphs token.sort: 0.030537
  _find_all_terms_in_tokens single, query prep: 0.178715
  _find_all_terms_in_tokens mwords, loaded 3319 records: 0.048952
  _find_all_terms_in_tokens mwords, filtered ids: 3.073049     <<<<
  _find_all_terms_in_tokens union, exec query: 2.442185     <<<<
  get_paragraphs _find_all_terms_in_tokens: 5.741712    <<<<<
  get_paragraphs _split_tokens_by_paragraph: 0.027727
  TokenLocator setup: 0.317491
  TokenLocator matches: 2.716832
  TokenLocator build term matches: 0.356202
  get_renderable 1 create candidates: 4.503385   <<<<
  get_renderable 2 sort: 0.081563
  get_renderable 3 add originals: 0.368804
  get_renderable 4 ids: 0.112846
  get_renderable 5 ids: 0.051107
  get_renderable 6 final build: 0.049116
  _make_renderable_sentence get_renderable: 5.394524    <<<<
  _make_renderable_sentence textitems: 0.684427
  get_paragraphs renderable_paragraphs load: 6.139826
  get_paragraphs done add status 0 terms: 0.154836
  get_status_distribution get_paragraphs: 12.793263
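
The DebugTimer class used above is part of the Lute codebase; a minimal sketch of the pattern (hypothetical names, not the actual implementation) is a timer that accumulates elapsed time into named buckets and prints a global summary:

```python
import time
from collections import defaultdict

class DebugTimer:
    """Accumulate elapsed time into named buckets, printed as a summary."""

    _totals = defaultdict(float)

    def __init__(self, name):
        self.name = name
        self._last = time.perf_counter()

    def step(self, label):
        """Add time elapsed since the previous step to the bucket 'name label'."""
        now = time.perf_counter()
        DebugTimer._totals[f"{self.name} {label}"] += now - self._last
        self._last = now

    @classmethod
    def global_summary(cls):
        print("global summary ------------------")
        for key, total in cls._totals.items():
            print(f"  {key}: {total:.6f}")

# Usage:
timer = DebugTimer("get_paragraphs")
# ... parse tokens here ...
timer.step("get_parsed_tokens")
DebugTimer.global_summary()
```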

Possible to-do items

Try using ahocorapy

ref https://github.com/abusix/ahocorapy

I don't yet know how to use this, or how it would actually perform: that's work for another branch. It looks interesting, though. Basically, the code would load an ahocorapy automaton with all of the terms, and then use it to scan each text. A snippet demoing the idea:

# Load all terms from DB, mapping to IDs so that they can be re-fetched.
# Note: add zero-width space to start and end of all terms.
terms_to_ids = {
    'malaga': 42,
    'lacrosse': 55,
    'mallorca': 66,
    'orca': 77,
    'mallorca bella': 88
}

# Build tree.
from ahocorapy.keywordtree import KeywordTree
kwtree = KeywordTree(case_insensitive=True)
for k in terms_to_ids:
    kwtree.add(k)
kwtree.finalize()

# Find all the terms.  Add zero-width space as needed.
results = kwtree.search_all('malheur on mallorca bellacrosse mallorca')
for result in results:
    print(result)
    print(f"Term ID: {terms_to_ids[result[0]]}")
    # todo: calculate the real position of the token using the zero width space delim.

outputs:

('mallorca', 11)
Term ID: 66
('orca', 15)
Term ID: 77
('mallorca bella', 11)
Term ID: 88
('lacrosse', 23)
Term ID: 55
('mallorca', 32)
Term ID: 66
('orca', 36)
Term ID: 77

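For the "real position" todo in the snippet, one approach is to join the parsed tokens with a zero-width space before searching, then count delimiters before each match offset to recover the token index. A sketch of that idea (hypothetical helper, not Lute's actual code):

```python
ZWS = "\u200b"  # zero-width space used as a token delimiter

def token_index_for_offset(tokens, char_offset):
    """Map a character offset in the ZWS-joined string back to a token index."""
    joined = ZWS.join(tokens)
    # Each token boundary contributes one ZWS, so counting delimiters
    # before the match offset gives the index of the matched token.
    return joined.count(ZWS, 0, char_offset)

tokens = ["malheur", " ", "on", " ", "mallorca"]
joined = ZWS.join(tokens)
offset = joined.index("mallorca")
print(token_index_for_offset(tokens, offset))  # prints 4: tokens[4] == "mallorca"
```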
Changing the stats calculation to be an ajax call

Currently the stats are calculated before the homepage is rendered. That's slow, and maybe silly: it would be better to load the book stats graphs via an ajax call, because we could quickly return the currently cached data, and then update it afterwards once the cache recalculation is done.

Caching the Term positions for each Text page to be rendered

Currently Lute reparses the Text.text, finds all Terms in that text, and finds all their locations. We might be able to cache some of these results.
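
A minimal sketch of such a cache (hypothetical names, assuming positions only need recomputing when the page content changes) keys the cached term positions on the text id plus a hash of the content:

```python
import hashlib

# (text_id, content_hash) -> term positions for that page.
_position_cache = {}

def cached_term_positions(text_id, content, find_positions):
    """Recompute term positions only when a page's content has changed."""
    content_hash = hashlib.sha1(content.encode("utf-8")).hexdigest()
    key = (text_id, content_hash)
    if key not in _position_cache:
        _position_cache[key] = find_positions(content)
    return _position_cache[key]

# Usage: the (expensive) finder is only invoked on a cache miss.
calls = []
def finder(content):
    calls.append(content)
    return [("mallorca", 11)]

cached_term_positions(1, "on mallorca", finder)
cached_term_positions(1, "on mallorca", finder)
print(len(calls))  # prints 1: the second call was served from the cache
```

A real version would also need invalidation when Terms are added or edited, since new terms change the match results even for unchanged text.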

jzohrab commented 1 month ago

Done a fair amount, using ahocorapy, and ajaxing in the graph. Good enough for this iteration. Merged into develop.

jzohrab commented 1 month ago

In release 3.5.5.