Sparql performance improvement ?

PonteIneptique commented 6 years ago

Ok, I had to check some stuff :

Here is the performance of the resolver (without cache) :

from time import time

from capitains_nautilus.cts.resolver import SparqlAlchemyNautilusCTSResolver
from capitains_nautilus.cts.resolver import NautilusCTSResolver
from MyCapytain.common.constants import Mimetypes

# Number of export operations to time
timeit = 100

# Swap in NautilusCTSResolver here to measure the cache-based resolver
resolver = SparqlAlchemyNautilusCTSResolver(
    ["./tests/testing_data/latinLit2"],
    graph="sqlite:///2.sqlite"
)
resolver.parse()

print("Parsed 1")

# Time only the metadata export, not the parsing above
current = time()

for _ in range(timeit):
    resolver.getMetadata().export(Mimetypes.XML.CTS)
now = time()

print("{timeit} operations in {sub} : {opsec} sec/op".format(
    timeit=timeit,
    sub=now-current,
    opsec=(now-current)/timeit
))

Cache based : 100 operations in 0.6036763191223145 : 0.006036763191223145 sec/op
SQLAlchemy Based : 100 operations in 281.28646206855774 : 2.8128646206855774 sec/op
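
To compare the two resolvers side by side, the timing loop above can be wrapped in a small helper. This is only a sketch: `bench`, `fast_export` and `slow_export` are hypothetical names, and the two stand-in functions replace the real resolver calls so that the snippet runs without Nautilus installed.

```python
from time import perf_counter


def bench(func, repeat=100):
    """Call func() `repeat` times and return (total seconds, seconds per call)."""
    start = perf_counter()
    for _ in range(repeat):
        func()
    total = perf_counter() - start
    return total, total / repeat


def fast_export():
    # Hypothetical stand-in for the cache-based resolver export.
    return sum(range(100))


def slow_export():
    # Hypothetical stand-in for the SQLAlchemy-backed export.
    return sum(range(100000))


fast_total, fast_per_op = bench(fast_export, repeat=20)
slow_total, slow_per_op = bench(slow_export, repeat=20)
print("fast: {:.6f} sec/op".format(fast_per_op))
print("slow: {:.6f} sec/op".format(slow_per_op))
if fast_per_op > 0:
    print("slowdown: {:.1f}x".format(slow_per_op / fast_per_op))
```

With the real resolvers, the callable passed to `bench` would be something like `lambda: resolver.getMetadata().export(Mimetypes.XML.CTS)`, once for each resolver class.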

Obviously, both would be cached at the HTTP serving layer, but this still looks like a big loss: the SQLAlchemy-based resolver is roughly 466 times slower per operation. I need to do more research, as this benchmark does not take into account:

- the retriever losing what it holds in memory when the process goes to sleep (and the need to reparse)
- retrieving a single object, which I am pretty sure will be faster in this kind of context

Still, there is an obvious need to improve this performance.

PonteIneptique commented 6 years ago

[Screenshot: performance bottlenecks ordered by deepest units with own time]

[Screenshot: performance bottlenecks ordered by time]
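
For anyone who wants to reproduce the two orderings named above, here is a minimal sketch using the standard library `cProfile` and `pstats`. The thread does not say which profiler was used, so this tool choice is an assumption, and `inner`, `outer` and `profile_report` are hypothetical stand-ins for the real resolver call. Sorting by `"tottime"` ranks functions by their own time, while `"cumulative"` includes time spent in the functions they call.

```python
import cProfile
import io
import pstats


def inner(n):
    # Does the actual work in its own loop, so it accumulates "own" time.
    total = 0
    for i in range(n):
        total += i * i
    return total


def outer():
    # Mostly delegates to inner(), so its time is largely cumulative.
    results = []
    for _ in range(20):
        results.append(inner(20000))
    return results


def profile_report(func, sort_key, limit=5):
    """Profile func() and return the top `limit` rows sorted by `sort_key`."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats(sort_key).print_stats(limit)
    return stream.getvalue()


own_time_report = profile_report(outer, "tottime")
cumulative_report = profile_report(outer, "cumulative")
print(own_time_report)
print(cumulative_report)
```

To profile the benchmark itself, `outer` would be replaced by a function that runs the `getMetadata().export(...)` loop.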

PonteIneptique commented 6 years ago

Nevertheless, the changes made to MyCapytain were worth it: they allow Nautilus to stay up to date with the original system while making some improvements, and they also make it possible to run a graph store on top of it for real SPARQL queries.

PonteIneptique commented 6 years ago

A few ideas for improvement: have an "in-memory" cache of some data using some kind of SparqlGenerator singleton (thanks to @MrGecko for the idea).

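class InMemorySparqlCache(object):
    def __init__(self, classes, cache=None):
        # `classes` maps a type name such as "textgroup" to its constructor.
        # The original sketch used self.classes without defining it, so
        # passing it in here is an assumption.
        self.classes = classes
        self.generated = {}

    def generate_textgroup(self, identifier, *args, **kwargs):
        # Build the textgroup on first request, then reuse the stored instance.
        if identifier not in self.generated:
            self.generated[identifier] = self.classes["textgroup"](identifier, *args, **kwargs)
        return self.generated[identifier]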

This should be implemented on top of in-memory metadata caching for SparqlCollection objects.
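
A rough sketch of how the two layers could fit together, assuming nothing about the real SparqlCollection API: `FakeStore`, `CachedCollection` and `InMemoryCache` are hypothetical names, the identifier and title are made-up example data, and the store only counts queries so the effect of the caching is visible.

```python
class FakeStore(object):
    """Hypothetical stand-in for the SPARQL graph; counts queries issued."""

    def __init__(self, data):
        self.data = data
        self.queries = 0

    def query(self, identifier, key):
        self.queries += 1
        return self.data[identifier][key]


class CachedCollection(object):
    """Hypothetical stand-in for a SparqlCollection-like object."""

    def __init__(self, identifier, store):
        self.identifier = identifier
        self.store = store
        self._metadata = {}  # in-memory metadata cache, one entry per key

    def get_metadata(self, key):
        # Query the store only on the first access to a given key.
        if key not in self._metadata:
            self._metadata[key] = self.store.query(self.identifier, key)
        return self._metadata[key]


class InMemoryCache(object):
    """Object-level cache layered on top: one collection per identifier."""

    def __init__(self, store):
        self.store = store
        self.generated = {}

    def generate(self, identifier):
        if identifier not in self.generated:
            self.generated[identifier] = CachedCollection(identifier, self.store)
        return self.generated[identifier]


# Example data only; not taken from the test corpus.
store = FakeStore({"urn:cts:latinLit:phi1294": {"title": "Example title"}})
cache = InMemoryCache(store)
first = cache.generate("urn:cts:latinLit:phi1294")
second = cache.generate("urn:cts:latinLit:phi1294")
first.get_metadata("title")
second.get_metadata("title")
print(first is second, store.queries)  # prints "True 1"
```

Because both lookups return the same object, the second metadata access hits the per-object cache and the store is queried only once.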

PonteIneptique commented 3 years ago

Benched with #91

Capitains / Nautilus

Sparql performance improvement ? #67