Implement a resolver with a second backend for collections

PonteIneptique commented 7 years ago

Currently, our only resolver uses a backend that reads directly from XML or from the cache.

The new resolver should:

Inherit from the wonderful current resolver, so that we can switch from one to the other and keep improving both side by side.
Provide a connection to a backend for the RDFLib store, most probably through an ORM that allows multiple solutions, such as SQLAlchemy, or a more graph-oriented database (though less common for devs...) like Mongo (I could not find better examples for now).
1. Retrieving metadata about texts is already fully dependent on RDFLib, but...
2. The graph traversal of the collections will need to be rewritten to work with RDFLib, i.e. rewrite __getitem__(), .parent, .ancestors and .descendants to access this information through RDFLib. This mostly means that the new resolver should have its own Collection/Textgroup/etc. system. But again, the modifications would be light...
Implement caching of answers for this metadata (because the cache would then mostly be used as an actual cache).
(Optional) Potentially think about reusing the same backend to store the list of references for each text. That could speed up some other parts of the code (?)

sonofmun commented 7 years ago

Some Notes:

The new resolver (the Extended Resolver) would manage the store that already exists as a global in MyCapytain (as it stores all the metadata of all files). It should definitely reuse an RDFLib Store Adapter, most likely a SQLAlchemy one, because it provides another layer of adaptation ( https://github.com/RDFLib/rdflib-sqlalchemy ).
MyCapytain will be responsible for setting up the Graph, and Nautilus for adapting the store for the Extended Resolver. Nautilus would also need to provide subclasses of MyCapytain's collections to deal with the tree navigation that currently occurs in dictionaries and lists (.descendants, .children, .readableDescendants, etc.).
The collection metadata will be removed from the cache in favor of the rdflib store

sonofmun commented 7 years ago

More information on the current process:

The setup:

The resolver, with the resources that need parsing, is declared in https://github.com/OpenGreekAndLatin/leipzig_cts/blob/master/modules/capitains/templates/app.py.erb#L75-L79
The inventory is actually built with this information in https://github.com/OpenGreekAndLatin/leipzig_cts/blob/master/modules/capitains/templates/update_capitains_repos.rb.erb#L68-L71 : every time our corpora change, we rebuild some of the cache by running a parse to put the inventory in cache.
parse is called by the manager and goes into every XML file (text or metadata) to build some of the information needed: https://github.com/Capitains/Nautilus/blob/master/capitains_nautilus/cts/resolver.py#L161-L258
The app just calls the resolver in most of its queries. It is basically the core of the app.

Workflow when running

Any time we need to access metadata (name of a text, citation scheme, the text itself), we hit the inventory.
The inventory is defined here: https://github.com/Capitains/Nautilus/blob/master/capitains_nautilus/cts/resolver.py#L87-L91
If it has dropped out of the cache, it asks to reparse the whole thing. This is most likely the reason for the 502 errors, because reparsing can take a really long time for a normal process (i.e. this should not happen).
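
The cache-miss behaviour described above can be sketched as follows. This is not the Nautilus code linked above, only a simplified illustration: the CachedInventoryResolver class, the plain dict used as cache and the slow_parse function are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Optional


class CachedInventoryResolver:
    """Hypothetical stand-in for a resolver whose inventory lives in a cache."""

    CACHE_KEY = "inventory"

    def __init__(self, cache: Dict[str, dict], parse: Callable[[], dict]):
        self.cache = cache
        self.parse = parse  # expensive: walks every XML file

    @property
    def inventory(self) -> dict:
        inventory: Optional[dict] = self.cache.get(self.CACHE_KEY)
        if inventory is None:
            # Cache miss: the whole corpus is reparsed inside the request.
            inventory = self.parse()
            self.cache[self.CACHE_KEY] = inventory
        return inventory


parse_calls: List[int] = []


def slow_parse() -> dict:
    # Stand-in for the real parse; only records that it was called.
    parse_calls.append(1)
    return {"urn:cts:latinLit:phi1294": "Martial"}


cache: Dict[str, dict] = {}
resolver = CachedInventoryResolver(cache, slow_parse)
resolver.inventory  # first access: parses
resolver.inventory  # served from cache
cache.clear()       # simulate the inventory being dropped from cache
resolver.inventory  # parses everything again
```

In this sketch any eviction sends the next request through the full parse, which is the situation that moving the collection metadata into a persistent store is meant to avoid.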

sonofmun commented 7 years ago

All these new functions should be unit tested.

PonteIneptique commented 6 years ago

Partially implemented in #68
