Discussion: Proposed changes to storage api for 1.0

jdccdevel commented 10 years ago

I'm writing a custom storage backend, and the more I dig into the storage API, the more I see opportunities for improved efficiency.

This seems like a reasonable forum to discuss those improvements.

For example, several places in the code, I've seen this: "if name in (item.name for item in collection.items)"

I'd like to start by suggesting a "has_item(name)" method be added to the "Collection" object to replace this. Alternatively, an "item_names" property could be created, but I believe that would be less efficient.

The code right now is REALLY inefficient, as it requires the storage backend to load the ENTIRE collection, just to test if the collection contains a particular item name. This doesn't allow for any indexing or caching of item names, and hinders any efficiency improvements possible with databases.

This code pattern is repeated with the headers and properties, but it's less of an issue there due to their relatively small size.

untitaker commented 10 years ago

:+1:

Maybe a baseclass should be introduced for storages, implementing methods like has_item with inefficient implementations, which then can be overridden by the storages if appropriate.

xs4 commented 10 years ago

The new lmdbstore backend that I wrote would also greatly benefit from this change. (see https://github.com/Kozea/Radicale/pull/131)

jdccdevel commented 10 years ago

The more I work with the storage code, the more I believe the code needs to be completely refactored. The existing "Collection" base-class tries to do what @untitaker is suggesting, but poorly.

I think the best bet would be to provide 2 new base-classes, both of which implement the python "Dictionary API", to take the place of the current "Collection" class. The Dictionary api provides a clear and familiar api to implement, limiting confusion. Using the python "mixin" helper classes, a basic implementation should be simple. That said, we would probably want to add some transactional methods for persistence (i.e. "save()" or similar)

The first class would be a dictionary of "Collections". This class would provide an abstraction layer similar to the filesystem. It would make resolving requests for a particular item to the correct collection, and returning that collection, much simpler. The constructor for this object should require the "username" of the user accessing radicale (This would allowing for enforced permissions, auditing, etc.)

The second class would the "Collection" stored in the first class. This object would be a dictionary of Items, referenced by their "name". The collection would also contain a "Headers" dictionary, and a "properties" dictionary.

I have little issue with the "Item" classes, except to say that "Headers" shouldn't be items. Each "Item" object should have a name, tag, and text (possibly mime-type, but that can be calculated from the tag). That seems to work quite well now, and I don't see a need for more complexity than that.

liZe commented 10 years ago

@jdccdevel I definitely agree with you (one detail: we must use "ordereddict"s, not "dict"s). We'll probably need to add some public methods or allow to override ordereddict's ones to allow simple overriding possibilities without rewriting the whole classes, but your idea is the way to go.

For the record, #14 is about locking, it may help for transactional methods.

jdccdevel commented 10 years ago

After re-reading the python docs, a straightforward "Mapping" should be all we need (maintaining order would be the responsibility of the backend.) See: http://docs.python.org/2/reference/datamodel.html#emulating-container-types for detail about the methods we would have to implement.

@liZe Transactions and locking are definitely an issue... especially for filesystem style collections. Honestly, I think that the best way to handle transactions would be through managing the lifetime of the "Collection" and "Item" classes by accessing them through a context manager. The "with" statement would be used for accessing collections and items in the radicale code. The main advantages to this are we don't have to override the "Item" base-class in the backend, and it allows for simple, easily implemented, collection locking (per-item locking too, if wanted). The idea would be that the context-manager(s) would update the persistent storage when execution leaves the context. We could have the mapping return a context-manager, and we could add separate context-manager methods. Explicit context-manager methods might be useful for different kinds of locking and better concurrency. (i.e. read-only locks, when we know the code isn't going to change anything.)

I'm thinking the code would look something like (Assume the "Storage" class is the dictionary of collections described above):

"""
Storage is the main class from the plugin, 
most Storage parameters initialized from the config
"""
storage = Storage(username)
with storage["user/calendar.ics"] as collection:
    """ transaction started """
    with collection[item_name] as item:
        """ Item locked """
        item.tag = new_tag
        item.text = new_text
    """ Transaction updated with new item values """
""" Transaction commited, or rolled back if exception occurred """
with storage.read_context("user/callendar.ics") as ro_collection:
    """ ro_collection is locked read-only, any change will throw a exception """
""" read-only Transaction Closed """

There are several other ways we could approach Items, but I don't think they're as elegant: 1) Items could be independent copies of the data in the collection. In this instance, Changing them directly doesn't change the collection itself. (i.e. Items in the collection must be set explicitly.) The advantage I see in this is reduced transactional complexity. Also, the "Item" class and all subclasses would be common across all backends, simplifying the implementation of new storage backends. The disadvantage to this is some code constructs wouldn't behave as might be expected. i.e.

""" This wouldn't work as expected if __getitem__() returns a copy"""
collection["abcdefg"].text = some_text

2) Items are the data in the collection. The advantage here is that code constructs like the one above would do what is expected. The problem with this is that we're overriding a base-class, which is somewhat awkward, and updates might be too fine-grained for good efficiency with some backends.

untitaker commented 9 years ago

I've been using this interface for my sync client without many problems: https://github.com/untitaker/vdirsyncer/blob/master/vdirsyncer/storage/base.py

untitaker commented 9 years ago

The main issue I have with the Radicale storage API is that the whole collection is loaded and written. This has no real performance implications for the filesystem storage, but for multifilesystem, item creation and update should be implementable without updating every item in the collection.

liZe commented 8 years ago

@untitaker your API is wonderful, thanks a lot!

Kozea / Radicale

Discussion: Proposed changes to storage api for 1.0 #130