drexlerd / Loki

PDDL memory representation with regard to usability and multi-threaded applications #11

Open drexlerd opened 8 months ago

drexlerd commented 8 months ago

There are good reasons to move towards value semantics as opposed to pointer semantics for storing PDDL objects. 1) We don't need virtual base classes, but we currently use them, e.g., for conditions and effects. 2) Cache locality: with a combination of handles and views we can get better cache locality. A PDDL object will then be composed of Views to make it easy to access the data members.

A Handle could be implemented as follows:

template<typename T>
class Handle {
    private:
        int m_identifier;

    public:
        Handle(int identifier) : m_identifier(identifier) { }

        int get_identifier() const { return m_identifier; }
};

A View could be implemented as follows:

template<typename T>
class VectorByHandle;  // defined below

template<typename T>
class View {
    private:
        Handle<T> m_handle;
        const VectorByHandle<T>* m_vector;

    public:
        View(Handle<T> handle, const VectorByHandle<T>& vector) : m_handle(handle), m_vector(&vector) { }

        Handle<T> get_handle() const { return m_handle; }
        const T& get_data() const { return (*m_vector)[m_handle]; }
};

A VectorByHandle could be implemented as follows:

#include <utility>
#include <vector>

template<typename T>
class VectorByHandle {
    private:
        std::vector<T> m_vector;

    public:
        VectorByHandle() = default;

        // delete copy/move to not invalidate the pointer to the vector in a View
        VectorByHandle(const VectorByHandle& other) = delete;
        VectorByHandle& operator=(const VectorByHandle& other) = delete;
        VectorByHandle(VectorByHandle&& other) = delete;
        VectorByHandle& operator=(VectorByHandle&& other) = delete;

        // Insert a value and hand out a handle to it
        Handle<T> push_back(T value) {
            m_vector.push_back(std::move(value));
            return Handle<T>(static_cast<int>(m_vector.size()) - 1);
        }

        const T& operator[](Handle<T> handle) const { return m_vector[handle.get_identifier()]; }
};
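
For illustration, a minimal sketch of how the three pieces could fit together, assuming the snippets above are combined in one file; the Predicate struct is just a placeholder, not a Loki type:

#include <iostream>
#include <string>

struct Predicate {
    std::string name;
    int arity;
};

int main() {
    VectorByHandle<Predicate> predicates;

    // Objects live contiguously in the vector; callers only hold cheap integer handles.
    Handle<Predicate> at = predicates.push_back(Predicate{"at", 2});

    // A View bundles the handle with its container for convenient member access.
    View<Predicate> view(at, predicates);
    std::cout << view.get_data().name << "/" << view.get_data().arity << "\n";
}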
jendrikseipp commented 8 months ago

I don't know enough about the details to give concrete advice, but I'd say for a parser the main goal is not speed but legibility of code and ease of maintenance. You'll have a hard time finding tasks that take more than a second to parse. So I'd use simple, flat data structures wherever possible.

drexlerd commented 8 months ago

We don't want to use Loki just as a parser. The final representation is supposed to be usable in a planner or other applications. The basic idea is to write proxies for each PDDL object with potentially extended functionality. There will be no need to implement low-level details again.

namespace mimir {

class SomeType {
private:
    const loki::pddl::SomeType* external_;

    // add attributes for additional representation changes if needed

public:
    explicit SomeType(const loki::pddl::SomeType* external) : external_(external) {}

    // add additional functionality or forwarding, all must be const for multi-threading purposes
};

}

Currently, loki::pddl::SomeType is a non-owning raw pointer. In other words, it is a type that provides operator overloads for operator* and operator-> to access the underlying data. A better allocation strategy under the hood is something that we might want to change in the future. For example, we could profit from a better memory layout, such as the proposal above, which is comparable to the SegmentedVector used in Fast Downward but a bit less safe because references can become invalidated.

Testing whether an object already exists with a vector is too costly. We currently use a std::unordered_set for testing uniqueness in a factory, which results in objects of each type floating around in memory, see here. We could add an additional SegmentedVector into which uniquely constructed elements are inserted, and return a pointer to that element instead of a pointer to the element in the std::unordered_set. This should take about 5 lines of code for potentially better cache locality at almost zero cost, since the insertion into the vector only occurs when the object is uniquely constructed. That is something that we can easily profile and try out later.
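
A rough sketch of that factory change (illustrative only: std::deque stands in for a pointer-stable segmented container, and UniqueFactory/get_or_create are made-up names, not Loki's API; T must support std::hash and operator==):

#include <deque>
#include <unordered_map>
#include <utility>

template<typename T>
class UniqueFactory {
private:
    std::unordered_map<T, const T*> m_index;  // uniqueness test, as today
    std::deque<T> m_storage;                  // segmented storage; push_back never moves elements

public:
    const T* get_or_create(T object) {
        auto it = m_index.find(object);
        if (it != m_index.end()) {
            return it->second;  // already constructed: no insertion needed
        }
        // First construction: place the canonical copy in segmented memory
        // and hand out a pointer into it.
        m_storage.push_back(std::move(object));
        const T* canonical = &m_storage.back();
        m_index.emplace(*canonical, canonical);
        return canonical;
    }
};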

drexlerd commented 8 months ago

A little more clarification.

The main issue with the above is that dereferencing a loki::pddl::SomeType* is several orders of magnitude slower if the pointee is not in the cache. We want accessing data members to be as cheap as in the value-semantics case, where loki::pddl::SomeType objects are stored contiguously in memory, e.g., in a std::vector<loki::pddl::SomeType>. The idea is to store N objects contiguously and persistently in memory using a data structure like the following:

#include <array>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

template<typename T, size_t N>
class SegmentedPersistentVector {
private:
    // Blocks are heap-allocated individually so that growing m_data
    // never moves existing elements; pointers into blocks stay valid.
    std::vector<std::unique_ptr<std::array<T, N>>> m_data;

    int m_block_index;
    int m_index_in_block;

    size_t m_size;
    size_t m_capacity;

    void increase_capacity() {
        // Add an additional block with capacity N (one allocation per block)
        m_data.push_back(std::make_unique<std::array<T, N>>());
        // Move to the next free block
        ++m_block_index;
        // Set index to the next free position in the block
        m_index_in_block = 0;
        // Increase total capacity
        m_capacity += N;
    }

public:
    SegmentedPersistentVector() : m_block_index(-1), m_index_in_block(0), m_size(0), m_capacity(0) { }

    T* push_back(T value) {
        // Increase capacity if necessary
        if (m_size >= m_capacity) {
            increase_capacity();
        }

        auto& block = *m_data[m_block_index];

        // Take ownership of the value
        block[m_index_in_block] = std::move(value);
        // Fetch the return value
        T* return_value = &block[m_index_in_block];
        // Set index to the next free position in the block
        ++m_index_in_block;

        ++m_size;

        return return_value;
    }

    size_t size() const {
        return m_size;
    }

    size_t capacity() const {
        return m_capacity;
    }
};
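
For example, the returned pointers stay valid across later insertions, a property that a plain std::vector does not give you:

#include <cassert>
#include <string>

int main() {
    SegmentedPersistentVector<std::string, 2> vec;

    std::string* first = vec.push_back("pick-up");
    for (int i = 0; i < 100; ++i) {
        vec.push_back("stack");  // allocates several new blocks along the way
    }

    // Blocks are never moved, so the early pointer is still valid.
    assert(*first == "pick-up");
    assert(vec.size() == 101);
}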
jendrikseipp commented 8 months ago

Just some words of caution: the Segmented* data structures have been designed to hold large amounts of data, i.e., hundreds of MB. For Loki, it seems that a simple non-owning pointer design or plain references is the easiest and most maintainable solution.

Optimizing the Loki data structures for cache locality sounds like premature optimization. It only has an effect if all the other operations in a planner/validator/etc. have negligible runtime.

I think it makes sense to develop the library in tandem with a use case, such as a plan validator. Then you'll directly see which API to offer and where optimizations are needed (if at all).

drexlerd commented 8 months ago

Yes, I fully agree.

Now, I understand that we can get pretty close to value semantics through pointers in the API.

I do not plan to add these optimizations now. I am just exploring a bit, trying to better understand the boundaries of the current design and API, so that these kinds of low-level optimizations remain possible later without rewriting everything from scratch or breaking the API.

My main objective is to support multi-threaded applications for learning and planning in the future; that is my highest priority to optimize for.

jendrikseipp commented 8 months ago

Multi-threaded learning and planning indeed sounds exciting!

drexlerd commented 8 months ago

The current approach won't scale well to multi-threaded applications because of heap contention. Ideally, we want memory allocations and deallocations to occur very rarely. The main programming paradigm to address this is to work with pre-allocated buffers so that memory is allocated or deallocated only rarely. I plan to start using the FlatBuffers library from Google, which allows compact serialization into buffers and cheap unpacking from a pointer into the buffer without any heap allocations at all. Another advantage is that we obtain better data locality, which can decrease the number of cache misses and further improve performance.
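
To make this concrete, here is a minimal FlatBuffers sketch; the Atom schema and the generated names (CreateAtom, GetAtom, atom_generated.h) are hypothetical and not part of Loki:

// Hypothetical schema (atom.fbs), compiled with `flatc --cpp atom.fbs`:
//
//   table Atom {
//     predicate: string;
//     terms: [string];
//   }
//   root_type Atom;

#include "atom_generated.h"  // hypothetical generated header
#include <flatbuffers/flatbuffers.h>

int main() {
    // The builder writes into one growable buffer; reusing a builder
    // per thread keeps heap allocations rare.
    flatbuffers::FlatBufferBuilder builder;

    auto predicate = builder.CreateString("at");
    auto terms = builder.CreateVectorOfStrings({"truck1", "city2"});
    builder.Finish(CreateAtom(builder, predicate, terms));

    // "Deserialization" just reinterprets the buffer: no unpacking,
    // no heap allocations.
    const Atom* atom = GetAtom(builder.GetBufferPointer());
    (void)atom->predicate()->c_str();

    // Reset the builder to reuse its buffer for the next object.
    builder.Clear();
}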

drexlerd commented 8 months ago

I checked out other libraries that are capable of serialization and zero-copy deserialization with zero heap allocations. An alternative is Cap'n Proto. The APIs are quite similar. I lean towards FlatBuffers because it offers a little more flexibility. It will be more complicated for users of Loki to work with this data structure, but it will be much more efficient. Zero allocations are what we need in multi-threaded environments to scale well with the number of cores.

I will add a separate flat_pddl namespace and keep the more user-friendly pddl data structures that we currently have.