Cytnx-dev / Cytnx

Project Cytnx, A Cross-section of Python & C++, Tensor network library
Apache License 2.0

Cytnx Storage Refactor Proposal #500

Open IvanaGyro opened 5 days ago

IvanaGyro commented 5 days ago

Introduction

Per offline discussions (not on GitHub) on 2024.10.03 and 2024.11.06, some chapters of this proposal were removed from the original unpublished draft. Only the chapters that reached consensus remain.

This proposal outlines a plan for refactoring the Storage and Tensor classes, along with their respective implementation classes. The goal of this refactoring is to enhance the efficiency of the program and minimize human errors. Please note that this proposal is not finalized and does not constitute a definitive list of items to be implemented. We encourage discussions regarding any details of the proposed refactoring plan.

Importantly, this proposal DOES NOT alter the APIs for Python.

Advantages

Refactoring takes time, so there is no reason to refactor the code without clear benefits. Here are the advantages we gain by applying the plan described in the following sections.

  1. Reduced Code Volume: A smaller codebase will facilitate easier comprehension for newcomers.
  2. Shorter Build Times: A reduced codebase will lead to faster build times.
  3. Elimination of boost::intrusive_ptr Dependencies: Currently, we utilize two components from Boost: intrusive_ptr and unordered_map. By replacing boost::unordered_map with std::unordered_map, we can eliminate our dependency on Boost.
  4. Decreased Human Error: By encapsulating the Storage class, we can lower the likelihood of contributors inadvertently accessing memory incorrectly.
  5. Enabled Compile-Time Checks: The proposed changes allow more bugs to be identified at compile time and let the compiler manage type conversions effectively.
  6. Flexibility for Other Backends: While we currently support PyTorch as a backend, the current design binds the UniTensor class to PyTorch's API. This proposal aims to decouple them.

Class Diagrams

Below are class diagrams illustrating the current and proposed designs. Both diagrams omit specifiers such as const, virtual, etc., for clarity.

---
title: Current Design
---
classDiagram
  class Storage {
    +Init(...) void
    +append~T~(...) void
    +at~T~(...) T
    +to_(int device) void
    +to(int device) intrusive_ptr~Storage_base~
    +fill(const T &val) void
    +static from_vector(...) Storage
    +boost::intrusive_ptr~Storage_base~ _impl
  }
  class Storage_base {
    +Init(...) void
    +_Init_byptr(...) void
    +_create_new_sametype() intrusive_ptr~Storage_base~
    +clone() intrusive_ptr~Storage_base~
    +Move_memory(...) intrusive_ptr~Storage_base~
    +Move_memory_(...) void
    +to_(int device) void
    +to(int device) intrusive_ptr~Storage_base~
    +PrintElem_byShape(...) void
    +print_elems() void
    +real() intrusive_ptr~Storage_base~
    +imag() intrusive_ptr~Storage_base~
    +intrusive_ptr~Storage_base~ _impl
    +void *Mem
    +unsigned long long len
    +unsigned long long cap
    +unsigned int dtype
    +int device
  }
  class ComplexDoubleStorage
  class FloatStorage
  class intrusive_ptr_base~T~
  namespace boost {
    class intrusive_ptr~T~
  }
  class Tensor {
    +intrusive_ptr~Tensor_impl~ _impl
    +Init(...) void
  }
  class Tensor_impl {
    -Storage _storage
    Init(...) void
    +to_(int device) void
    +to(int device) intrusive_ptr~Tensor_impl~
    +at~T~(locator) T &
    +at(locator) Sproxy
  }
  class UniTensor {
    +intrusive_ptr~UniTensor_base~ _impl;
  }
  class UniTensor_base {

  }
  class BlockUniTensor {
    +vector~Tensor~ _blocks
  }
  class DenseUniTensor {
    +Tensor _block
  }

  Storage "1..*" o-- "1" Storage_base
  intrusive_ptr_base~Storage_base~ <|-- Storage_base
  intrusive_ptr_base~T~ <.. intrusive_ptr~T~
  Storage_base <|-- ComplexDoubleStorage
  Storage_base <|-- FloatStorage
  Tensor "1..*" o-- "1" Tensor_impl
  intrusive_ptr_base~T~ <|-- Tensor_impl
  Tensor_impl "1" o-- "1" Storage
  intrusive_ptr_base~Tensor_impl~ <|-- Tensor_impl
  intrusive_ptr_base~UniTensor_base~ <|-- UniTensor_base
  UniTensor "1..*" o-- "1" UniTensor_base
  UniTensor_base <|-- BlockUniTensor
  UniTensor_base <|-- DenseUniTensor
  BlockUniTensor "1" *-- "0..*" Tensor
  DenseUniTensor "1" *-- "1" Tensor

And here is the class diagram of the new design. The diagram shows only what changes; the levels above class UniTensor are not affected. Each interface shows only the methods that must be implemented to bridge other backends.

---
title: New Design
---
classDiagram
  class intrusive_ptr_base~T~
  namespace boost {
    class intrusive_ptr~T~
  }

  class StorageBase {
    <<Interface>>
    +data() void*
    +size() size_t
    +capacity() size_t
    +resize(size_t size) void
    +reserve(size_t size) void
    +dtype() int
    +device() int
    +clone() unique_ptr~StorageBase~
    +to(int device) unique_ptr~StorageBase~
  }
  class Storage {
    +append(cytnx_variant value) void
    +at(...) cytnx_reference_variant
    +clone() unique_ptr~StorageBase~
    +to(int device) unique_ptr~StorageBase~
    +fill(cytnx_variant value) void
    -unique_ptr~StorageBase~ implementation_
  }
  class StorageImplementation~T, Allocator~ {
    +clone() unique_ptr~StorageBase~
    +to(int device) unique_ptr~StorageBase~
    +friend ostream& operator<<() void
  }
  class StorageFactory {
    <<Interface>>
    +MakeStorage(int dtype, int device): unique_ptr<StorageBase>
  }
  class StorageFactoryImplementation {
    +MakeStorage(int dtype, int device): unique_ptr<StorageBase>
  }
  class TensorBase {
    <<Interface>>
    +at(locator) cytnx_reference_variant
    +reshape_(shape) TensorBase&
    +contiguous_() TensorBase&
    +permute_(vector~size_t~ dimensions) TensorBase&
    +shape() const vector~size_t~&
    +dtype() int
    +device() int
    +clone() unique_ptr~TensorBase~
    +to(int device) unique_ptr~TensorBase~
    -data() void*
    -HasContiguousIterator() bool
    %% do not force to have an implementation of Storage
  }
  class Tensor {
    +real() Tensor
    +imag() Tensor
    +clone() unique_ptr~TensorBase~
    +to(int device) unique_ptr~TensorBase~
    -unique_ptr~TensorBase~ implementation_
  }
  class TensorImplementation~T, Allocator~ {
    +clone() unique_ptr~TensorBase~
    +to(int device) unique_ptr~TensorBase~
    +at(locator) T
    -StorageImplementation~T, Allocator~ storage_
  }
  class TensorFactory {
    <<Interface>>
    +MakeTensor(int dtype, int device): unique_ptr<TensorBase>
  }
  class TensorFactoryImplementation {
    +MakeTensor(int dtype, int device): unique_ptr<TensorBase>
  }
  class UniTensor {
    +intrusive_ptr~UniTensor_base~ _impl;
  }
  class UniTensor_base {
  }
  class BlockUniTensor {
    +vector~Tensor~ _blocks
  }
  class DenseUniTensor {
    +Tensor _block
  }

  intrusive_ptr_base <.. intrusive_ptr
  StorageBase <|.. Storage
  StorageBase <|.. StorageImplementation
  Storage "1" *-- "1" StorageImplementation
  StorageFactory <|.. StorageFactoryImplementation
  StorageFactory <-- Storage
  TensorBase <|.. Tensor
  TensorBase <|.. TensorImplementation
  Tensor "1" *-- "1" TensorImplementation
  TensorImplementation "1" *-- "1" StorageImplementation
  TensorFactory <|.. TensorFactoryImplementation
  TensorFactory <-- Tensor
  intrusive_ptr_base~UniTensor_base~ <|-- UniTensor_base
  UniTensor "1..*" o-- "1" UniTensor_base
  UniTensor_base <|-- BlockUniTensor
  UniTensor_base <|-- DenseUniTensor
  BlockUniTensor "1" *-- "0..*" Tensor
  DenseUniTensor "1" *-- "1" Tensor

Proposed Changes

Templatization of Implementations of Storage and Tensor

The current design repeatedly defines implementations for Storage, resulting in numerous duplicated functions. Additionally, Tensor_impl lacks type information, preventing compile-time type checking and leading to runtime checks that can degrade performance.

class ComplexDoubleStorage : public Storage_base {};
class FloatStorage : public Storage_base {};
...
// Tensor_impl doesn't provide the information about the type of data it contains to the compiler.
class Tensor_impl {
private:
  Storage _storage;
};

The new design templatizes the implementations of Storage and Tensor to enable type checking at compile time. The compiler can also determine whether one type converts to another. The new design additionally makes the containers for CPU and GPU distinct types. By separating them, we can simply use an STL container such as std::vector to manage the memory, and it leaves the door open to using thrust::device_vector to manage memory on both CPU and GPU.

template <typename T, typename Allocator>
class StorageImplementation {
private:
  std::vector<T, Allocator> storage_;
};

template <typename T, typename Allocator>
class TensorImplementation {
private:
  StorageImplementation<T, Allocator> storage_;
};

// Allocating the memory on CPU
using CpuInt64Storage = StorageImplementation<cytnx_int64, std::allocator<cytnx_int64>>;
// Allocating the memory on GPU
using GpuInt64Storage = StorageImplementation<cytnx_int64, GpuAllocator<cytnx_int64>>;

Encapsulation of All Classes

Using class Storage_base as an example:

class Storage_base {
public:
  void *Mem;
  unsigned long long len;
  unsigned long long cap;
  unsigned int dtype;
  int device;
  intrusive_ptr<Storage_base> _impl;
};

Making member variables public increases the risk of accessing memory illegally.

void foo(Storage_base& storage) {
  // A contributor forgets to update the number of values contained.
  void* old_ptr = storage.Mem;
  storage.Mem = malloc(2 * sizeof(double) /* size corresponding to dtype */);
  free(old_ptr);
}

// The other contributor uses `foo()`
Storage_base storage; // contains 10 values
foo(storage);
std::cout << reinterpret_cast<double*>(storage.Mem)[9]; // illegal access: only 2 values remain

Besides, both the Google C++ Style Guide and the C++ Core Guidelines prefer private member variables.

Here is a change for class Storage_base.

template <typename T, typename Allocator>
class StorageImplementation : public StorageBase {
public:
  StorageImplementation(int dtype, int device) : dtype_(dtype), device_(device) {}

  int dtype() { return dtype_; }
  int device() { return device_; }
  size_t size() { return storage_.size(); }         // replaces `len`
  T* data() { return storage_.data(); }             // replaces `Mem`
  size_t capacity() { return storage_.capacity(); } // replaces `cap`
  // Use the encapsulated interface to modify the values of `len`, `cap`, and `Mem`.
  void resize(size_t count) { storage_.resize(count); }

private:
  int dtype_;
  int device_;
  std::vector<T, Allocator> storage_;
};

The class diagram of the new design will be updated to reflect the member variables whose access control changes or that are removed.

Creation of Interfaces: class StorageBase and class TensorBase

The introduction of these interfaces allows for a cleaner separation between interface definitions and their implementations, enhancing flexibility for future integrations with other libraries. There are two options for implementation.

Added on 2024.11.08: This proposal asks developers to implement the type-less StorageBase and TensorBase when integrating with other libraries. However, following #496, it may also be possible to ask developers to implement typed StorageBase<typename> and TensorBase<typename>.

Option 1 (Not Selected)

This option proposed making existing classes base classes for implementations from other libraries but presented several issues including forward declaration requirements and mixed responsibilities.

class StorageFactory; // Need forward declaration here.

class Storage {
public:
  Storage(int dtype, int device, StorageFactory factory = StorageFactoryImplementation()) {
    implementation_ = factory.MakeStorage(dtype, device);
  }

  virtual ~Storage() = default;

  virtual cytnx_reference_variant at(size_t position) {}

protected:    
  Storage() = default; // leave `implementation_` null

private: 
  std::unique_ptr<Storage> implementation_;
};

class StorageFactory {
public:
  virtual std::unique_ptr<Storage> MakeStorage(int dtype, int device) = 0;
};

class CustomUint64Storage : public Storage {
public:
  // Can omit ":Storage()" because Storage::Storage() is the default constructor.
  CustomUint64Storage():Storage() {}

  // Is overriding `at()` neccessary? It is not clear because we cannot mark any
  // member function of `class Storage` as a pure virtual function.
private:
  StorageFromOtherLibrary storage_;
};

class StorageFactoryForOtherLibrary : public StorageFactory {
public:
  std::unique_ptr<Storage> MakeStorage(int dtype, int device) override {
    return std::make_unique<CustomUint64Storage>();
  }
};

// usage:
Storage storage(Type.Uint64, Device.cpu, StorageFactoryForOtherLibrary());
// storage.implementation_ is an instance of `class CustomUint64Storage`

There are three problems with this option.

  1. A forward declaration is required, which hides a dependency and violates the style guide.
  2. The interface is mixed with its representation. A user who wants to provide a backend from another library cannot easily tell which functions must be overridden.
  3. Every implementation class, like class CustomUint64Storage, contains the unused member variable implementation_, which wastes memory.

Option 2 (Selected)

The selected option introduces abstract interfaces (class StorageBase and class TensorBase) that must be implemented by any concrete storage or tensor class:

class StorageBase {
  ...
};
class Storage: public StorageBase {
public:
  Storage(int dtype, int device, StorageFactory factory = StorageFactoryImplementation()) {
    implementation_ = factory.MakeStorage(dtype, device);
  }

  ~Storage() = default;

  cytnx_reference_variant at(size_t position) override {}

protected:    
  // With option 2, the default constructor is not needed.
  // Storage() = default;

private:
  std::unique_ptr<StorageBase> implementation_;
};

class StorageFactory {
public:
  virtual std::unique_ptr<StorageBase> MakeStorage(int dtype, int device) = 0;
};

class CustomUint64Storage : public StorageBase {
public:
  CustomUint64Storage() {}

private:
  StorageFromOtherLibrary storage_;
};

class StorageFactoryForOtherLibrary : public StorageFactory {
public:
  std::unique_ptr<StorageBase> MakeStorage(int dtype, int device) override {
    return std::make_unique<CustomUint64Storage>();
  }
};

// usage:
void foo(const StorageBase& storage) {
  // do something with the storage
}

Storage storage(Type.Uint64, Device.cpu, StorageFactoryForOtherLibrary());
// storage.implementation_ is an instance of `class CustomUint64Storage`
foo(storage);
foo(CustomUint64Storage()); // Preferred: using the concrete class makes the contained data type explicit.

Except for the public member functions this proposal proposes to remove, the public member functions of class StorageBase and class TensorBase will be copied from the current class Storage and class Tensor respectively. Below are the pure virtual functions of class StorageBase and class TensorBase. These pure virtual functions must be overridden by the implementations.

class StorageBase {
public:
  virtual void* data() = 0;
  virtual size_t size() = 0;
  virtual size_t capacity() = 0;
  virtual void resize(size_t size) = 0;
  virtual void reserve(size_t size) = 0;
  virtual int dtype() = 0;
  virtual int device() = 0;
  virtual std::unique_ptr<StorageBase> clone() = 0;
  virtual std::unique_ptr<StorageBase> to(int device) = 0;
};

class TensorBase {
public:
  virtual cytnx_reference_variant at(locator) = 0;
  virtual TensorBase& reshape_(shape) = 0;
  virtual TensorBase& contiguous_() = 0;
  virtual TensorBase& permute_(std::vector<size_t> dimensions) = 0;
  virtual const std::vector<size_t>& shape() = 0;
  virtual int dtype() = 0;
  virtual int device() = 0;
  virtual std::unique_ptr<TensorBase> clone() = 0;
  virtual std::unique_ptr<TensorBase> to(int device) = 0;

private:
  virtual void* data() = 0;
  virtual bool HasContiguousIterator() = 0;
};

There is no function for getting Storage from TensorBase because we don't expect other libraries to have a component equivalent to Storage. Under this assumption, we have to change the arguments of the functions consuming class Storage to take iterators and raw pointers instead.

The reason for adding HasContiguousIterator() is to support tensor implementations that store data in a multi-dimensional data structure. We may consider providing alternative, less efficient algorithms, like Exp(), for that kind of tensor implementation. If we only want to support tensor implementations that store data in a one-dimensional array, we should consider replacing the whole Storage component with vector<T>, because the functionality of Storage is almost identical to that of vector.

Discussion on Removal of Types

To facilitate type elimination in the C++ API, it is crucial to retain the clone() functions, as well as the class StorageBase, class TensorBase, class Storage, and class Tensor. If we only remove types during the Python binding process, these components will no longer be necessary, resulting in a cleaner and more straightforward codebase.

The usage of clone() will be further discussed in the next section.

Making Storage Blind to Shapes

Another area for improvement is how Storage interacts with shapes. The Storage class currently relies on shape information that is passed to its member functions by the caller, which can lead to implicit coupling between storage management and shape handling.

To foster better separation of concerns, we propose moving the functions below from Storage into Tensor.

void GetElem_byShape_v2(
    boost::intrusive_ptr<Storage_base> &out,
    const std::vector<cytnx_uint64> &shape,
    const std::vector<std::vector<cytnx_uint64>> &locators,
    const cytnx_uint64 &Nunit);
void GetElem_byShape(
    boost::intrusive_ptr<Storage_base> &out,
    const std::vector<cytnx_uint64> &shape,
    const std::vector<cytnx_uint64> &mapper,
    const std::vector<cytnx_uint64> &len,
    const std::vector<std::vector<cytnx_uint64>> &locators);
void SetElem_byShape(
    boost::intrusive_ptr<Storage_base> &in,
    const std::vector<cytnx_uint64> &shape,
    const std::vector<cytnx_uint64> &mapper,
    const std::vector<cytnx_uint64> &len,
    const std::vector<std::vector<cytnx_uint64>> &locators,
    const bool &is_scalar);
void SetElem_byShape_v2(
    boost::intrusive_ptr<Storage_base> &in,
    const std::vector<cytnx_uint64> &shape,
    const std::vector<std::vector<cytnx_uint64>> &locators,
    const cytnx_uint64 &Nunit,
    const bool &is_scalar);
virtual boost::intrusive_ptr<Storage_base> Move_memory(
    const std::vector<cytnx_uint64> &old_shape,
    const std::vector<cytnx_uint64> &mapper,
    const std::vector<cytnx_uint64> &invmapper);
virtual void Move_memory_(
    const std::vector<cytnx_uint64> &old_shape,
    const std::vector<cytnx_uint64> &mapper,
    const std::vector<cytnx_uint64> &invmapper);
virtual void PrintElem_byShape(
    std::ostream &os,
    const std::vector<cytnx_uint64> &shape,
    const std::vector<cytnx_uint64> &mapper = {});

With the new design, we can consider eliminating the entire Storage component, as it essentially functions as a one-dimensional array. This functionality can be effectively replaced by using std::vector or thrust::device_vector as a member variable within the Tensor component. By making this change, we can streamline the implementation and reduce the complexity of the codebase.

Steps

Following these steps will enable the compiler to detect more mistakes made during the refactoring process.

ianmccul commented 5 days ago

Looks good to me -- some minor points:

  1. StorageImplementation doesn't need to store the dtype, since that is just the index of T in the enumeration, and it certainly shouldn't be stored as an int that could get accidentally changed. In my sketch in #496 I have a constexpr templated variable, so StorageImplementation<T, Allocator>::dtype() could be implemented simply as { return dtype_index<T>; }.

  2. Where the data is stored (CPU, GPU etc) is effectively part of the type of StorageImplementation via the Allocator parameter, but only indirectly. It might be better to make this more explicit and extensible, with the idea that other storage devices might be added in the future, eg it might be 'stored' on another node of an MPI calculation, or on disk.

  3. Storage_base should definitely be a template, even if nothing else is. Replace the void* with a T*, or better still, a templated type. Something to consider: is it possible to add a storage type that isn't directly accessible via a pointer? The Thrust library uses device_ptr<T>, which is what ought to be used to denote the underlying storage. You really shouldn't reinterpret_cast between things like a device_ptr<T> and void* (do they even have the same size?!)

  4. I don't understand the Storage class in Option 2, it contains a std::unique_ptr<Storage>, i.e. a pointer to its own type? How does that work? Is the intent that Storage contains a std::unique_ptr<StorageBase> ? In that case, Storage itself is effectively a wrapper around the pointer to StorageBase, and shouldn't inherit from StorageBase itself.

  5. Similarly, Storage_base contains intrusive_ptr<Storage_base> _impl;, which I think cannot be intended.

IvanaGyro commented 5 days ago

Thanks to @ianmccul for commenting on this proposal.

  1. I will include the implementation in #496
  2. Because this proposal uses std::vector to store data, the type of the allocator must be decided at compile time. Currently, the memory accessed by the GPU is allocated with the Unified Memory Access (UMA) feature, and CUDA handles the transfer between CPU and GPU. Therefore, we only have to assign an allocator designed for CUDA at compile time to support both CPU and GPU computation. If we want to decide the allocator type at runtime, the implementation will be more complex. It can be a feature in the future.
  3. and 5. Storage_base is the current implementation. It will be dropped.
  4. You are correct. There is a typo.