apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

NativeDatasetFactory parameter allocator and memory pool relationship #37458

Open zinking opened 1 year ago

zinking commented 1 year ago

Describe the usage question you have. Please include as many useful details as possible.

@zhztheplayer

  /**
   * Constructor.
   *
   * @param allocator a context allocator associated with this factory. Any buffer that will be created natively will
   *                  be then bound to this allocator.
   * @param memoryPool the native memory pool associated with this factory. Any buffer created natively should request
   *                   for memory spaces from this memory pool. This is a mapped instance of c++ arrow::MemoryPool.
   * @param datasetFactoryId an ID, at the same time the native pointer of the underlying native instance of this
   *                         factory. Make sure in c++ side  the pointer is pointing to the shared pointer wrapping
   *                         the actual instance so we could successfully decrease the reference count once
   *                         {@link #close} is called.
   * @see #close()
   */
  public NativeDatasetFactory(BufferAllocator allocator, NativeMemoryPool memoryPool, long datasetFactoryId) {
    this.allocator = allocator;
    this.memoryPool = memoryPool;
    this.datasetFactoryId = datasetFactoryId;
  }

What is the relationship between allocator and memoryPool? Does the allocator allocate its buffers from the memory pool, or is there no such restriction?

thanks

Component(s)

Java

zhztheplayer commented 1 year ago

If I remember correctly, memoryPool is what allocates the native data buffers for this dataset.

allocator doesn't actually allocate memory for the data. Instead, it takes ownership of the buffers allocated by memoryPool when those buffers are transferred to the Java side.
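
A toy model may make that split less abstract. The sketch below is plain Java and does not use Arrow's classes: ToyMemoryPool, ToyAllocator and ToyBuffer are hypothetical names standing in for the native memory pool, the Java-side allocator and a transferred buffer. It only illustrates the idea described above, namely that the pool does the allocating while the allocator does the ownership bookkeeping. How release is actually wired up inside Arrow is not covered here, so the close() path is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these classes are Arrow's. They mimic the split
// described above (the pool allocates, the allocator only takes ownership).
class OwnershipSketch {

    /** Hypothetical stand-in for the native pool: the only place memory is "allocated". */
    static class ToyMemoryPool {
        private final Map<Long, Long> liveBlocks = new HashMap<>(); // address -> size
        private long nextAddress = 1;
        private long bytesAllocated = 0;

        long allocate(long size) {
            long address = nextAddress++;
            liveBlocks.put(address, size);
            bytesAllocated += size;
            return address;
        }

        void free(long address) {
            Long size = liveBlocks.remove(address);
            if (size != null) {
                bytesAllocated -= size;
            }
        }

        long getBytesAllocated() {
            return bytesAllocated;
        }
    }

    /** Hypothetical stand-in for the Java-side allocator: bookkeeping only, no allocation. */
    static class ToyAllocator {
        private long bytesOwned = 0;

        // Called when a pool-allocated block is handed over to the Java side.
        ToyBuffer takeOwnership(ToyMemoryPool pool, long address, long size) {
            bytesOwned += size;
            return new ToyBuffer(this, pool, address, size);
        }

        void release(long size) {
            bytesOwned -= size;
        }

        long getBytesOwned() {
            return bytesOwned;
        }
    }

    /** A buffer handed to the Java side; closing it updates both the allocator and the pool. */
    static class ToyBuffer implements AutoCloseable {
        private final ToyAllocator owner;
        private final ToyMemoryPool pool;
        private final long address;
        private final long size;
        private boolean closed = false;

        ToyBuffer(ToyAllocator owner, ToyMemoryPool pool, long address, long size) {
            this.owner = owner;
            this.pool = pool;
            this.address = address;
            this.size = size;
        }

        @Override
        public void close() {
            if (closed) {
                return; // closing twice must not release twice
            }
            closed = true;
            owner.release(size);  // allocator drops its accounting
            pool.free(address);   // pool returns the memory (assumed wiring)
        }
    }

    public static void main(String[] args) {
        ToyMemoryPool pool = new ToyMemoryPool();
        ToyAllocator allocator = new ToyAllocator();

        long size = 1024;
        long address = pool.allocate(size); // the pool allocates
        try (ToyBuffer buffer = allocator.takeOwnership(pool, address, size)) {
            // prints "pool=1024 allocator=1024"
            System.out.println("pool=" + pool.getBytesAllocated()
                + " allocator=" + allocator.getBytesOwned());
        }
        // prints "pool=0 allocator=0"
        System.out.println("pool=" + pool.getBytesAllocated()
            + " allocator=" + allocator.getBytesOwned());
    }
}
```

Note that in this model the same 1024 bytes show up in both counters while the buffer is alive. The bytes are allocated once, by the pool, and the allocator's number is only a record of who owns them.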

zinking commented 1 year ago

@zhztheplayer are there actual usages that I can refer to? The role of allocator still sounds abstract to me.

zinking commented 1 year ago

Actually, let me put it this way:

  public BaseAllocator defaultAllocator() {
    NativeSQLMemoryConsumer consumer =
      new NativeSQLMemoryConsumer(taskMemoryManager, offHeapMemory);
    SparkAllocationListener al = new SparkAllocationListener(consumer);
    RootAllocator parent = ArrowDatasetUtil.rootAllocator();
    String name = "Spark Managed Allocator - " + UUID.randomUUID();
    return (BaseAllocator) parent.newChildAllocator(name, al, 0, parent.getLimit());
  }

  public NativeMemoryPool defaultMemoryPool() {
    NativeSQLMemoryConsumer consumer =
        new NativeSQLMemoryConsumer(taskMemoryManager, offHeapMemory);
    SparkReservationListener rl = new SparkReservationListener(consumer);
    return NativeMemoryPool.createListenable(rl);
  }

Should the two use the same MemoryConsumer, or does the allocator not use a MemoryConsumer at all? Here, NativeSQLMemoryConsumer is basically the same as MemoryConsumer.

@zhztheplayer