apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.56k stars 3.54k forks source link

[Java] DictionaryProvider leaks memory while adding dictionaries with duplicate encoding #32239

Open asfimport opened 2 years ago

asfimport commented 2 years ago

DictionaryProvider leaks memory while adding dictionaries with duplicate encoding. Is this expected? Should the provider release the memory of the existing dictionary vector if it accepts another one with same encoding id ?

Sample code:


"dictionaryProvider" should " not leak memory while adding dictionaries with duplicate encoding" in {

  val allocator: RootAllocator = new RootAllocator()

  val vector: ListVector = ListVector.empty("vector", allocator)
  val dictionaryVector1: ListVector = ListVector.empty("dict1", allocator)
  val dictionaryVector2: ListVector = ListVector.empty("dict2", allocator)

  val writer1: UnionListWriter = vector.getWriter
  writer1.allocate
  writer1.setValueCount(1)

  val dictWriter1: UnionListWriter = dictionaryVector1.getWriter
  dictWriter1.allocate
  dictWriter1.setValueCount(1)

  val dictWriter2: UnionListWriter = dictionaryVector2.getWriter
  dictWriter2.allocate
  dictWriter2.setValueCount(1)

  val dictionary1: Dictionary = new Dictionary(dictionaryVector1, new DictionaryEncoding(1L, false, None.orNull))
  val dictionary2: Dictionary = new Dictionary(dictionaryVector2, new DictionaryEncoding(1L, false, None.orNull))

  val provider = new DictionaryProvider.MapDictionaryProvider
  provider.put(dictionary1)
  provider.put(dictionary2)

  vector.clear()
  provider.getDictionaryIds.asScala.map(id => provider.lookup(id).getVector.clear())

  allocator.getAllocatedMemory shouldBe 0
} 

Reporter: Vimal Varghese

Note: This issue was originally created as ARROW-16920. Please see the migration documentation for further details.

asfimport commented 2 years ago

David Li / @lidavidm: CC @davisusanibar could you take a look?

That said, from what I see of when dictionary replacement/delta dictionaries in ARROW-7284 (https://github.com/apache/arrow/pull/5945), instead of replacing the dictionary entirely it seems the existing code mutates the dictionary (for both replacement and delta dictionaries) which is why the library itself doesn't run into this. The provider should probably either close the old dictionary or throw an error if the dictionary is replaced instead of mutated.

asfimport commented 2 years ago

David Dali Susanibar Arce / @davisusanibar: Base on the comment about "ensure java implementation meets clarified dictionary spec #5945":

 


A second dictionary batch for the same ID that is not a "delta batch"
in an IPC stream indicates the dictionary should be replaced. 

 

Current behavior is as expected as tested at https://github.com/apache/arrow/blob/f49547b1504d9a514fd847307a2a6715959dc5e1/java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java#L521:L525.

Please let me know if your proposal is to start a discussion to change this behavior? 

CC @lidavidm 

 


RootAllocator allocator = new RootAllocator();
ListVector vector = ListVector.empty("vector", allocator);
ListVector dictionaryVector1 = ListVector.empty("dict1", allocator);
ListVector dictionaryVector2 = ListVector.empty("dict2", allocator);
UnionListWriter writer1 = vector.getWriter();
writer1.allocate();
writer1.setValueCount(1);
UnionListWriter dictWriter1 = dictionaryVector1.getWriter();
dictWriter1.allocate();
dictWriter1.setValueCount(1);
UnionListWriter dictWriter2 = dictionaryVector2.getWriter();
dictWriter2.allocate();
dictWriter2.setValueCount(1);
Dictionary dictionary1 = new Dictionary(dictionaryVector1, new DictionaryEncoding(1L, false, new ArrowType.Int(8, true)));
Dictionary dictionary2 = new Dictionary(dictionaryVector2, new DictionaryEncoding(1L, false, new ArrowType.Int(8, true)));
DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
provider.put(dictionary1);
provider.put(dictionary2);
vector.clear();
for (Iterator<Long> it = provider.getDictionaryIds().iterator(); it.hasNext(); ) {
    long a = it.next();
    System.out.println(a);
}
System.out.println(allocator.getAllocatedMemory()); 
asfimport commented 2 years ago

David Li / @lidavidm: @davisusanibar I think this is purely an API design question. Note that the CDataDictionaryProvider has some comments on this too. If the behavior is expected, then it needs to be reflected in the docstring/API contract: dictionaries in a provider should be mutated, not directly replaced. And then we should raise an explicit error for the situation described here.

asfimport commented 2 years ago

David Dali Susanibar Arce / @davisusanibar: Ok @lidavidm , it is clear now, let me work on that

asfimport commented 2 years ago

David Dali Susanibar Arce / @davisusanibar: @lidavidm  What is the link docstring/API contract for Dictionary?

I am seeing docs/source/format and format but not sure where I could read docstring/API contract for Dictionary at this moment.

What is the link about comments for CDataDictionaryProvider also?

Thank you for your support.

asfimport commented 2 years ago

David Li / @lidavidm: I am talking about java docstrings. This is not anything to do with the format.

asfimport commented 2 years ago

Apache Arrow JIRA Bot: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.