Open asfimport opened 2 years ago
David Li / @lidavidm: CC @davisusanibar could you take a look?
That said, from what I see of when dictionary replacement/delta dictionaries in ARROW-7284 (https://github.com/apache/arrow/pull/5945), instead of replacing the dictionary entirely it seems the existing code mutates the dictionary (for both replacement and delta dictionaries) which is why the library itself doesn't run into this. The provider should probably either close the old dictionary or throw an error if the dictionary is replaced instead of mutated.
David Dali Susanibar Arce / @davisusanibar: Base on the comment about "ensure java implementation meets clarified dictionary spec #5945":
A second dictionary batch for the same ID that is not a "delta batch"
in an IPC stream indicates the dictionary should be replaced.
Current behavior is as expected as tested at https://github.com/apache/arrow/blob/f49547b1504d9a514fd847307a2a6715959dc5e1/java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java#L521:L525.
Please let me know if your proposal is to start a discussion to change this behavior?
CC @lidavidm
RootAllocator allocator = new RootAllocator();
ListVector vector = ListVector.empty("vector", allocator);
ListVector dictionaryVector1 = ListVector.empty("dict1", allocator);
ListVector dictionaryVector2 = ListVector.empty("dict2", allocator);
UnionListWriter writer1 = vector.getWriter();
writer1.allocate();
writer1.setValueCount(1);
UnionListWriter dictWriter1 = dictionaryVector1.getWriter();
dictWriter1.allocate();
dictWriter1.setValueCount(1);
UnionListWriter dictWriter2 = dictionaryVector2.getWriter();
dictWriter2.allocate();
dictWriter2.setValueCount(1);
Dictionary dictionary1 = new Dictionary(dictionaryVector1, new DictionaryEncoding(1L, false, new ArrowType.Int(8, true)));
Dictionary dictionary2 = new Dictionary(dictionaryVector2, new DictionaryEncoding(1L, false, new ArrowType.Int(8, true)));
DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
provider.put(dictionary1);
provider.put(dictionary2);
vector.clear();
for (Iterator<Long> it = provider.getDictionaryIds().iterator(); it.hasNext(); ) {
long a = it.next();
System.out.println(a);
}
System.out.println(allocator.getAllocatedMemory());
David Li / @lidavidm: @davisusanibar I think this is purely an API design question. Note that the CDataDictionaryProvider has some comments on this too. If the behavior is expected, then it needs to be reflected in the docstring/API contract: dictionaries in a provider should be mutated, not directly replaced. And then we should raise an explicit error for the situation described here.
David Dali Susanibar Arce / @davisusanibar: Ok @lidavidm , it is clear now, let me work on that
David Dali Susanibar Arce / @davisusanibar: @lidavidm What is the link docstring/API contract for Dictionary?
I am seeing docs/source/format and format but not sure where I could read docstring/API contract for Dictionary at this moment.
What is the link about comments for CDataDictionaryProvider also?
Thank you for your support.
David Li / @lidavidm: I am talking about java docstrings. This is not anything to do with the format.
Apache Arrow JIRA Bot: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.
DictionaryProvider leaks memory while adding dictionaries with duplicate encoding. Is this expected? Should the provider release the memory of the existing dictionary vector if it accepts another one with same encoding id ?
Sample code:
Reporter: Vimal Varghese
Note: This issue was originally created as ARROW-16920. Please see the migration documentation for further details.