Unstructured-IO / unstructured-ingest

Apache License 2.0

Memory address issue in postprocess method of Partitioner Class #196

Open naelsen opened 3 days ago

naelsen commented 3 days ago

There appears to be a memory address issue in the postprocess method of the Partitioner class (see here), at lines 123-124:

in_list = self.config.fields_include
elem = {k: v for k, v in elem.items() if k in in_list}

The second line creates a new dictionary and rebinds the elem variable to it, so elem no longer refers to the original dictionary object. This can cause problems when you need to keep the original memory address of elem for further operations, such as in flatten_metadata.
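The effect of the reassignment can be reproduced in isolation. This is a minimal sketch, not the library's code: the function name `filter_rebind`, the `alias` variable, and the sample element are hypothetical, and only the dict comprehension mirrors the snippet above.

```python
def filter_rebind(elem: dict, fields_include: list) -> dict:
    # Rebinds the local name to a brand-new dict; the caller's object is untouched.
    elem = {k: v for k, v in elem.items() if k in fields_include}
    return elem


# Hypothetical element; the real structure in the library may differ.
original = {"text": "hello", "type": "Title", "metadata": {"page": 1}}
alias = original  # a second reference, e.g. one held by a later processing step

result = filter_rebind(original, ["text", "type"])

same_object = result is original  # False: the comprehension built a new dict
alias_keys = sorted(alias)        # the original object still has every key
result_keys = sorted(result)      # only the new dict is filtered
```

Any code that still holds `alias` keeps seeing the unfiltered keys, which is the situation described above.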

To resolve this, instead of reassigning elem to a new dictionary, we can iterate over a copy of the keys and remove those that are not in fields_include:

for k in elem.copy().keys():
    if k not in self.config.fields_include:
        elem.pop(k)

mateuszkuprowski commented 2 days ago

You certainly have a valid point here! I would propose a slightly different solution however:

keys_to_remove = set(elem) - set(self.config.fields_include)
for key in keys_to_remove:
    del elem[key]

I think we're solving the same problem here, but in a slightly more efficient way by avoiding the copy, and a bit more readably by using the set difference directly. @rbiseck3 WDYT?
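Both proposals can be checked side by side for the property that matters here: the dictionary object keeps its identity. This is a rough sketch under assumptions: the function names and the sample element are hypothetical, and `fields_include` is passed as a plain list rather than read from `self.config`.

```python
def prune_with_key_copy(elem: dict, fields_include: list) -> None:
    # Iterate over the keys of a shallow copy so popping from elem is safe.
    for k in elem.copy().keys():
        if k not in fields_include:
            elem.pop(k)


def prune_with_set_difference(elem: dict, fields_include: list) -> None:
    # set(elem) is a snapshot of the keys, so deleting during the loop is safe.
    keys_to_remove = set(elem) - set(fields_include)
    for key in keys_to_remove:
        del elem[key]


# Hypothetical element; the real structure in the library may differ.
first = {"text": "hello", "type": "Title", "metadata": {"page": 1}}
second = dict(first)
first_id, second_id = id(first), id(second)

prune_with_key_copy(first, ["text", "type"])
prune_with_set_difference(second, ["text", "type"])

# Both objects were mutated in place, so their identities are unchanged.
first_kept_identity = id(first) == first_id
second_kept_identity = id(second) == second_id
```

Both variants still build a temporary collection of keys. The set-difference form avoids a shallow copy of the whole dictionary and builds two key sets instead.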