dpath-maintainers / dpath-python

A python library for accessing and searching dictionaries via /slashed/paths ala xpath.
MIT License
611 stars 91 forks source link

Filtering doesn't work as expected #40

Open xhh2a opened 9 years ago

xhh2a commented 9 years ago

I guess I will look into this sometime

>>> import dpath.util
>>> a = { 'actions' : [ { 'type': 'correct' }, { 'type': 'incorrect' } ] }
>>> dpath.util.values(a, 'actions/*')
[{'type': 'correct'}, {'type': 'incorrect'}]
>>> dpath.util.values(a, 'actions/*',afilter=lambda x: x.get('type', None) is 'correct')
[]
>>> def f(x):
>>>     print x
>>> dpath.util.values(a, 'actions/*',afilter=f)
[]
akesterson commented 9 years ago

what dpath version are you using?

akesterson commented 9 years ago

Nevermind, I can replicate this behavior on latest master.

akesterson commented 9 years ago

This has been broken for a long time, ever since values() was introduced, possibly even before. Really sorry about this one. I may not get around to a fix for this one tonight, but it's high on my list.

xhh2a commented 9 years ago

From a quick glance at the logic in the code yesterday, I think it is due to the way the code does in place filtering while it does the search so it fails on the top level branch of the dict before it hits a result where the filter would pass it. I was expecting a post dpath filter, not necessarily a bug. Although it didn't print anything so it could be something else.

Also I have a bug in the lambda should be .get() == 'correct', will test once I get into the office

xhh2a commented 9 years ago

So the typo doesn't change anything

akesterson commented 9 years ago

TL;DR - Filtering doesn't work the way any of us assumes it does!

So, after sleeping on this and pulling my head out of my buttcheeks, this boils down to a lack of good documentation. In fact, the lack of good docs is so bad that even I misunderstood what's actually going on behind the scenes, and how these functions work.

First, you're correct, filters are pre, not post. They are executed as each level of the source dictionary is parsed. Second, dpath treats dictionaries and lists as trees. Filters execute on the leaf nodes, e.g., the individual dictionary values; they do not execute on dictionaries or lists, which are branches on the tree, not leaf nodes. Consider:

>>> import dpath.util                                                                                                                                                                                                          
>>> a = { 'actions' : [ { 'type': 'correct' }, { 'type': 'incorrect' } ] }                                                                                                                                                     
>>> def f(x):                                                                                                                                                                                                                  
...     print x                                                                                                                                                                                                                
...                                                                                                                                                                                                                            
>>> dpath.util.values(a, 'actions/*',afilter=f)                                                                                                                                                                                
[]                                                                                                                                                                                                                             
>>> dpath.util.values(a, '**',afilter=f)                                                                                                                                                                                       
correct                                                                                                                                                                                                                        
incorrect                                                                                                                                                                                                                      
[]                                                                                                                                                                                                                             

The first call to values failed (and printed nothing) because actions/* matches only one level deep (not multi-level), and the only thing there is 2 dictionary objects, neither of which are leaf nodes, so the filter isn't executed on them, hence nothing gets printed (and not returned).

This behavior is not exclusive to values() either, the filtering happens in search, so search has the same issue:

>>> dpath.util.search(a, 'actions/*', afilter=f)
{}

.. If we expand the search range to "actions/**", we will reach the leaf nodes we want to filter against, and our print works:

>>> dpath.util.values(a, '**',afilter=f)
correct
incorrect
[]

... if we change our filter to actually give us some results now (with a .get() lambda like you had), we see now that this is obviously going to fail, because we're calling .get() on a string:

>>> dpath.util.values(a, 'actions/**', afilter=(lambda x: x.get('type', None) == 'correct'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dpath/util.py", line 117, in values
    return [x[1] for x in dpath.util.search(obj, glob, yielded=True, separator=separator, afilter=afilter, dirs=dirs)]
  File "dpath/util.py", line 144, in _search_yielded
    val = dpath.path.get(obj, path, view=False, afilter=afilter)
  File "dpath/path.py", line 276, in get
    if (afilter and (not afilter(target))):
  File "<stdin>", line 1, in <lambda>
AttributeError: 'str' object has no attribute 'get'

... but it works great if we check against a simple string comparison, because we're getting the leaf nodes:

>>> dpath.util.values(a, 'actions/**', afilter=(lambda x: x == 'correct'))
['correct']
>>> dpath.util.search(a, 'actions/**', afilter=(lambda x: x == 'correct'))
{'actions': [{'type': 'correct'}]}

So, while filtering does work exactly as it was designed, this implementation clearly defies the rule of least astonishment. Even though the docs clearly state how it works:

... Filtering functions receive every terminus node in a search - e.g., anything that is not a dict or a list, at the very end of the path. For each value, they return True to include that value in the result set, or False to exclude it.

... I think it's worth considering changing how this works, or supporting some kind of alternate filtering mechanisms.

RFC, @calebcase, OP?

xhh2a commented 9 years ago

In terms of alternative post-search, I am able to handle it using other methods. Thus it's not that big of a deal if a pre filter is kept. Is it less efficient than doing it while stepping through the tree? Yes, but it's not something that should result in significant issues for me in my use case. Honestly I can't think of too many use cases where I would want a pre search, but I can see where others may want it.

I suppose the documentation could definitely use some clarification on this. Changing the filter type might cause backwards incompatibility for users though so a post-search might be better to be added as a separate arg param.

Incidentally I need to remember to test unicode/ascii with filter.

akesterson commented 9 years ago

What if we could say that all filter arguments must be subclasses of some base filter object, say dpath.Filter, and each Filter can implement one or more filtering steps, which would be called at various steps of the process? Then the user could just implement/override the steps they care about, and you could also make multi-step filters this way:

class MyFilter(dpath.Filter):
    def pre_leaf(self, leaf):
        """pre-filter a leaf node"""
        pass
    def post_leaf(self, leaf):
        """post-filter a leaf node"""
        pass
    def pre_branch(self, branch)
        """pre-filter a branch node (list or dictionary)"""
        pass
    def post_branch(self, branch):
        """post-filter a branch node (list or dictionary)"""
        pass

This would raise the question re: what to do with the existing filter functionality (being able to pass a lambda is awfully convenient), but I think this implementation would be significantly more clear.

akesterson commented 8 years ago

This seems like an excellent thing to fix in 2.0, re-labeling.

akesterson commented 4 years ago

Blocked by #136