dylan-profiler / visions

Type System for Data Analysis in Python
https://dylan-profiler.github.io/visions/visions/getting_started/usage/types.html
Other
203 stars 19 forks source link

src/visions/types/url.py passes non URLs #185

Open leapingllamas opened 2 years ago

leapingllamas commented 2 years ago

src/visions/types/url.py does not correctly validate URLs.

First, the example code (lines 14--19) from the docs do not return True:

Python 3.9.4 (default, Apr  9 2021, 09:32:38)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import visions
>>> from urllib.parse import urlparse
>>> urls = ['http://www.cwi.nl:80/%7Eguido/Python.html', 'https://github.com/pandas-profiling/pandas-profiling']
>>> x = [urlparse(url) for url in urls]
>>> x in visions.URL
False
>>> x
[ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment=''), ParseResult(scheme='https', netloc='github.com', path='/pandas-profiling/pandas-profiling', params='', query='', fragment='')]
>>> import pkg_resources
>>> pkg_resources.get_distribution("visions").version
'0.7.4'

Second, non URLs are passing:

>>> urlparse('junk') in visions.URL
True
>>>

The code should probably check something like the following for each element of x:

    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc])
    except:
        return False

Finally, and this is a suggested enhancement, I think the behavior would be more useful if it handled raw strings and did the parsing internally without the caller having to supply a parser:

urls = ['http://www.cwi.nl:80/%7Eguido/Python.html', 'https://github.com/pandas-profiling/pandas-profiling']
>>> urls in visions.URL
True
ieaves commented 2 years ago

Hey @leapingllamas - thanks for your issue report. I'm still testing but this looks like both of those issues are a regression introduced here when we switched to backend specific implementations (pandas, numpy, python stdlib, etc...). If this is a blocker for you, try wrapping your test values into a pandas series and everything will work correctly again - i.e.

>>> import pandas as pd
>>> x = pd.Series(urlparse(url) for url in urls)
>>> x in visions.URL
True

The second issue

>>> urlparse('junk') in visions.URL
True
>>>

Is more interesting (and two separate questions). The first: technically this isn't a supported operation as urlparse('junk') isn't a sequence, however, the multidispatch library we use has dispatch resolution logic which looks for alternative methods which succeed on the input type.

Dispatch resolution details:

    If an exact match isn't registered, the next closest method is called (and cached).
    If the issubclass relation is ambiguous, mro position is used as a tie-breaker.
    If there are still ambiguous methods - or none - a custom TypeError is raised.
    Default and keyword-only parameters may be annotated, but won't affect dispatching.
    A skipped annotation is equivalent to : object.
    If no types are specified, it will inherently match all arguments.

It's basically finding the bugged python implementation above and caching it as an implementation on ParseResults. Probably something we should be aware of @sbrugman.

The second question is philosophical - what is a URL? That's going to vary based on individual use-case, we are using ParseResult objects as our standard representation. There's no way for us to know whether you're using relative URLs without explicit netloc or scheme so we take your word for it by default, you gave us a ParseResult and we believe you.

That being said, visions will handle raw strings and perform the parsing for you if you ask for it. The implementation we use actually ends up being basically the same as you proposed (with a few parallelization improvements for pandas) - you can find it here.

In order to get there you have to go through a typeset though. This sequence

urls = ['http://www.cwi.nl:80/%7Eguido/Python.html', 'https://github.com/pandas-profiling/pandas-profiling']

Is formally a sequence of strings which can be coerced to a URL (again, a URL is a ParseResult). Collections of types are combined into TypeSets connected through relations.

>>> vis.URL.get_relations()
[IdentityRelation(related_type=Object, type=None, relationship=lambda, transformer=identity_transform, inferential=False), InferenceRelation(related_type=String, type=None, relationship=default_relation, transformer=identity_transform, inferential=True)]

Although you can call those transformers directly from a type, it can be a little cumbersome as they are intended to be used from a typeset like this:

>>> urls = ['http://www.cwi.nl:80/%7Eguido/Python.html', 'https://github.com/pandas-profiling/pandas-profiling']
>>> typeset = vis.CompleteSet()
>>> typeset.infer(urls)
((ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment=''), ParseResult(scheme='https', netloc='github.com', path='/pandas-profiling/pandas-profiling', params='', query='', fragment='')), [Generic, Object, String, URL], {})

The first result is the inferred representation for your URLs, the second is the sequence of Types used to infer the representation, and the third is a cached state you mostly won't need.

ieaves commented 2 years ago

I meant to add, one of the nice things about visions is that it's entirely modular. If you'd prefer to define URLs as strings and bypass ParseResult objects altogether, you can do it! That would look like something like this:

from typing import Any, Sequence

from multimethod import multimethod

from visions.relations import IdentityRelation, TypeRelation
from visions.types.string import String
from visions.types.type import VisionsBaseType

class MyURL(VisionsBaseType):
    """**Url** implementation of :class:`visions.types.type.VisionsBaseType`.

    Examples:
        >>> from urllib.parse import urlparse
        >>> urls = ['http://www.cwi.nl:80/%7Eguido/Python.html', 'https://github.com/pandas-profiling/pandas-profiling']
        >>> x = [urlparse(url) for url in urls]
        >>> x in visions.URL
        True
    """

    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(String),
        ]
        return relations

    @staticmethod
    @multimethod
    def contains_op(item: Any, state: dict) -> bool:
        try:
            result = (urlparse(x) for x in item) 
            return all(all([res.scheme, res.netloc]) for res in result)
        except:
            return False

Then, if you want to use it in a typeset all you have to do is

typeset = vis.StandardSet() + MyUrl
ieaves commented 2 years ago

Bug fix is available at #187 @leapingllamas. If you are interested in contributing your own type implementations let me know and I'd be happy to get you started.