PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.78k stars 210 forks source link

Issues Parsing the Proxy Settings #124

Closed thebigG closed 3 years ago

thebigG commented 3 years ago

Hi everyone!

Hope you are all doing well.

Description

While testing the proxy capabilities I discovered that the proxy parsing/validation is not quite correct.

Steps to Reproduce

  1. Edit one of the settings files like settings_USA.yaml and uncomment the bottom of it:
 # Proxy settings
proxy:
 protocol: http  # NOTE: you can also set to 'http'
 ip: "168.169.146.12"
 port: 8080

NOTE: Notice how I removed the single quotes from the port field. This is something I'll fix when I push these fixes(along with unit testing for proxy.py)

  1. funnel load -s demo/settings_USA.yaml
  2. Wait to see the following error:

TypeError: _validate_type_ipv4address() missing 1 required positional argument: 'value'

The error stack is larger than that, but I'm trying to stay brief.

There are two issues happening here.

  1. The cerberus API does not pass 2 arguments to custom validators, like this function expects: def _validate_type_ipv4address(self, field, value): It expects 1. One can see this on the Cerberus API on this snippet of code which is on cerberus/validator.py:
                type_handler = self.__get_rule_handler('validate_type', _type)
                matched = type_handler(value)
            if matched:
                return

            # TODO uncomment this block on next major release
            #      when _validate_type_* methods were deprecated:
            # type_definition = self.types_mapping[_type]
            # if isinstance(value, type_definition.included_types) \
            #         and not isinstance(value, type_definition.excluded_types):  # noqa 501
            #     return

        self._error(field, errors.BAD_TYPE)
        self._drop_remaining_rules()

the function type_handler points to _validate_type_ipv4address, and as you can see it only passes 1 argument, value.

Easy fix: make _validate_type_ipv4address take 1 argument.

Once I fixed that, there was another issue:

Traceback (most recent call last):
  File "/home/lorenzogomez/.local/bin/funnel", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "/home/lorenzogomez/PycharmProjects/JobFunnel/jobfunnel/__main__.py", line 15, in main
    cfg_dict = build_config_dict(args)
  File "/home/lorenzogomez/PycharmProjects/JobFunnel/jobfunnel/config/cli.py", line 317, in build_config_dict
    raise ValueError(
ValueError: Invalid Config settings yaml:
{'proxy': [{'ip': ['must be of ipv4address type']}]}

This has to do with two functions in the code.

First this function:

class JobFunnelSettingsValidator(Validator):
    """A simple JSON data validator with a custom data type for IPv4 addresses
    https://codingnetworker.com/2016/03/validate-json-data-using-cerberus/
    """
    def _validate_type_ipv4address(self, value):
        """
        checks that the given value is a valid IPv4 address
        """
        try:
            # try to create an IPv4 address object using the python3 ipaddress
            # module
            ipaddress.IPv4Address(value)
            return True
        except:
            self._error(value, "Not a valid IPv4 address")

Notice how I added return True at the end of the try statement.

The second function on cli.py:

 # Validate the config passed via YAML
        if not SettingsValidator.validate(config):
            raise ValueError(
                f"Invalid Config settings yaml:\n{SettingsValidator.errors}"
            )

This fucntion expects a value from _validate_type_ipv4address, but it returns None(at least the current version of it). Hence the new return True in _validate_type_ipv4address.

Hopefully this explanation makes sense.

I will be working on fixing these issues, and will also add unit testing to proxy.py.

Cheers!

Expected behavior

jobfunnel should scrape normally with proxy settings on.

Actual behavior

The errors described above.

Environment

PaulMcInnis commented 3 years ago

ah good find, this was some techdebt that wasn't captured - I never got around to validating this feature.