PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.85k stars 215 forks source link

Check config after argparse #53

Closed markkvdb closed 4 years ago

markkvdb commented 4 years ago

Check config after argparse

Description

If the user provides an invalid configuration file, then the program will throw unclear errors. I think it's good practice to check all settings of the config dictionary and show a helpful error message in case the config is invalid.

Steps to Reproduce

  1. Provide an invalid configuration file (settings.yaml)
  2. Run funnel -s settings.yaml
studentbrad commented 4 years ago

Definitely a good idea. I believe this should be it's own file. That way it can be imported in __main__.py and executed just before config = parse_config(). Name could be something like: preparser.py validate.py

Just gonna throw it out there that with the json files I generated for test cases we could also validate that the location is a real place. That way we can also give warnings for users trying to use the tool in unsupported areas. May be a lot of work however.

markkvdb commented 4 years ago

I already started creating a configuration checker. Right now, I check the config after it has been parsed but it makes sense to do part of this before it is parse like you suggested.

What do you think of the following:

  1. Check whether all necessary field are provided in the yaml files and check whether they have the right types before parsing the configuration.
  2. Check whether all field have supported values after parsing the configuration. For example, we can check whether valid providers are given and check whether the region settings are valid.

The first step is really checking for supported settings keys and whether their values have the correct type, while the second step checks whether these values are actually make sense.

Let me know what you think 😄

studentbrad commented 4 years ago

Now that you mention it. Doing it after sounds much easier. It can be validated before being passed as args to JobFunnel. Let's do that.

markkvdb commented 4 years ago

Maybe parse_config can call a function validate_yaml(file) which does step 1 as described above and after obtaining config from parse_config we can check for sensible values in the validate.py file?

I have done a lot of this work already (locally).

PaulMcInnis commented 4 years ago

I think that the validation like you describe is a good approach - will save our users from providing invalid configs attributes and values.

markkvdb commented 4 years ago

Check out the parse-config branch for current progress. I will create a pull request in the near future but I'm currently brain-dead because I worked on it all afternoon so I think it's better to do this when I can think again.

studentbrad commented 4 years ago

Thanks for all of your hard work. If you don't mind I might suggest that we assert that the min_delay is not lower than 1 and that the max_delay is not lower than 10 in the validation. I had a quick discussion about this and we want to take every precaution necessary that this tool is not used to hit servers at an incredibly fast speed.

bunsenmurder commented 4 years ago

I glad you mentioned that, that'll help us avoid trouble down the road. Do you guys think we should just remove the option for turning off delaying?

markkvdb commented 4 years ago

I glad you mentioned that, that'll help us avoid trouble down the road. Do you guys think we should just remove the option for turning off delaying?

Yes, although I think that right now the set_delay isn't doing anything in the first place 😃. In the init function of JobFunnel the `args['set_delay'] is not used anywhere.

bunsenmurder commented 4 years ago

It meant for the parser; so if set_delay = False was set, the parser would skip parsing the delay config.

markkvdb commented 4 years ago

@bunsenmurder oh yes that makes sense. Now that I think about it, I think that I disabled set_delay in my new version of the parse_config proposed in #54. Adding it back takes one line but maybe it's better to remove the functionality altogether (like you suggested).

If everyone is okay with this then I can remove the set_delay functionality altogether in my pull request #54 since it's a minor thing.

markkvdb commented 4 years ago

This issue has been addressed in #54.