PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

-Provider names are used as a prefix for job ids now. #134

Closed thebigG closed 3 years ago

thebigG commented 3 years ago

Hi everyone,

Hope you are all doing well.

Description

Job ids are now created in the format PROVIDER_ID. This should address #123. It also addresses part of the issues discussed in #133. Hopefully #132 can move forward after this change.
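As a rough illustration of why the PROVIDER_ID format avoids key conflicts, here is a minimal sketch. It is not the actual JobFunnel code: `Job`, `key_id`, and `merge_jobs` are hypothetical names, and the provider strings and ids are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; not JobFunnel's real class."""
    provider: str  # assumed provider name, e.g. "INDEED" or "MONSTER"
    raw_id: str    # id as reported by the provider's own site
    title: str

    @property
    def key_id(self) -> str:
        # Build the key in PROVIDER_ID format so that identical raw ids
        # from different providers no longer collide.
        return f"{self.provider}_{self.raw_id}"


def merge_jobs(jobs):
    """Collect jobs into a dict keyed by the provider-prefixed id."""
    merged = {}
    for job in jobs:
        # Keep the first job seen for each key; later true duplicates are dropped.
        merged.setdefault(job.key_id, job)
    return merged


jobs = [
    Job("INDEED", "12345", "Backend Developer"),
    Job("MONSTER", "12345", "Data Analyst"),      # same raw id, different provider
    Job("INDEED", "12345", "Backend Developer"),  # genuine duplicate
]
merged = merge_jobs(jobs)
print(sorted(merged))  # ['INDEED_12345', 'MONSTER_12345']
```

With bare ids as keys, the Monster job above would have clashed with the Indeed job; with the prefix, only the genuine duplicate is dropped.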

Hopefully the changes make sense.

thebigG commented 3 years ago

#123 is a bit confusing, sorry about that. It looks like the original error was a CAPTCHA error, but looking at the conversation, at some point the user did get the key conflict this PR solves. So I guess this PR solves half of #123, if you will. The other half is the CAPTCHA issue, which is inevitable: users will encounter CAPTCHAs no matter what, for a myriad of reasons. We should probably document these CAPTCHA issues in the README, because at the end of the day it's an easy problem to solve; one can open Indeed/Monster/etc. in the browser and solve the CAPTCHA manually.

PaulMcInnis commented 3 years ago

Right, maybe we can add a readme for that issue and then this can close that issue for now.

I agree that we shouldn't be trying to circumvent captcha.

thebigG commented 3 years ago

> Right, maybe we can add a readme for that issue and then this can close that issue for now.

Just to confirm: do you want to modify the current readme or add a new document to JobFunnel explaining how to handle CAPTCHA? I was thinking of adding a very brief section called CAPTCHA to the current readme.

Now that I think about it, I remember you mentioning in the past that you want to keep the readme as short as possible. With that said, how about adding all of this CAPTCHA documentation to the wiki?

In the wiki, we can even write a little tutorial on how to solve the CAPTCHA for JobFunnel.

PaulMcInnis commented 3 years ago

Let's add a brief statement to the effect that captcha is not being circumvented, and that if you encounter scraper errors, you should try opening the site in a browser window.

Maybe at the end of the readme is fine.

PaulMcInnis commented 3 years ago

We may want to ensure we return the failed scraping url in the error message if we don't already.

thebigG commented 3 years ago

> We may want to ensure we return the failed scraping url in the error message if we don't already.

It looks like we do for Monster and Indeed, which are the ones supported at the moment:

        if not num_res:
            raise ValueError(
                "Unable to identify number of pages of results for query: {}"
                " Please ensure linked page contains results, you may have"
                " provided a city for which there are no results within this"
                " province or state.".format(search_url)
            )

PaulMcInnis commented 3 years ago

OK, looks good. When we release the next version we should note this change. Thanks @thebigG!