PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.

[DISCUSSION] Duplicate Ids Across Different Scrapers and Return Code for Scraping #133

Closed thebigG closed 2 years ago

thebigG commented 3 years ago

Hello Everyone,

Hope you are all doing well.

Just wanted to start a discussion about some things, as I'd much rather discuss them first before making a PR.

I want to discuss mainly two issues.

  1. The following code snippet:
    for exist_key_id in existing_job_keys:
        if inc_key_id == exist_key_id:
            raise ValueError(
                f"Inter-scraper key-id duplicate! {exist_key_id}"
            )

This is at jobfunnel.py:218, and it is currently what is holding #132 back from being merged. I wonder if this is even the proper way of handling inter-scraper key conflicts. I could be missing something, but I don't think this function should be throwing an exception: it seems perfectly possible that two different sites use the same job id for two different jobs. That's just my intuition, I don't have anything concrete to support it. And even if identical ids from two different sites did always mean the same job, I still don't think this function should throw. It could just return a bool and log a warning, roughly as sketched below. And if it must raise an exception, that exception should at the very least be caught somewhere, which it is not at the moment. I hope these points make sense.
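A minimal sketch of that warn-instead-of-raise alternative, assuming a made-up helper name (not the current jobfunnel.py code):

    import logging

    def has_inter_scraper_duplicate(inc_key_id, existing_job_keys):
        """Return True if the incoming key collides with an existing one,
        logging a warning rather than raising ValueError."""
        if inc_key_id in existing_job_keys:
            logging.warning("Inter-scraper key-id duplicate! %s", inc_key_id)
            return True
        return False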

  2. The second point of discussion has to do with the following code snippet:
    # Return value for Travis CI
    if (len(job_funnel.master_jobs_dict.keys()) > 1
            and os.path.exists(funnel_cfg.master_csv_file)):
        return 0
    else:
        return 1

This looks intentional and the purpose is clear: if there are no jobs in the dictionary, return a non-zero code, which means failure on most systems. The issue I see is that just because there are no jobs, it doesn't mean there is a bug. It could very well be due to forces outside of JobFunnel: a CAPTCHA, the sites being down at that moment, the network being down at that moment, etc. I totally understand the intent, but I'm not sure it is the best way of handling this, because our CI could "fail" because of a CAPTCHA or because JobFunnel wasn't able to find any remote jobs, which is currently the case with our last GitHub Actions build. A rough sketch of the distinction I'm getting at is below.
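Something along these lines, where only a missing CSV is a hard failure and an empty result set is a warning (the names and the exact policy are illustrative only, not a concrete proposal):

    import logging
    import os

    def exit_code(master_jobs_dict, master_csv_file):
        """Sketch: a missing CSV means we genuinely failed to produce
        output; an empty job dict may just be a CAPTCHA, a site outage,
        or a network outage rather than a bug in JobFunnel."""
        if not os.path.exists(master_csv_file):
            return 1
        if not master_jobs_dict:
            logging.warning("Scrape completed but found no jobs.")
        return 0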

Just wanted to start some discussion around this before I attempt to make any changes.

Cheers!

PaulMcInnis commented 3 years ago

Thanks for bringing this up!

  1. I believe this is because we key our dicts by the job id; the error is raised because it prevents us from overwriting an existing job.
  2. This is done because we have chosen job searches which should always yield a number of results - an assumption, of course, and I agree that this is too opaque.

For 1, we may want to consider prefixing job ids with the provider; then we can do away with this check entirely. Something like the sketch below.
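A minimal sketch of the prefixing idea (the function name is made up; the real key construction lives elsewhere in the code base):

    # Key jobs by "<provider>_<job_id>" so identical raw ids from
    # different providers can never collide in the master dict.
    def make_key_id(provider: str, job_id: str) -> str:
        return f"{provider}_{job_id}"

    # e.g. make_key_id("INDEED", "a1b2c3") != make_key_id("MONSTER", "a1b2c3")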

For 2, I think we just need better tests which ensure that we are always able to get a result, and which independently test the failure cases - e.g. something in the direction of the sketch below.
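A pytest-style sketch, reusing the hypothetical `make_key_id` helper from the prefixing idea above (redefined here so the tests are self-contained):

    def make_key_id(provider: str, job_id: str) -> str:
        return f"{provider}_{job_id}"

    def test_same_raw_id_from_different_providers_is_not_a_conflict():
        existing = {make_key_id("INDEED", "a1b2c3")}
        assert make_key_id("MONSTER", "a1b2c3") not in existing

    def test_same_provider_and_id_is_still_detected_as_duplicate():
        existing = {make_key_id("INDEED", "a1b2c3")}
        assert make_key_id("INDEED", "a1b2c3") in existing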

thebigG commented 3 years ago

I think your solution for 1 can work really well 👍.

For 2, yes, we need more testing. I've been trying to write more unit tests, but have been held back by open issues. I don't want to start writing tests for things that have pending issues, but we'll get there one day 😅.

PaulMcInnis commented 3 years ago

Ahah yea, there is a lot of testing ground to cover. In an ideal world we'd have 100% coverage, but this is open source after all 🤓.

To me, the priority ought to be getting our large-city issue resolved; all else can take a back seat.

PaulMcInnis commented 3 years ago

I think the majority of this issue should be resolved by the changes in #136; however, we may have a limitation in relying on the provider URIs here.

We may want to add duplicate jobs to the CSV with a special annotation, since duplicates will still be possible if the provider does not supply us independent URIs.
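A sketch of what that annotation could look like (the column name and writer details are made up, not the current CSV layout):

    import csv

    # Keep duplicates in the CSV, flagged in an extra column,
    # rather than silently dropping them.
    def write_jobs_csv(path, jobs):
        """jobs: iterable of dicts with at least 'key_id' and 'title'."""
        seen = set()
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(
                f, fieldnames=["key_id", "title", "duplicate"]
            )
            writer.writeheader()
            for job in jobs:
                writer.writerow({
                    "key_id": job["key_id"],
                    "title": job["title"],
                    "duplicate": "yes" if job["key_id"] in seen else "",
                })
                seen.add(job["key_id"])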