PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.78k stars 210 forks

JobFunnel: Failed to scrape jobs for IndeedScraperUSAEng #137

Closed evb-gh closed 3 years ago

evb-gh commented 3 years ago

Description

Running funnel load -s settings_USA.yml gives the following error:

[2021-03-16 18:34:58,123] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperUSAEng']
[2021-03-16 18:34:59,720] [ERROR] JobFunnel: Failed to scrape jobs for IndeedScraperUSAEng
[2021-03-16 18:34:59,720] [INFO] JobFunnel: Completed all scraping, found 0 new jobs.
[2021-03-16 18:34:59,882] [INFO] JobFunnel: Done. View your current jobs in demo_job_search_results/demo_search.csv

Environment

Would like to debug further but not sure how to do it.

corielljacob commented 3 years ago

Also receiving this. Monster working fine but Indeed fails every time, even with different search keywords. Using DEBUG logging, I was able to get the URL it was trying to hit and it seemed fine.

Environment:

PaulMcInnis commented 3 years ago

Thanks for opening an issue. I think we have some long-outstanding issues with parsing of the search URL for certain queries. If you are open to sharing your search URLs from the logs, it would be very helpful in identifying the issue.

We currently have CI for the US Indeed scraper, but it only performs a basic search.

Additionally, can you confirm that you are able to obtain results (non-advertisement results) for the search you are performing on the Indeed website?

corielljacob commented 3 years ago

Sure. My JobFunnel has also been failing the Monster scrape for the past few days (using crontab to run once daily). I would also try to debug if I could, but I'm not very familiar with running Python projects and I couldn't figure out how to run from PyCharm with the source 😅

URL: https://www.indeed.com/jobs?q=Software Engineer&l=tulsa%2C+OK&radius=25&limit=50&filter=0

I also tried the URL https://www.indeed.com/jobs?q=Software&l=tulsa%2C+OK&radius=25&limit=50&filter=0 just to see whether the space was throwing things off. That URL also failed.
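For reference, the unencoded space in the first URL can be ruled out by building the query string with the standard library. This is only a minimal sketch: `build_search_url` is a hypothetical helper written for illustration, not JobFunnel's actual URL-building code, and the parameter names are simply copied from the URLs above.

```python
from urllib.parse import urlencode


def build_search_url(keywords, city, region, radius=25, limit=50):
    """Hypothetical helper (not JobFunnel's code): build an Indeed search URL."""
    params = {
        "q": " ".join(keywords),     # keywords joined with spaces
        "l": f"{city}, {region}",    # location, e.g. "tulsa, OK"
        "radius": radius,
        "limit": limit,
        "filter": 0,
    }
    # urlencode percent-encodes reserved characters and turns spaces into '+'
    return "https://www.indeed.com/jobs?" + urlencode(params)


url = build_search_url(["Software", "Engineer"], "tulsa", "OK")
print(url)
```

Here the space in the keywords is encoded as `+` and the comma in the location as `%2C`, so the resulting URL contains no raw spaces.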

PaulMcInnis commented 3 years ago

OK, yeah, it looks like we need to improve the URL parsing! Can you instead try searching for two separate keywords, like this:

- Software
- Engineer
PaulMcInnis commented 3 years ago

Oh, I see that you tried with a single keyword as well. OK, I think this might be some other issue.

One thing to try is to use the current master of this repo. You can do that by installing it in place with pip install -e <path to this repo>

corielljacob commented 3 years ago

I went ahead and added the keywords separately as you suggested, and also installed the current master. However, there is still no change (I was potentially already using the current master).

PaulMcInnis commented 3 years ago

Ah ok, thanks for being so responsive, we’ll have to take a deeper look.

If you are feeling confident, I invite you to break execution in the scraper where we collect the number of pages of results from the search URL. I suspect the issue is there, since it ends up scraping no jobs.

corielljacob commented 3 years ago

I would be interested in doing some debugging, but I may need some advice on how to do so from something like PyCharm (open to another IDE you recommend). This is a tad out of scope for the issue, so pardon my intrusion. I am trying to run JobFunnel-master\jobfunnel\__main__.py, but doing so gets me an import error.

Like I mentioned, I'm not super familiar with running python, especially in a project like this so this may be completely the wrong place to try and start running 😅 but if you can point me in the right direction for how I might get to a point where I can set breakpoints and such, I'd be happy to play around.

PaulMcInnis commented 3 years ago

Unfortunately PyCharm doesn't work for this project due to use of abstract base classes.

The best way to debug is to add an import pdb; pdb.set_trace() in the code where you would like to debug.

Then you have access to a complete Python interpreter, e.g. pp var_im_interested_in

marchbnr commented 3 years ago

You should be able to debug modules, such as jobfunnel, in PyCharm like this: https://stackoverflow.com/a/51268846

evb-gh commented 3 years ago

If anyone reading this has the time and knowledge, could I ask you to write a step-by-step example of how to debug this code? I would like to understand how to debug this repo by running it from a local directory with PyCharm, the CLI, or Emacs.

PaulMcInnis commented 3 years ago

Re PyCharm: users have had issues using it with this repository in the past due to the ABC implementation: https://github.com/PaulMcInnis/JobFunnel/pull/90#issuecomment-683481157

I highly recommend just adding the line import pdb; pdb.set_trace() anywhere in the base scraper or indeed scraper and playing around with the available methods and variables (pp vars(self))

NOTE: to use pdb with multiprocessing.pool you will additionally want to set the number of workers to 1.
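A minimal sketch of why a single worker matters: with one worker the tasks can run in the calling thread, so a pdb prompt is not shared between concurrent workers. The names `scrape_one` and `scrape_all` are hypothetical stand-ins, not JobFunnel functions, and `ThreadPool` is used here only to keep the example self-contained.

```python
from multiprocessing.pool import ThreadPool


def scrape_one(job_id):
    """Hypothetical stand-in for the per-job scraping work."""
    # Uncomment the next line to drop into the debugger for each job:
    # import pdb; pdb.set_trace()
    return job_id * 2


def scrape_all(job_ids, workers=1):
    """Run scrape_one over job_ids, serially when workers == 1."""
    if workers == 1:
        # Serial path: runs in the calling thread, so pdb owns the terminal.
        return [scrape_one(job_id) for job_id in job_ids]
    # Parallel path: several workers would compete for the same pdb prompt.
    with ThreadPool(workers) as pool:
        return pool.map(scrape_one, job_ids)


print(scrape_all([1, 2, 3], workers=1))
```

Both paths return the same results in the same order; only the serial one is comfortable to step through interactively.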

evb-gh commented 3 years ago

Thanks for the quick reply. I apologize if my questions seem lazy (I have very little experience with python) but how do I run the code with test parameters (location, keywords) from local cloned repository?

PaulMcInnis commented 3 years ago

> Thanks for the quick reply. I apologize if my questions seem lazy (I have very little experience with python) but how do I run the code with test parameters (location, keywords) from local cloned repository?

Totally fine, happy to help!

You should be able to run with test parameters by doing this:

wget https://git.io/JUWeP -O my_settings.yaml
funnel load -s my_settings.yaml
evb-gh commented 3 years ago

Doesn't running funnel load -s my_settings.yaml run the code from /usr/local/bin/funnel, which then executes code from /usr/local/lib/python3.9/site-packages/jobfunnel?

What I'm trying to do is:

  1. Clone the repo locally to ~/jobfunnel
  2. Add import pdb; pdb.set_trace() to indeed.py or base.py
  3. Run the code from ~/jobfunnel with my_settings.yml
  4. Debug
PaulMcInnis commented 3 years ago

Right, I recommend doing this to get a test version of jobfunnel:

  1. git clone this repo somewhere
  2. checkout the branch you want to test
  3. virtualenv venv
  4. source venv/bin/activate
  5. pip3 install -e ./jobfunnel

When done, you can exit the virtualenv with deactivate.
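To check which copy of the package Python will actually import (the local clone or the one under site-packages), something like the following sketch can be run inside the activated virtualenv. `module_location` is a hypothetical helper written for this example; only the standard library's `importlib.util.find_spec` is used.

```python
import importlib.util


def module_location(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # module is not importable in this environment
    return spec.origin


# With an editable install, this should point into the cloned repo
# rather than into site-packages.
print(module_location("jobfunnel"))
```

If the printed path is still under site-packages, the editable install was probably not picked up by the interpreter that is running.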

PaulMcInnis commented 3 years ago

OK, so I think the best place to start is indeed.py line 303 in the current master. query_resp.find returns None, and I believe this is because the encoding of the request_html is incorrect somehow. I'm taking a look as well, since I want this to work for everyone :P

<bound m�D������]���nd of <html><body><p>�J ��_�~�ް��уƽ����� O�
���#T��v�r�M����i�7����ϼ���r��v�'�C�F�!�c�W��
i���K��+^6�n�����hy\)���΋���Y���b!  j��Z��VH���k����L_���wР�BXk@��9B�N����$|�&gt;L����'�K�w�p�D��%6�c�*�    ��,�l���X&amp;l�h@0���%�� �E�r�D\��xP��nȸc�[��C8�qH��_l����V1��-{.��<tl4z>�Jj6���
!K�!�^��B��2�R�����6�u'hǐ��gB��8�����2"���]��|�^�X�%���`�qx7R����\M�j�tR\]N��.bj�Y���n�6Åp�qr �`����7��v���ҪBnr��,�������zٳ���k!��
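Output like the above is what a parser sees when it is handed bytes that were never decoded properly. The thread does not say which encoding was at fault, so the following is only a sketch under the assumption that the body was still gzip-compressed; `decode_body` is a hypothetical helper, not part of JobFunnel.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def decode_body(raw, charset="utf-8"):
    """Hypothetical helper: decompress a gzip body if needed, then decode it."""
    if raw[:2] == GZIP_MAGIC:
        # Assumption: the body is gzip-compressed and was not decoded upstream.
        raw = gzip.decompress(raw)
    return raw.decode(charset, errors="replace")


html = "<html><body><p>Page 1 of 120 jobs</p></body></html>"
compressed = gzip.compress(html.encode("utf-8"))

# Decoding the compressed bytes directly yields unreadable text.
garbled = compressed.decode("utf-8", errors="replace")
print("<p>" in garbled)

# After decompression the markup is readable again.
print(decode_body(compressed) == html)
```

When the markup is unreadable, any search for an expected element in the parsed document finds nothing, which would be consistent with a find call returning None.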
PaulMcInnis commented 3 years ago

I didn't mean to close this abruptly, but I think the encoding was causing this. Please pull the latest changes and try again; this has resolved the issue on my end.