EssamWisam / cmp-docs

A comprehensive guide for prospective, current and past students in the computer engineering department of Cairo university.
https://cmp-docs.pages.dev

⏰ Reminder to Run LinkedIn Scraper #67

Closed github-actions[bot] closed 4 months ago

github-actions[bot] commented 5 months ago

It's been two weeks.

It's time to run the LinkedIn script to get the latest titles and current positions of CMP students and graduates.

To run the script, follow the steps below. If it's not your first time running the script, you can start from step 4:

Steps for running the LinkedIn script:

  1. Make sure you have Python 3 installed on your device. You can check by running the command below in your bash terminal; it should display the Python version if Python is already installed.
    python --version
  2. Install all the needed Python packages using the requirements.txt present in the scripts/linkedin-scraper directory.
    pip install -r "scripts/linkedin-scraper/requirements.txt"
  3. Download the Chrome Driver that is compatible with your OS and Chrome version from this link. It should be a zip file of about 10 MB or less. Extract it using WinRAR or a similar archive manager, then copy the chromedriver.exe file to the scripts/linkedin-scraper directory.
  4. Set the environment variables with valid LinkedIn credentials in the bash terminal as follows:
    export LINKEDIN_SCRAPER_EMAIL=<email>
    export LINKEDIN_SCRAPER_PASSWORD=<password>

    and replace <email> and <password> with the actual LinkedIn credentials. Note: you should probably avoid using your main LinkedIn account credentials, since the account risks being banned by LinkedIn after repeated scraping.

  5. Finally, you can run the script on all the class yaml files using the command below:
    python "scripts/linkedin-scraper/run.py" 

    and if you want to run the script for a certain class only, use the command below and replace 20XX with the graduation year of said class:

    python "scripts/linkedin-scraper/linkedin-scraper.py" "public/department/Extras/Classes/C20XX.yaml"
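
Before starting a long run, the prerequisites from steps 1 to 4 can be checked in a few lines. This is only a minimal sketch and not part of the repository: the `preflight` helper is hypothetical, and it assumes the driver file is named chromedriver.exe (as in step 3) or plain chromedriver on non-Windows systems.

```python
import os
import sys
from pathlib import Path

# Environment variable names taken from step 4.
REQUIRED_ENV = ("LINKEDIN_SCRAPER_EMAIL", "LINKEDIN_SCRAPER_PASSWORD")


def preflight(scraper_dir="scripts/linkedin-scraper", env=None):
    """Hypothetical helper: return a list of problems; empty means ready."""
    env = os.environ if env is None else env
    problems = []

    # Step 1: the script needs Python 3.
    if sys.version_info < (3,):
        problems.append("Python 3 is required")

    # Step 4: both credentials must be set and non-empty.
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"{name} is not set")

    directory = Path(scraper_dir)

    # Step 2: requirements.txt should be present in the scraper directory.
    if not (directory / "requirements.txt").is_file():
        problems.append(f"requirements.txt not found in {directory}")

    # Step 3: the driver should sit next to the script.
    # Assumption: "chromedriver" (no extension) is the name on non-Windows systems.
    driver_names = ("chromedriver.exe", "chromedriver")
    if not any((directory / name).is_file() for name in driver_names):
        problems.append(f"Chrome Driver not found in {directory}")

    return problems
```

Calling `preflight()` from the repository root and printing the returned list shows what is still missing. It does not verify that the packages are installed or that the credentials are accepted by LinkedIn.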

Last Notes:

EssamWisam commented 5 months ago

My turn!

Iten-No-404 commented 5 months ago

Good luck! If you face any issues, let me know.

Iten-No-404 commented 4 months ago

@EssamWisam, a new issue (https://github.com/EssamWisam/cmp-docs/issues/68) has been created since it has been 2 more weeks. If you are blocked for some reason, let me know. If, however, you are just busy and have the run scheduled for a later date, then no problem and take your time.

EssamWisam commented 4 months ago

I had scheduled this for a particular time but could not start because the time went elsewhere; then I remained busy and forgot. I will schedule to do this today inshAllah and write back here if anything blocks me.

EssamWisam commented 4 months ago

@Iten-No-404

May I ask, how long does the script typically take to run? I think I waited for about two hours or so yesterday and it was still going (then I closed the laptop lid and slept, and I think it was not able to continue after that).

I am unable to scroll up in the output to show some of the messages it was throwing, but they were about many missing profile pictures for 2023 students (including students I thought did have a profile picture).

The main issue in any case remains time. Could you let me know how long it typically takes?

Iten-No-404 commented 4 months ago

@EssamWisam, it typically takes around 3 hours to fully run. However, I believe the problem you're facing could be due to one or more of the following reasons:

  1. Your internet connection is slow.
  2. You entered your login credentials incorrectly and consequently the script wasn't able to log you in. Make sure you put them directly after the = sign, and don't use quotes, as in LINKEDIN_SCRAPER_PASSWORD="", since that doesn't work for some reason.
  3. The LinkedIn account you're using was flagged and banned for ~24 hours due to multiple logins.
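
For the second reason, a quick look at what Python actually receives from the environment can help. The thread does not say why quoted values fail, so the sketch below only assumes one possibility: that quote characters or stray whitespace end up inside the value itself. The `credential_warnings` helper is hypothetical and never prints the credential values.

```python
import os

# Environment variable names taken from the issue text.
NAMES = ("LINKEDIN_SCRAPER_EMAIL", "LINKEDIN_SCRAPER_PASSWORD")
QUOTES = ('"', "'")


def credential_warnings(env=None):
    """Hypothetical helper: describe suspicious credential values without revealing them."""
    env = os.environ if env is None else env
    warnings = []
    for name in NAMES:
        value = env.get(name, "")
        stripped = value.strip()
        if not stripped:
            warnings.append(f"{name} is missing or empty")
            continue
        if value != stripped:
            warnings.append(f"{name} has leading or trailing whitespace")
        # Assumption: a literal quote at either end means the quotes were kept
        # as part of the value instead of being removed by the shell.
        if stripped[0] in QUOTES or stripped[-1] in QUOTES:
            warnings.append(f"{name} starts or ends with a quote character")
    return warnings
```

An empty list from `credential_warnings()` only means the values look well-formed; it says nothing about whether LinkedIn accepts them or whether the account is temporarily banned.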

To try and debug or fix this issue, you can try the following steps:

  1. First, make sure that the login credentials are entered correctly and observe the UI screen that appears when you run the script for a bit to make sure that:
    • the login was successful &
    • the account isn't banned &
    • there is no verification needed. (If prompted, stop running the script, log in through a browser with the same LinkedIn credentials, and complete the verification prompt; it shouldn't prompt your account again for a while.)
  2. Secondly, you can try running the script on a single yaml file to make sure that everything is working properly, for example:
    python "scripts/linkedin-scraper/linkedin-scraper.py" "public/department/Extras/Classes/C2023.yaml"

    If running the script on a single YAML file was successful, then you can just run it again individually on C2020, C2021, and C2024_Credit (since they are the only ones with LinkedIn links besides C2023).
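
Running the classes one at a time can also be scripted, so that a failure in one class does not go unnoticed. This is a rough sketch rather than anything in the repository: `build_commands` and `run_all` are hypothetical names, and the file names for C2020, C2021, and C2024_Credit are assumed to follow the same pattern as C2023.yaml.

```python
import subprocess
import sys

# Paths taken from the commands in this thread.
SCRIPT = "scripts/linkedin-scraper/linkedin-scraper.py"
CLASSES_DIR = "public/department/Extras/Classes"

# Classes mentioned above as having LinkedIn links.
# Assumption: each one lives in <name>.yaml inside CLASSES_DIR.
CLASSES = ["C2020", "C2021", "C2023", "C2024_Credit"]


def build_commands(classes=CLASSES, python=sys.executable):
    """Build one command (as an argument list) per class file."""
    return [[python, SCRIPT, f"{CLASSES_DIR}/{name}.yaml"] for name in classes]


def run_all(commands):
    """Run the commands one after another; return the YAML paths that failed."""
    failed = []
    for command in commands:
        print("Running:", " ".join(command))
        result = subprocess.run(command)  # blocks until this class finishes
        if result.returncode != 0:
            failed.append(command[-1])
    return failed
```

From the repository root, `run_all(build_commands())` would go through all four classes in order, and `run_all(build_commands(["C2023"]))` would run a single class first as suggested in step 2. The credentials still have to be exported in the same terminal beforehand.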

If you are still facing any issues, let me know. Also, I can run the script this weekend if the issues still persist.

EssamWisam commented 4 months ago

Being abroad, I doubt the reason is a slow connection. I used my main account (was that wrong?) and I think output would have been different if it didn't log in.

I didn't know that it takes three hours. I expected much less, and I surely fell asleep at or before the two-hour mark. I will try a run again on 2023 only. With this duration of runtime, I think we should run this every month.

Iten-No-404 commented 4 months ago

I used my main account (was that wrong?)

Not wrong per se but definitely not recommended, for 2 reasons: it will alert everyone whose account you parsed that you viewed their profile, and more importantly your LinkedIn account will run the risk of being flagged or banned in the long run, which can't be a good thing, especially if you have invested a lot of effort into it.

I didn't know that it takes three hours.

With this duration of runtime, I think we should run this every month.

It's not a big deal, it takes almost an hour per class and it can be easily run in the background while you are working on other stuff. Either way, I don't mind changing the frequency of the runs to once a month or leaving it as is.

I think output would have been different if it didn't log in.

Fair enough. I still recommend observing the first 5 minutes or so to make sure that everything is running smoothly.

I will try a run again on 2023 only.

Alright. Good luck.

EssamWisam commented 4 months ago

Not wrong per se...

My fault for not noticing the note in the original text of the issue. I am now scared and hope inshAllah nothing will happen to my account. Is the dummy account you have been using still alive after using it multiple times?

It's not a big deal, it takes almost an hour per class...

I know it can be run in parallel but maybe some people like me get bothered when many tabs or programs are open unused (I frequently try to avoid that) and it's really longer than I expected. I have no idea why it is that slow; scraping usually tends to be somewhat faster.

Will set my expectations better next time when I try running it inshAllah.

I will try a run again on 2023 only.

After some time to recover from the minor shock...

Iten-No-404 commented 4 months ago

I am now scared and hope inshAllah nothing will happen to my account. Is the dummy account you have been using still alive after using it multiple times?

Don't worry, the ban affects the account almost instantaneously. If you are able to log in manually right now, then there shouldn't be any problem. As for the dummy account, it got some one-day bans/blocks but it is still active and usable. It is easy to wait out the bans. So, your account should be fine insha'allah.

I have no idea why it is that slow; scraping usually tends to be somewhat faster.

True, the script can be optimized a little to be faster but for now I think it's good enough.

After some time to recover from the minor shock...

No problem, take your time.

Iten-No-404 commented 4 months ago

@EssamWisam, I have run the script today, and so I will close all the 3 reminder issues. You don't need to run the script any time soon.

EssamWisam commented 4 months ago

@EssamWisam, I have run the script today, and so I will close all the 3 reminder issues. You don't need to run the script any time soon.

Thank you so much. inshAllah next time I will be aware of the consequences and properly ready when I do it.

Iten-No-404 commented 4 months ago

Thank you so much. inshAllah next time I will be aware of the consequences and properly ready when I do it.

Don't mention it. I really didn't do much. I left it running in the background while working. It didn't affect my schedule in any way.