EssamWisam / cmp-docs

A comprehensive guide for prospective, current and past students in the computer engineering department of Cairo university.
https://cmp-docs.pages.dev
52 stars · 8 forks

💡 Use LinkedIn to Find Current Positions and Facilitate Recommendations [Proposal] #33

Closed EssamWisam closed 3 months ago

EssamWisam commented 8 months ago

Background Information:

Idea:

Further Motivation: This solves the problem that a CMP graduate (i) may not know who in the class/department is looking for a job, and it's not easy to enumerate this manually, (ii) may work at a company that is looking for employees, and (iii) would be willing to recommend someone in the class/department if it weren't for (i).

Bonus Features: Could also scrape information such as current job (e.g., for stats or just viewing) and the profile picture (to close #29).

Formal Description: Given a list of LinkedIn profile links, extract the profile image, current role (if hired), and top skills (if presented) for each. This information will then be used on the class page.
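As a rough illustration of the per-profile data this describes, here is a minimal sketch. The field and function names are hypothetical (not part of the repo), and the actual Selenium scraping is omitted; this only shows how a scraped record could be reduced to an entry for the class page, dropping anything the scraper couldn't find:

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

# Hypothetical record of what the scraper would extract per profile.
@dataclass
class ProfileInfo:
    linkedin_url: str
    image_url: Optional[str] = None          # profile picture (helps close #29)
    current_role: Optional[str] = None       # e.g. "ML Engineer at ExampleCorp"
    top_skills: Optional[List[str]] = None   # up to five skills, if public

def to_class_entry(info: ProfileInfo) -> dict:
    """Convert a scraped record into the dict that would be merged into
    the class page's YAML entry, dropping fields that were not found."""
    return {k: v for k, v in asdict(info).items() if v}

record = ProfileInfo(
    linkedin_url="https://www.linkedin.com/in/example",
    current_role="ML Engineer at ExampleCorp",
    top_skills=["Python", "Selenium"],
)
entry = to_class_entry(record)
```

Here `image_url` is absent from `entry` because the sketch scraper found nothing for it, so the class page can fall back to a default picture.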

Feasibility: In a friendly chat, I discussed this idea with Tarek @KnockerPulsar, who had also been helping scrape another website with Selenium in another project. I asked him to confirm the feasibility of this, and he (thankfully) confirmed that it could be achieved with Selenium.

What do you think of the proposal @Iten-No-404 @KnockerPulsar ?

@KnockerPulsar Does GitHub Actions support the Chrome/browser driver needed for Selenium?

Iten-No-404 commented 8 months ago

All in all, I think it's a formidable idea and would be a great addition to the website.

EssamWisam commented 8 months ago

It could help students who are looking for an internship or a part-time job

Yes, absolutely, thanks for shedding light on that. I emphasized graduates because this is where it hurts more when someone doesn't find a job; it's much more tolerable for younger students (whose salaries are token anyway). But again, your point is perfectly valid.

It would invite more younger students to seek advice from the ones that already walk a path similar...

Another good point 👌🏻👌🏻.

Thirdly, it would fix and close https://github.com/EssamWisam/cmp-docs/issues/29

Thanks for re-emphasizing it.

some LinkedIn accounts are private...can just be scraped manually

Well, in my opinion, these could be completely ignored because they chose to make their profiles private (unless they want to update their own profile manually themselves via a PR). We could mention at the top of the page that every profile is expected to be public and to list the five top skills that LinkedIn asks for (anyone not complying with that may then simply not be interested).

All in all, I think it's a formidable idea and would be a great addition to the website.

Thanks. Unless Tarek is interested in working on it (he may be), I have no issue scheduling myself for it in the upcoming weeks.

Iten-No-404 commented 8 months ago

I emphasized graduates because this is where it hurts more when someone doesn't find a job

You're absolutely right.

in my opinion these could be completely ignored because they chose to make their profile private...anyone not complying with that may then be not interested.

Alright, I see your point and I agree.

Thanks. Unless Tarek is interested in working on it (he may be), I have no issue scheduling myself for it in the upcoming weeks.

Great, and you're welcome. I have no previous experience with Selenium, but if there's something you think I can help with, feel free to let me know.

C-Nubela commented 8 months ago

Hey there @Iten-No-404 @EssamWisam

I work for Proxycurl, a B2B data provider that extensively scrapes LinkedIn, and I just wanted to chime in:

You're gonna have a tough time scraping LinkedIn. Be prepared to deal with proxies, cookies, rotating LinkedIn accounts, and beyond.

That said, our whole thing at Proxycurl is taking care of the headache that is scraping LinkedIn for you.

We offer several endpoints that you could integrate into your product, such as our Person Profile Endpoint, which could grab details like work history, skills, and beyond.

Send us an email to "hello@nubela.co" if you have any questions!

EssamWisam commented 8 months ago

@C-Nubela Thanks for letting us know. We will surely get in touch should we conclude that our budget and current scraping abilities require that.

EssamWisam commented 5 months ago

@KnockerPulsar The last thing I heard from you was decent progress toward this. Can I get a report on how far along we are on the scraping part of this feature?

EssamWisam commented 5 months ago

@KnockerPulsar I understand you may be busy, but could you make a PR with the work so far...

KnockerPulsar commented 5 months ago

@KnockerPulsar I understand you may be busy, but could you make a PR with the work so far...

My sincerest apologies. I really cannot apologize enough for this delay. I'll try to make a PR tomorrow after work.

EssamWisam commented 5 months ago

@KnockerPulsar Thanks. I'm looking forward to having this feature up by finals. Hopefully, it will see real use after that.

Iten-No-404 commented 4 months ago

@EssamWisam, @KnockerPulsar. I believe we're missing one last thing before closing this issue: preparing a GitHub Action to run the script automatically on all the available YAML files (the English ones only, since the Arabic ones are already mapped accordingly), for example once every two weeks. I am going to give it a try on a new branch. Let me know if you have any thoughts on this.

EssamWisam commented 4 months ago

Indeed, I don't know why I initially thought that GitHub Actions might not support installing a browser driver in the first place, which is necessary for the script.

I think if we can turn it into a GitHub Action, it's easier to configure the periodic run time without worrying about anything, since it can support running as often as every six hours (though LinkedIn wouldn't be happy with that). So we could even make it every three days or so in that case.
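For illustration, a schedule like that could look something like this in a workflow file (the filename and cron expression are assumptions, not part of the repo; `0 0 */3 * *` fires at midnight UTC on days of the month divisible by three, which is roughly every three days):

```yaml
# Hypothetical .github/workflows/scrape-linkedin.yml (schedule only)
on:
  schedule:
    - cron: "0 0 */3 * *"   # midnight UTC, roughly every three days
  workflow_dispatch: {}      # also allow triggering the run manually
```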

Okay, good luck, and don't hesitate to mention it if any assistance is needed.

Iten-No-404 commented 4 months ago

Okay, so I tried creating a workflow for running the LinkedIn scraping script. You can find the latest version of the YAML file here.

The Chrome driver didn't take much time to set up. I am facing a different problem, though. From my understanding, when you try to log in from a new IP/MAC address, LinkedIn presents a security verification check (see the attached screenshots).

I don't think this can be bypassed by scripting, and unless we create a dedicated container with a stable IP/MAC address from which we have logged in manually once before, I don't see any other way of overcoming it. So, until we come up with another idea, the script can be run locally once a week and its output pushed as a normal commit.

@EssamWisam & @KnockerPulsar, let me know if you have any ideas.

EssamWisam commented 4 months ago

I think it's completely fine for us to run the script locally. I will likely just make a commit that adds support for Microsoft Edge as well, since I don't use Chrome. Maybe we can instead make a GitHub Action that runs every two weeks and opens a pull request asking us (reminding us) to run the script.

In the other issue related to this, we can also add the steps for running the script (which are quite simple).

Other than that, I wonder, does this help: https://stackoverflow.com/questions/66970875/is-it-possible-to-use-a-static-ip-when-using-github-actions
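For what it's worth, the reminder idea above could be sketched as a workflow like the following. Everything here is an assumption (filenames, schedule, wording), and since a pull request needs an actual diff to exist, opening a reminder issue via the preinstalled `gh` CLI may be the simpler variant of the same idea:

```yaml
# Hypothetical .github/workflows/scrape-reminder.yml
name: Scrape reminder
on:
  schedule:
    - cron: "0 9 */14 * *"   # roughly every two weeks, 09:00 UTC
jobs:
  remind:
    runs-on: ubuntu-latest
    permissions:
      issues: write
    steps:
      - name: Open a reminder issue
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh issue create --repo "$GITHUB_REPOSITORY" \
            --title "Reminder: run the LinkedIn scraping script" \
            --body "Please run the scraper locally and push the updated YAML."
```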

Iten-No-404 commented 4 months ago

I think it's completely fine for us to run the script locally. I will likely just make a commit that adds support for Microsoft Edge as well, since I don't use Chrome. Maybe we can instead make a GitHub Action that runs every two weeks and opens a pull request asking us (reminding us) to run the script.

Sounds good!

In the other issue related to this, we can also add the steps for running the script (which are quite simple).

They can be easily added to the README file.

Other than that, I wonder, does this help: https://stackoverflow.com/questions/66970875/is-it-possible-to-use-a-static-ip-when-using-github-actions

This mentions two separate ideas: larger runners and self-hosted runners. Neither documentation states explicitly whether we can access the VM's UI or not. I believe there is a chance that self-hosted runners can be accessed through a UI, so we could log in manually and pass the verification once before running the script. Either way, this needs further investigation.

Regarding this current issue, how about we close it and create a new one purely concerned with automatically running the script? That's if we decide to go through with it.

EssamWisam commented 4 months ago

OK. We can close this; I only delayed responding because I reached out to a friend from the credit-hours program who may have experience with this issue, but they seem to be busy (exam time). I'll come back here, or to a new issue if we make one, if my friend responds.

As for adding to the README, I was thinking of this issue as the place for the LinkedIn feature-specific instructions, just to keep the original README simpler for the broader audience who won't have much to do with adding their class or running the script.