EssamWisam / cmp-docs

A comprehensive guide for prospective, current and past students in the computer engineering department of Cairo university.
https://cmp-docs.pages.dev
52 stars, 8 forks

Automated linkedin scraping #49

Closed KnockerPulsar closed 4 months ago

KnockerPulsar commented 5 months ago

Work in progress, still needs integration with class yaml files

vercel[bot] commented 5 months ago

The latest updates on your projects. Learn more about Vercel for Git.

Name Status Preview Comments Updated (UTC)
cmp-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 14, 2024 11:28am
cloudflare-workers-and-pages[bot] commented 5 months ago

Deploying cmp-docs with Cloudflare Pages

Latest commit: aaa5b74
Status: ✅ Deploy successful!
Preview URL: https://3a6367ab.cmp-docs.pages.dev
Branch Preview URL: https://automated-linkedin-scraping.cmp-docs.pages.dev

View logs

netlify[bot] commented 5 months ago

Deploy Preview for cmp-docs ready!

Name Link
Latest commit aaa5b74943d08b907aebe31d973cf6d77d551b71
Latest deploy log https://app.netlify.com/sites/cmp-docs/deploys/66434aa59c41d10008cdd72d
Deploy Preview https://deploy-preview-49--cmp-docs.netlify.app

EssamWisam commented 5 months ago

Sincere thanks for all the work so far and the final integrations coming ahead 🙏.

KnockerPulsar commented 5 months ago

Sincere thanks for all the work so far and the final integrations coming ahead 🙏.

I'm making some changes to improve robustness (mostly just handling exceptions and retrying), but the final output should be the same. Right now I read the input yaml file and modify it in memory, before writing it into "test.yaml" when scraping is done. I think that the next logical step would be to modify the input file itself, but I'll leave that until I'm sure things work well.
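
The "handle exceptions and retry" idea can be sketched as a small helper. This is only an illustration, not the code in this PR: the name `with_retries`, the number of attempts, and the backoff delays are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); if it raises, wait and try again, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

A scraping step could then be wrapped as `with_retries(lambda: scrape_profile(url))`, where `scrape_profile` stands in for whatever function fetches one profile.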

EssamWisam commented 5 months ago

I'm making some changes to improve robustness (mostly just handling exceptions and retrying), but the final output should be the same. Right now I read the input yaml file and modify it in memory, before writing it into "test.yaml" when scraping is done. I think that the next logical step would be to modify the input file itself, but I'll leave that until I'm sure things work well.

Sounds fantastic. Yep what you're doing seems the ideal way to go as the feature is being experimented.

KnockerPulsar commented 5 months ago

@EssamWisam So I think I'm done with modifications so far. I changed the class yaml file structure a bit, but hopefully it shouldn't be much to integrate this with the frontend code. I couldn't test my latest change since LinkedIn flagged my profile so now I have to manually verify after each login, but I'm sure the changes work fine.

EssamWisam commented 5 months ago

Looking at the class yaml I can only see another field for the LinkedIn URL. So your next step is to add the scraped image, headline, top skills and current role to the fields of each student, right?

Once you do let me know and I can follow up in this PR by fixing the frontend as needed.

KnockerPulsar commented 5 months ago

Looking at the class yaml I can only see another field for the LinkedIn URL. So your next step is to add the scraped image, headline, top skills and current role to the fields of each student, right?

Once you do let me know and I can follow up in this PR by fixing the frontend as needed.

The code to add those fields and update the file exists (and should work), but as mentioned previously LinkedIn is giving me a hard time testing so I can't run the script and push the updated file right now.

Can you try running it on your end and see how it goes? Don't forget to download the chrome driver and comment out the headless option line so you can see if the scraper gets stuck on some page. You usually get a 2FA verification page the first time you run the script, so you can probably sign in first in an incognito window, then run the script.

Iten-No-404 commented 5 months ago

@KnockerPulsar, I tried running the script on the 2023 class. Fixed some minor issue in how the yaml file is modified and saved at the end of the script. I also increased the delay between viewing profiles to decrease the chances of the account being flagged.

It's working well overall but I noticed it doesn't always extract the current position. I am not sure why exactly since I haven't tried debugging it. We can add it as an issue for now. Either way, great work! 👏🏻

Lastly, I modified the UI naively to accommodate the new properties (title, current position, & top skills).

As for the next steps, I'll create a mini-script that adds these new properties to the Arabic yaml from the English yaml and run the script on the rest of the English class yaml files.

I think we can merge this PR unless there's something you'd like to modify, @EssamWisam.

EssamWisam commented 5 months ago

It's working well overall but I noticed it doesn't always extract the current position.

Presumably, this only occurs at a low probability? Because if it's high it may be faster for @KnockerPulsar to fix before merging.

Lastly, I modified the UI naively to ...

Thank you so much; impressive effort as usual!

...to accommodate the new properties (title, current position, & top skills).

I assume this implies that, contrary to what I expected, @KnockerPulsar you didn't extract links for profile images. That's fine, but it would be really nice to have in the future (e.g., in another PR).

...run the script on the rest of the English class yaml files.

Hope the script handles the case where the LinkedIn link doesn't exist and simply does nothing. I'm afraid most, if not all, of the others did not include LinkedIn links. That said, I think this feature will attract more additions (and next summer representatives will have more time for this).

I think we can merge this PR unless there's something you'd like to modify

Sure thing. I was going to ask for a screenshot for the UI to see if I can recommend anything but I do realize that the natural expectation should be that it's great anyway so you can just go ahead and merge.

Iten-No-404 commented 5 months ago

Presumably, this only occurs at a low probability? Because if it's high it may be faster for @KnockerPulsar to fix before merging.

It's actually a high probability: 58 out of 83 students (~70%) have their current position as null.

Thank you so much; impressive effort as usual!

No need yet at least. 😂 I haven't done much to the UI. Look at this: image I am designing a new look for it but it is not finished yet. Should be done in a couple of hours.

I assume this implies that, contrary to what I expected, @KnockerPulsar you didn't extract links for profile images.

No, the images are indeed extracted correctly. I just didn't need to modify the UI for them since we already had a key for the image link in the yaml file beforehand and used it.

Hope the script handles the case where the LinkedIn link doesn't exist and simply does nothing. I'm afraid most, if not all, of the others did not include LinkedIn links. That said, I think this feature will attract more additions (and next summer representatives will have more time for this).

Will do!

Sure thing. I was going to ask for a screenshot for the UI to see if I can recommend anything but I do realize that the natural expectation should be that it's great anyway so you can just go ahead and merge

Let's see first if Tarek can fix the current position issue or identify the cause as you've said and by then the mini-script and better UI should be all done as well.

EssamWisam commented 5 months ago

It's actually a high probability: 58 out of 83 students (~70%) have their current position as null.

Okay then let's wait for input from @KnockerPulsar.

Possibilities are: it's just a bug and he knows an easy fix for it or we drop this overall and assume that the title carries such information. That said, in the latter case I assume that we are using the fetched profile images because:

No need yet at least. 😂 I haven't done much to the UI. Look at this:

Still impressive: @KnockerPulsar asked me to run it, and I had been putting that off for later while also expecting to do the UI myself. So I'm thankful and impressed to see this 🙏.

I am designing a new look for it but it is not finished yet. Should be done in a couple of hours.

It does the job anyway. I'm sure what you come up with will be great, inshAllah. My original thoughts were just that it would be nice to show the title under the name before showing the modal (especially if it's short and indicates a role) and a green circle around those with no role (but the latter may be dropped, as I said, if fixing the issue is hard).

Let's see first if Tarek can fix the current position issue or identify the cause as you've said and by then the mini-script and better UI should be all done as well.

Makes sense. Thank you so much for the effort.

KnockerPulsar commented 4 months ago

I'll look into the current position bug as soon as I can :)

EssamWisam commented 4 months ago

I'll look into the current position bug as soon as I can :)

Alrighty. Reminder that if it turns out to be complex (I know timeline div is spaghetti), a basic solution that simply tells us whether the person is hired or not will suffice.

For instance, just search that div for the word "present" and conclude that the person is not hired if it does not appear.
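
As a rough sketch of that fallback, assuming the text of the experience div has already been extracted into a string (the function name `appears_employed` is made up for illustration):

```python
import re


def appears_employed(experience_text: str) -> bool:
    """Return True if the experience text mentions an ongoing ("Present") role.

    Matches "present" as a whole word, case-insensitively, so a word such as
    "Presentation" does not count as a current position.
    """
    return re.search(r"\bpresent\b", experience_text, flags=re.IGNORECASE) is not None
```

This only answers hired or not hired; it says nothing about which role is current, and it would miss profiles shown in a language other than English.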

Iten-No-404 commented 4 months ago

I'll look into the current position bug as soon as I can :)

Thank you, no rush. :D

This is how the student profiles look now: image image The Arabic version doesn't look the best since half of it is in English but it is okay for the data we have.

  • The purpose of extracting the role was mainly to find those with no current role and put a green circle around them

Added the green circle for now, could easily remove it if we find that it is unneeded later on. I also added the title below the student circles. I have a concern though, some people have really long titles, maybe I could limit the number of characters and add a trail of ... instead? image

Lastly, added a mini function/script that adds the same properties extracted by the scraper to the Arabic YAML as well.
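
For readers who want the gist of such a helper, here is a minimal sketch. It is not the function from this PR: it assumes both rosters are already loaded as lists of dicts, that entries can be matched on `linkedin_url`, and that the scraper's fields are `title`, `top_skills` and `current_position`.

```python
from typing import Any, Dict, Iterable, List

# Assumed names of the fields written by the scraper.
SCRAPED_KEYS = ("title", "top_skills", "current_position")


def copy_scraped_fields(
    english_students: List[Dict[str, Any]],
    arabic_students: List[Dict[str, Any]],
    keys: Iterable[str] = SCRAPED_KEYS,
) -> List[Dict[str, Any]]:
    """Return copies of the Arabic entries with scraped fields taken from the English ones."""
    by_url = {s["linkedin_url"]: s for s in english_students if s.get("linkedin_url")}
    merged_students = []
    for student in arabic_students:
        merged = dict(student)  # never mutate the caller's data
        source = by_url.get(student.get("linkedin_url"))
        if source is not None:
            for key in keys:
                if key in source:
                    merged[key] = source[key]
        merged_students.append(merged)
    return merged_students
```

Entries without a LinkedIn URL, or without a match in the English roster, are returned unchanged.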

EssamWisam commented 4 months ago

Top-of-the-line work. Thank you for your time and effort in this valuable contribution @Iten-No-404 .

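Only one very minor comment: for the headline printed under the profile, do you think it would look more uniform if:

  • We only use the first k characters of the headline and then ... if it exceeds k
  • We split at the first punctuation mark (can be pipe or comma or... as in image) and take the first entry only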

I think the one showing up when the profile is clicked can contain the full information.

Iten-No-404 commented 4 months ago
  • We only use first k characters of the headline and then ... if it exceeds that k
  • We split at the first punctuation mark (can be pipe or comma or... as in image) and take the first entry only

Employed a mix of both. The steps are as follows:

  1. Check if the length is less than 40 characters, then leave it as is.
  2. If not, then try to split on either "|" or "," since they are the most used connectors in LinkedIn titles and show the first split if it's less than 40 characters.
  3. Lastly, if the first split was more than 40 characters or there wasn't any possible split, I'll show the first 37 characters + "..."
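
The three steps above can be sketched as follows. The site's real implementation lives in the frontend and may differ; the function name and the handling of an empty first segment are assumptions.

```python
import re

MAX_LEN = 40  # limit mentioned in the steps above


def shorten_headline(headline: str, max_len: int = MAX_LEN) -> str:
    """Shorten a LinkedIn headline for display under a student's picture."""
    # Step 1: short headlines are shown unchanged.
    if len(headline) < max_len:
        return headline
    # Step 2: keep only the text before the first "|" or ",".
    parts = re.split(r"[|,]", headline, maxsplit=1)
    if len(parts) > 1:
        first = parts[0].strip()
        # Assumption: an empty first segment falls through to truncation.
        if first and len(first) < max_len:
            return first
    # Step 3: hard truncation with a trailing ellipsis.
    return headline[: max_len - 3] + "..."
```

With the default limit, step 3 always produces exactly 40 characters (37 plus the three dots), so the titles under the circles never exceed one fixed width.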

And here's a preview of it after the change: image

Top-of-the-line work. Thank you for your time and effort in this valuable contribution @Iten-No-404 .

You're very welcome. Glad I could help. 😊

KnockerPulsar commented 4 months ago

Hey folks, spent a couple of hours debugging this. Even after employing a brute-force solution (recursively checking all children of section tags for id == 'experience'), the section is sometimes still not found. I'll try giving it another debugging session on the weekend. Meanwhile, I'd be happy to provide guidance if anyone wants to try their hand :)

EssamWisam commented 4 months ago

Thank you so much for your time.

I have just tried inspecting the HTML in my profile and it appears to be in a div tag rather than a section tag, assuming you were speaking literally.

That said, I know a group of credit students that work at some company where they tell me they did LinkedIn scraping. Can try to reach out if needed.

Iten-No-404 commented 4 months ago

Hey folks, spent a couple of hours debugging this. Even after employing a brute-force solution (recursively checking all children of section tags for id == 'experience'), the section is sometimes still not found.

Thank you, @KnockerPulsar. What would you say is the current percentage of current positions found? 🤔

all children of section tags for id == 'experience')

it appears to be in a div tag rather than a section tag,

I analyzed the structure of the page a little and these are my findings: Let's say after searching for id == 'experience', we get the following:

<div id="experience"></div>
<div>Ignore this division tag that is strictly the first child after the id found.</div>
<div>
   <ul>
      <li>
         <div>
            <div>Ignore this division tag that is strictly the first child.</div>
            <div>This division tag contains a series of nested div tags that contain details about the newest current position in span tags.</div>
         </div>
      </li>
      <li>This list item has the same structure as the one above but we are probably only interested in the first one anyway.</li>
   </ul>
</div>

I think this can help.
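
To make the traversal concrete, here is a small sketch that walks a simplified, well-formed copy of that structure with the standard library. The sample markup, role and company name are invented, and real LinkedIn pages are not valid XML, so the actual scraper would do the equivalent steps through its browser driver.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical markup mirroring the structure described above.
SAMPLE = """
<section>
  <div id="experience"></div>
  <div>header to skip</div>
  <div>
    <ul>
      <li>
        <div>
          <div>logo to skip</div>
          <div><div><span>Software Engineer</span><span>Example Corp</span></div></div>
        </div>
      </li>
      <li><div><div>logo</div><div><span>Intern</span></div></div></li>
    </ul>
  </div>
</section>
"""


def newest_position_spans(root: ET.Element) -> Optional[List[str]]:
    """Return the span texts of the first experience entry, or None if not found."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    anchor = root.find(".//*[@id='experience']")
    if anchor is None or anchor not in parent_of:
        return None
    siblings = list(parent_of[anchor])
    index = siblings.index(anchor)
    if index + 2 >= len(siblings):
        return None
    container = siblings[index + 2]  # skip the sibling right after the anchor
    first_item = container.find("./ul/li")  # newest entry comes first
    if first_item is None:
        return None
    wrapper = first_item.find("./div")
    if wrapper is None:
        return None
    blocks = wrapper.findall("./div")
    if len(blocks) < 2:
        return None
    details = blocks[1]  # skip the first child div
    return [s.text.strip() for s in details.iter("span") if s.text and s.text.strip()]


print(newest_position_spans(ET.fromstring(SAMPLE)))
```

Returning `None` at every missing step keeps "section not found" distinguishable from "section found but empty", which matters when deciding whether scraping failed.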

I'll try giving it another debugging session on the weekend.

Best of luck.

Meanwhile, I'd be happy to provide guidance if anyone wants to try their hand :)

I can give it a try as well over the weekend, and if I have a problem with something, I'll be sure to let you know.

KnockerPulsar commented 4 months ago

Alright, I think I figured out what was broken. I was checking if one field existed or not and assigned a completely different field...

With that behind me, one small note I'd like to add is that the scraper always fails to scrape the account that it's running from (i.e. the account of Tarek Yasser fails to scrape the profile of Tarek Yasser). This is because the HTML layout is different if you're viewing your own profile. It shouldn't be a problem though since we planned on using a completely separate account anyway.

@EssamWisam @Iten-No-404 Can you test and confirm with me?

Iten-No-404 commented 4 months ago

Alright, I think I figured out what was broken. I was checking if one field existed or not and assigned a completely different field...

Amazing work, @KnockerPulsar! The students without a current position are now 31/83, 9 of whom don't have a LinkedIn account, so we could say it's 22/74, i.e. ~30% were not extracted correctly or don't actually have anything added as a current position. I think that's a much better percentage.

With that behind me, one small note I'd like to add is that the scraper always fails to scrape the account that it's running from (i.e. the account of Tarek Yasser fails to scrape the profile of Tarek Yasser). This is because the HTML layout is different if you're viewing your own profile. It shouldn't be a problem though since we planned on using a completely separate account anyway.

Good to know. I used a separate account to run it on the 2023 & 2024 Credit yaml files. The test went well, I just modified a few minor things to run the script on yaml files other than 2023 (different yaml structure).

I think we can merge this PR if no one has anything else to add.

EssamWisam commented 4 months ago

Thanks a lot for the effort @KnockerPulsar and thanks for the follow up @Iten-No-404 .

The claim so far is that the accuracy is 70% or higher which is great but what we may really need to look at is the recall of the class of interest (not currently hired) as a major motivation for this feature has been to be able to find those not hired yet and help them out whenever possible. Thus, there are two possibilities:

What do you think @KnockerPulsar @Iten-No-404 ?

Iten-No-404 commented 4 months ago

we may really need to look at is the recall of the class of interest (not currently hired) as a major motivation for this feature

@EssamWisam, I agree with you but there's a small problem. Not everyone keeps their LinkedIn profiles up-to-date with their current employment status.

I analysed the 31 students with empty current positions by manually going to their LinkedIn profiles:

As you can see above, the 26 students (9+17) would be correctly marked as unemployed as per the available data. However, we don't have the means to tell whether a student is truly unemployed or simply didn't update their LinkedIn unless we ask them directly, which isn't really feasible to do periodically.
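
As a quick check of what those numbers say, the share of "no current position" flags that agree with the LinkedIn data can be computed directly. This measures precision of the flag relative to LinkedIn, not recall; the figures 31 and 26 come from this thread, and the function name is only illustrative.

```python
def flag_precision(flagged: int, consistent_with_profile: int) -> float:
    """Share of flagged students whose flag matches what their profile shows."""
    if flagged <= 0:
        raise ValueError("flagged must be positive")
    if not 0 <= consistent_with_profile <= flagged:
        raise ValueError("consistent_with_profile must be between 0 and flagged")
    return consistent_with_profile / flagged


flagged = 31             # students whose current position came back empty
consistent = 9 + 17      # the 26 students mentioned above
print(f"{consistent}/{flagged} = {flag_precision(flagged, consistent):.1%}")
print(f"flags not explained by the profile: {flagged - consistent}")
```

Recall for the "not hired" class cannot be derived from these numbers alone, because it would also need the students who are unemployed but were not flagged.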

  • This can cause confusion down the line if person X talks to their company about fantastic person Y, who seems to be not hired according to the website, only to find out that Y is actually hired when they reach out to them.

Well, person X can reach out to person Y before recommending them which shouldn't cause any confusion.

Lastly, for possible courses of action, we can either:

If you have any other ideas, feel free to suggest them.

EssamWisam commented 4 months ago

@Iten-No-404 Valuable input and agree with your general idea. Thank you for your time in checking. What I may add is:

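The options I see, assuming recall is improved or we can tell whether scraping failed for a specific individual:

  • For those with no LinkedIn link in the first place, don't put the green circle, we can define it as "Those with LinkedIn accounts and appear to be unemployed"
  • For those where scraping failed maybe put a thinner orange circle (and it will be defined as "Career information about that person couldn't be extracted").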

Well, person X can reach out to person Y before recommending them which shouldn't cause any confusion.

Sometimes, and this has happened to me before, one doesn't know the feasibility of the recommendation and so has to reach out to the company first ("I know friend Y who is fantastic in all they're doing and....") to find out whether it will be feasible to get them in. It may cause fewer hard feelings to do this without letting friend Y know: if it's not possible, tell them nothing, and if it's possible, give them the good news. Likewise, in some cases X may have many Ys, and it would be tedious to ask each of them compared to just checking CMPDocs.

The options I see assuming recall is not improved:

Will leave the executive decision to you but would be happy to hear the rationale.

P.S.: Notice that with the definition (which could be put at the bottom in a small grey font), anyone who is employed but fails to report it on LinkedIn can be treated as unemployed; they will be the ones who have to apologize when someone recommends them, since they present themselves as not currently working on LinkedIn. That is, we can disregard this case.

KnockerPulsar commented 4 months ago

I just went through the scraping output to double-check the results. Profiles with invalid profile URLs don't have a current position field, profiles which don't have an experience section or don't have a present position have the field as null, and otherwise, the positions seem correct. @Iten-No-404 mind sharing the problematic profiles so I can take a closer look?
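
Those three cases can be told apart with a small check on each student entry. The sketch below assumes an entry is a dict loaded from the class YAML and that the field is named `current_position`, as in the scraper output; the enum and function names are illustrative.

```python
from enum import Enum
from typing import Any, Dict


class PositionStatus(Enum):
    NOT_SCRAPED = "no current_position field (e.g. invalid or missing profile URL)"
    NO_CURRENT_POSITION = "field is null (no experience section or no present role)"
    HAS_POSITION = "field holds the scraped position"


def position_status(student: Dict[str, Any]) -> PositionStatus:
    """Classify a roster entry by the state of its current_position field."""
    if "current_position" not in student:
        return PositionStatus.NOT_SCRAPED
    if student["current_position"] is None:
        return PositionStatus.NO_CURRENT_POSITION
    return PositionStatus.HAS_POSITION
```

Keeping "field absent" separate from "field is null" is what would let the UI treat a failed or skipped scrape differently from a profile that genuinely lists no current role.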

Iten-No-404 commented 4 months ago

@Iten-No-404 mind sharing the problematic profiles so I can take a closer look?

Sure, @KnockerPulsar. Here they are:

KnockerPulsar commented 4 months ago

All of them seem to have valid data on my end. I suspect two things:

  1. The page not being fully rendered: There's a delay between us requesting the page and the response, and another before the full page is rendered, since LinkedIn responds with JavaScript and not raw HTML. I haven't found a way to check if the page is "fully rendered" yet, so we can work around this by increasing the delays used.
  2. Privacy settings: Since you're not using your own profile, I suspect that the five profiles you mentioned might have some parts of their profiles hidden for accounts they're not connected with.
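
For the first suspicion, one alternative to longer fixed delays is to poll until the element of interest shows up. If the script is Selenium-based, `WebDriverWait` together with `expected_conditions` provides this kind of explicit wait; the library-free sketch below shows the same idea, with all names and timings chosen only for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    predicate: Callable[[], Optional[T]],
    timeout: float = 15.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Poll predicate() until it returns a truthy value or the timeout expires.

    Returns the truthy value, or None if the deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if clock() >= deadline:
            return None
        sleep(interval)
```

The predicate would be something like "find the experience section and return it, or return None", so the scraper moves on as soon as the section exists instead of always sleeping for the worst case.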

Can you try running a couple more times and seeing if the results change? You can slim down the class YAML file to only these five profiles and comment out the headless option in the script to see if the profile fetched is different between the testing and your personal account.

Iten-No-404 commented 4 months ago

2. Privacy settings: Since you're not using your own profile, I suspect that the five profiles you mentioned might have some parts of their profiles hidden for accounts they're not connected with.

It can't be this option since I viewed them using the account I am parsing them with and their current position could be viewed without any problem.

Can you try running a couple more times and seeing if the results change? You can slim down the class YAML file to only these five profiles

On it.

Iten-No-404 commented 4 months ago

@KnockerPulsar, you were correct. It was probably an internet connectivity issue. I tried running the script on the missing 5 only and it got them all correct.

- title: '🧑🏻‍🎓 CMP23 Class 👩🏻‍🎓 '
  description: Enter our computer engineering class, where students exhibit an exceptional
    level of expertise. They handle computers and technology with remarkable ease,
    crafting remarkable innovations effortlessly.
  markdown_title: Student Info
  items:
  - title: 📃 Class Roster
    description: Members of the class are as follows
    items:
    - name: Ahmed Mahmoud ElGhareeb
      image: https://media.licdn.com/dms/image/C4E03AQH2DJsvo1VLXA/profile-displayphoto-shrink_200_200/0/1646243329706?e=1720656000&v=beta&t=31Af4Mh1izlpwL9nY7bB6Ghq4kx6CVyYj5PqIyLAurQ
      linkedin_url: https://www.linkedin.com/in/ahmed-ghareeb-b762391b2/
      markdown: You can know more about Ahmed and reach him out by visiting his [LinkedIn]()
        profile.
      title: Software Engineer @ Microsoft
      top_skills: null
      current_position: Software Engineer, Microsoft · Full-time, May 2024 - Present
        · 1 mo
    - name: Ahmed Mahmoud Mohamed
      image: https://media.licdn.com/dms/image/D4D03AQFQ8Aw7UxumhA/profile-displayphoto-shrink_200_200/0/1686027677967?e=1720656000&v=beta&t=_qf2cJJiQsvAGMC3kFYKVVyps0HFbFNQWhGAHAES7-w
      linkedin_url: https://www.linkedin.com/in/ahmed-mahmoud-970378205/
      markdown: You can know more about Ahmed and reach him out by visiting his [LinkedIn]()
        profile.
      title: Software Engineer @GizaSystems
      top_skills: null
      current_position: Software Engineer, Giza Systems, Mar 2024 to Present · 3 mos
    - name: Mohamed Akram Abdelfattah
      image: https://media.licdn.com/dms/image/C4D03AQGEdHi4bPbvfQ/profile-displayphoto-shrink_200_200/0/1593717419051?e=1720656000&v=beta&t=TBuafeYjfCCpt6UGq6ON2bamkmWGr9Pmy4C2WPBbiPA
      linkedin_url: https://www.linkedin.com/in/mohamed-akram99/
      markdown: You can know more about Mohamed and reach him out by visiting his
        [LinkedIn]() profile.
      title: Computer Engineer
      top_skills: null
      current_position: Software Engineer, NTG Clarity · Full-time, Feb 2024 - Present
        · 4 mos
    - name: Mariem Mohamed Zein
      image: https://media.licdn.com/dms/image/D4D03AQE4vRJf2ZGHYA/profile-displayphoto-shrink_400_400/0/1668196454248?e=1720656000&v=beta&t=CnqSqY-TCNbiOs3iFUH8U_54rz5rR6lg3Rl07dm0mtg
      linkedin_url: https://www.linkedin.com/in/mariem-muhammed-1009801b1/
      markdown: You can know more about Mariem and reach her out by visiting her [LinkedIn]()
        profile.
      title: Associate Software Engineer
      top_skills: null
      current_position: Associate Software Engineer, Sumerge, Dec 2023 - Present ·
        6 mos
    - name: Noran Hany Mohamed
      image: ''
      linkedin_url: https://www.linkedin.com/in/noran-hany-a69103215/
      markdown: You can know more about Noran and reach her out by visiting her [LinkedIn]()
        profile.
      title: Computer Vision Engineer @ Voyance
      top_skills: null
      current_position: Computer Vision Engineer, Voyance · Full-time, Nov 2023 -
        Present · 7 mos

I will merge them with the main file and upload them in a commit.

EssamWisam commented 4 months ago

@Iten-No-404 I have just checked the most recent deployment and the UI looks perfect. Thanks.

One minor exception was when I clicked on a student in CCE25.

image

Also, what do you think about not showing the green circle or "Current Position" field at all for profiles with no LinkedIn accounts?

I can give that a shot if you are busy.

Thank you.

Iten-No-404 commented 4 months ago

@Iten-No-404 I have just checked the most recent deployment and the UI looks perfect. Thanks.

Don't mention it, @EssamWisam. :)

One minor exception was when I clicked on a student in CCE25.

You're right. I traced it and found that the problem lay in the YAML files of the 2025 class: they had invalid image links. This means the exception will happen every time an image link becomes outdated, which is why I added error handling using onError.

The options I see assuming recall is improved or if we can know whether scraping failed for a specific individual:

  • For those with no LinkedIn link in the first place, don't put the green circle, we can define it as "Those with LinkedIn accounts and appear to be unemployed"
  • For those where scraping failed maybe put a thinner orange circle (and it will be defined as "Career information about that person couldn't be extracted").

Also, what do you think about not showing the green circle or "Current Position" field at all for profiles with no LinkedIn accounts?

Well, we have a limited amount of data per student: the name (which acts as an ID and is the only field that is never empty), plus the LinkedIn account link, image link, title, current position, and top skills, all of which are optional. Many students haven't added their top skills yet and some prefer not to add an image, so we can ignore those two fields. The name adds nothing either, since it is always present. That leaves the LinkedIn link, title, and current position as the only variables to check. So my suggestion is similar to your previous one:

I also thought about using a check on the title but I noticed that almost everyone who has a LinkedIn account has a title so I considered it as redundant info in regards to the checks. Here's a preview of the solution: image image

I can give that a shot if you are busy.

Thank you for the offer. I had my hands full these past few days but thankfully I pulled through. I'll push these changes so you can view them in the preview and let me know your opinion on whether something could be improved.

EssamWisam commented 4 months ago

Thanks @Iten-No-404. The bug is indeed fixed. When I mentioned the check, I meant that for those with no LinkedIn account we write nothing instead of Current Position: N/A (i.e., it will show only This is one fantastic CMP student.)

That said, I like your suggestion. But something that confused me a little is that the CCE 25 class has green circles despite no LinkedIn accounts being there.

Okay, one thing crossed my mind. Maybe instead of writing "Open to Work", we write "Currently Unemployed" or something like that, because it's what aligns exactly with the definition; after all, someone could be taking a pause or be drafted, and that's why they aren't working.

Iten-No-404 commented 4 months ago

When I mentioned the check, I meant that for those with no LinkedIn account we write nothing instead of Current Position: N/A (i.e., it will show only This is one fantastic CMP student.)

Alright, simple enough. Done.

That said, I like your suggestion. But something that confused me a little is that the CCE 25 class has green circles despite no LinkedIn accounts being there.

Small mistake on my end. Fixed.

Okay, one thing crossed my mind. Maybe instead of writing "Open to Work", we write "Currently Unemployed" or something like that, because it's what aligns exactly with the definition; after all, someone could be taking a pause or be drafted, and that's why they aren't working.

I get your point. Done. I also added Arabic versions for the tooltips.

Thanks @Iten-No-404

Not at all @EssamWisam, thank you for noticing these bugs as well. If there's anything else that needs fixing, let me know. Otherwise, we can merge this PR.

EssamWisam commented 4 months ago

image

Just one last minor observation that either needs no solution or a fast one: I noticed my friend Mohamed Saad has no current employment (well, technically he has, but it's not mentioned on LinkedIn) but still has no green circle.

Other than that, I say we merge this.

Sincere thanks again @KnockerPulsar and @Iten-No-404.

Iten-No-404 commented 4 months ago

Just one last minor observation that either needs no solution or a fast one: I noticed my friend Mohamed Saad has no current employment (well, technically he has, but it's not mentioned on LinkedIn) but still has no green circle.

We definitely can't have that. Surprisingly, he is the only one who had this problem in the YAML. I made a small fix to the script by adding some if conditions and ran the script on the whole C2023.yaml for good measure.

Other than that, I say we merge this.

Consider it done.

Sincere thanks again @KnockerPulsar and @Iten-No-404.

You're very welcome.