PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.81k stars 212 forks

Monster results contain CSS in blurb field #81

Closed PaulMcInnis closed 4 years ago

PaulMcInnis commented 4 years ago

Description

Hey everyone, I was gonna cut us a new release, but I noticed an issue:

Currently, every job scraped from Monster with the GlassDoorStatic scraper gets a blurb like the one below:

.css-1noe2rc *{color:#505863;line-height:1.4em;}.css-1noe2rc .ecgq1xb1{padding-left:0;}.css-1noe2rc .ecgq1xb1 .ecgq1xb0{margin:0 0 8px 0;}.css-1noe2rc ol,.css-1noe2rc ul{padding-left:32px;}.css-1noe2rc li{margin:10px;margin-bottom:5px;margin-left:20px;line-height:1.4em;}.css-58vpdc{margin-bottom:24px;}.css-58vpdc ul{margin:5px 0 10px 20px;}.css-58vpdc ul > br{display:none;}.css-58vpdc ul > li{margin-left:0;}.css-58vpdc li{padding:0;}PlayStation isn't just the Best Place to Play it's also the Best Place to Work. We've thrilled gamers since 1994, when we launched the original ...

Since Glassdoor runs first, most of the duplicates turn up in the other job sites, and as a result most of the jobs I scrape now have blurbs like the one above.

I was just checking out GlassDoorDynamic and it seems to work well, but it misses the date and blurb fields for jobs. As a side note, watching the browser windows go by made me feel like I was in The Matrix 😎

Perhaps it would be easier for us to purge the CSS from these blurbs in the GlassDoorStatic scraper, with a regex for the longest string in the raw scrape? Open to suggestions.
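To make the regex idea concrete, here is a minimal sketch of one way it could work. It assumes the CSS always appears as a run of "selector{declarations}" rules at the very start of the blurb, as in the sample above. The names LEADING_CSS and strip_leading_css are hypothetical and not part of JobFunnel.

```python
import re

# Hypothetical helper, not part of JobFunnel.
# Matches one or more CSS rules of the form "selector{declarations}"
# anchored at the start of the text, plus any whitespace that follows.
LEADING_CSS = re.compile(r'^(?:[^{}]*\{[^{}]*\})+\s*')


def strip_leading_css(blurb: str) -> str:
    """Return the blurb with any leading run of CSS rules removed."""
    return LEADING_CSS.sub('', blurb)


raw = (
    ".css-58vpdc{margin-bottom:24px;}.css-58vpdc li{padding:0;}"
    "PlayStation isn't just the Best Place to Play"
)
print(strip_leading_css(raw))
```

One caveat with this approach: if the job description itself contains curly braces, everything up to the last brace pair reachable from the start would be removed too, so a stricter selector pattern may be needed.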

Alternatively we could just:

Steps to Reproduce

Easily reproducible with the stock YAML on current master by running funnel -kw Engineer; the resulting ./search/masterlist.csv will contain the results described above.
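For anyone trying to confirm whether their own masterlist is affected, a rough check could look like the sketch below. It assumes the CSV has a header row with a column named blurb (an assumption based on the field name used in this issue). CSS_RULE and rows_with_css are hypothetical names, and the pattern is only a heuristic for class-selector rules.

```python
import csv
import io
import re

# Heuristic: a class selector followed by a "{property:value}" rule body.
CSS_RULE = re.compile(r'\.[\w-]+[^{}]*\{[^{}]*:[^{}]*\}')


def rows_with_css(csv_file, column="blurb"):
    """Yield (row_number, value) for rows whose column looks like it holds CSS."""
    for number, row in enumerate(csv.DictReader(csv_file), start=1):
        value = row.get(column) or ""
        if CSS_RULE.search(value):
            yield number, value


# Small in-memory stand-in for a masterlist, for illustration only.
sample = io.StringIO(
    "title,blurb\n"
    "Engineer,.css-58vpdc{margin-bottom:24px;}PlayStation isn't just\n"
    "Analyst,We are hiring an analyst\n"
)
for number, value in rows_with_css(sample):
    print(number, value[:40])
```

To run it against a real scrape, pass an open file instead of the sample, e.g. open("./search/masterlist.csv", newline="").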

Expected behavior

The blurb should not contain CSS.

Actual behavior

The blurb contains CSS.

Environment

bunsenmurder commented 4 years ago

Hey @PaulMcInnis, are you still getting this issue? I was unable to reproduce it.

PaulMcInnis commented 4 years ago

I just checked on current master with the default settings.yaml and was unable to repro. I'll close this issue and cut a new release.