IonicaBizau / scrape-it

🔮 A Node.js scraper for humans.
http://ionicabizau.net/blog/30-how-to-write-a-web-scraper-in-node-js
MIT License
4.01k stars 221 forks

Problem with closest #66

Closed LandonSchropp closed 6 years ago

LandonSchropp commented 7 years ago

Thanks for creating such an awesome library! It was really easy to get up and running, and I'm enjoying it quite a bit.

I'm trying to write a simple function that scrapes a StackOverflow profile. Here's what I have so far:

function fetchUser(id) {
  return scrapeIt(`https://stackoverflow.com/users/${ id }`, {
    name: ".user-card-name",
    location: {
      selector: ".locationIcon",
      closest: "li"
    }
  });
}

I tried running this function against my own profile:

fetchUser(262125).then(console.log);

Here's what I got:

{ 
  name: 'LandonSchropp',
  location: '' 
}

This is what the DOM looks like on that page:

<ul class="list-unstyled">
  <li>
    <svg role="icon" class="svg-icon iconLocation" width="18" height="18" viewBox="0 0 18 18">
      <path d="..."></path>
    </svg>
    Seattle, WA
  </li>
  ...
</ul>

Is this a bug? Shouldn't the closest li to iconLocation contain the text Seattle, WA?

Thanks!
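
For reference, a dependency-free sketch of how a closest lookup behaves. It is not scrape-it's or cheerio's actual implementation; makeNode, findByClass, and closest are hypothetical helpers over plain objects. The idea is that the selector picks the starting elements and closest then walks up their ancestors, so a selector that matches nothing gives closest nothing to start from.

```javascript
// Hypothetical stand-in for a parsed DOM node: tag name, class list, parent link.
function makeNode(tag, classes = [], parent = null) {
  return { tag, classes, parent };
}

// Return every node that carries the given class name.
function findByClass(nodes, className) {
  return nodes.filter(node => node.classes.includes(className));
}

// Walk from the node up through its ancestors and return the first one
// whose tag matches, or null when none does.
function closest(node, tag) {
  for (let current = node; current !== null; current = current.parent) {
    if (current.tag === tag) return current;
  }
  return null;
}

// Mirror of the markup in the question: ul > li > svg.iconLocation
const ul = makeNode("ul", ["list-unstyled"]);
const li = makeNode("li", [], ul);
const svg = makeNode("svg", ["svg-icon", "iconLocation"], li);
const nodes = [ul, li, svg];

// A selector that matches the svg can climb to its li.
const matched = findByClass(nodes, "iconLocation");
// A selector that matches no element leaves nothing to climb from.
const missed = findByClass(nodes, "locationIcon");
```

Here matched holds the svg node, and calling closest(matched[0], "li") returns the li above it, while missed is empty.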

IonicaBizau commented 7 years ago

I think it's because of the svg element. Access that li element without using the SVG classes.

I don't know the context, but maybe the REST API would be a better fit for Stack Overflow.



LandonSchropp commented 7 years ago

Is it possible to fix the issue with the SVG element? I went down the road of the Stack Overflow REST API, but using scrapeIt was much easier.

IonicaBizau commented 7 years ago

As long as cheerio is parsing the SVG, this should work. Can you share a minimal script, including the function call, so I can take a deeper look?

Thanks!

LandonSchropp commented 7 years ago

Sure. Here's a minimal example. Thanks!

LandonSchropp commented 7 years ago

If you change the selector for location to .list-unstyled and remove the closest value, you should be able to see the iconLocation class on one of the SVG elements.
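
A rough sketch of that debugging step, assuming the how: "html" option (used later in this thread) is what makes the class names visible in the output. The field name markup and the helper hasClass are made up for illustration.

```javascript
// Debugging sketch: scrape the list's raw HTML instead of its text,
// so the class attributes of the nested svg elements stay visible.
const debugOptions = {
  markup: {
    selector: ".list-unstyled",
    how: "html"
  }
};

// Hypothetical helper: report whether a class name appears inside any
// class="..." attribute of the scraped markup string.
function hasClass(markup, className) {
  return new RegExp(`class="[^"]*\\b${className}\\b[^"]*"`).test(markup);
}
```

Passing debugOptions as the second argument to scrapeIt, as in the original function, and running hasClass over the returned markup should show which class name is really on the page.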

IonicaBizau commented 6 years ago

@LandonSchropp Sorry for the late answer (I was in India when you opened this discussion and totally missed the latest comments!).

So, it turns out that cheerio does parse the SVG code and it works. There's just a small typo in your code. The class should be iconLocation (like in your last comment) and not locationIcon. 🙈

const scrapeIt = require('..');
const R = require('ramda');

function fetchUser(id) {
  return scrapeIt(`https://stackoverflow.com/users/${ id }`, {
    name: {
      selector: ".user-card-name",
      how: "html",
      convert: R.replace(/\n[\s\S]*/m, "")
    },
    location: {
      selector: ".iconLocation",
      closest: "li"
    }
  });
}

fetchUser(262125).then(console.log);

IonicaBizau commented 6 years ago

Ah, and one more thing: there was a bug (which I'm fixing right now) where closest was ignored when no convert function was set.
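
For anyone on a version without the fix, a possible workaround is sketched below. It assumes the bug behaves exactly as described, so that supplying any convert function makes closest take effect. The trimming converter is only an illustrative choice.

```javascript
// Assumed workaround for affected versions: give the field a convert
// function so that closest is not skipped. Trimming also removes the
// whitespace around the text inside the li.
const locationField = {
  selector: ".iconLocation",
  closest: "li",
  convert: text => text.trim()
};
```

This object would replace the location entry in the options passed to scrapeIt. On fixed versions the convert function is harmless and only trims the result.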

LandonSchropp commented 6 years ago

Whoops. Thanks for pointing out that issue with my code, and thanks for fixing the closest bug!