gajus / surgeon

Declarative DOM extraction expression evaluator. 👨‍⚕️

single and double quotes don't work #21

Closed. foundAhandle closed this issue 6 years ago.

foundAhandle commented 6 years ago

Can't use attribute selectors.

gajus commented 6 years ago

Sure you can.

Can you share an example of a query?

You are probably simply missing quotes.

foundAhandle commented 6 years ago

I found out about Surgeon through your comment at the bottom of A Guide to Automating & Scraping the Web with JavaScript (Chrome + Puppeteer + Node JS).

I disagree with your statement that it is a "really silly idea to use Puppeteer to “scrape the web”". Scraping sites is more than just extracting data; it's also navigating those sites - clicking buttons, working through search engines - in order to get to the data. That's where Puppeteer/Chrome or CasperJS/PhantomJS come into play. I've used both combinations to build web crawlers, and they make things many times easier.

Also, your assertion that the example web site is a SPA is not correct. Clicking on a book link navigates the browser to a different page each time, so it is not a single-page application.

In spite of this, I'm trying to use Surgeon with Puppeteer and that's where I ran into the Issue that I filed. Here's a gist.

gajus commented 6 years ago

> In spite of this, I'm trying to use Surgeon with Puppeteer and that's where I ran into the Issue that I filed. Here's a gist.

As I said, you are missing quotes around your selector, i.e. it should be sm '[sitetranslationname="$barstate_80"]' | rtc.

DaniGuardiola commented 6 years ago

@foundAhandle I don't know how much experience you have with scraping, but I've only had to use a headless browser once (and I've scraped and crawled a lot of different things). In that case, a weird, complicated encryption (anti-scraping) system was in place and I decided it was easier to just run a PhantomJS setup on my local machine for a one-off scrape.

But in pretty much any other case, it was always easier, more straightforward, and 100x faster and cheaper to just find out the server APIs and use them. When you click a button to receive data, a request is being made to a server with a certain interface. You just need to understand how those endpoints work and use them. Believe me, it's way cleaner and cheaper, and will save you some headaches. And money.

> clicking buttons, working through search engines - in order to get to the data

You don't need to actually load the page to get the data :)

DaniGuardiola commented 6 years ago

@foundAhandle there are a few specific cases where headless browsers might make sense though, like taking screenshots. But those are very marginal and can be implemented separately as helpers (scrape everything with cheerio / surgeon / request and have a function load a specific URL in headless Chrome to take a screenshot), something like the sketch below.

Believe me, that makes much more sense when scraping. Headless-browser scraping is hard to scale and becomes a big headache when your project starts growing in volume and complexity.
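A rough sketch of that helper pattern with Puppeteer - the URL and output path here are just placeholders:

```js
// Rough sketch of a screenshot helper (hypothetical URL and file path).
const puppeteer = require('puppeteer');

const takeScreenshot = async (url, path) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.screenshot({path});
  await browser.close();
};

// Scrape everything else with request/cheerio/surgeon; call this only
// when you actually need a rendered page.
takeScreenshot('https://example.com', 'example.png').catch(console.error);
```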

foundAhandle commented 6 years ago

@gajus I didn't include an outer set of quotes per the note in the docs for Built-in subroutine aliases where it states: "Note regarding s ... alias. The CSS selector value is quoted."

@DaniGuardiola I have to scrape https://lw.com from its home page, through its search engine, and extract data (name, email, phone, practice, description, etc) from all of the attorney bios. Based on your comments, how would you go about doing that?

gajus commented 6 years ago

> @gajus I didn't include an outer set of quotes per the note in the docs for Built-in subroutine aliases where it states: "Note regarding s ... alias. The CSS selector value is quoted."

It should be augmented to say: ... unless the expression itself includes quotes.

i.e. sm .foo does not require quotes; sm '[foo="bar"]' needs quotes.
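For completeness, a minimal runnable sketch, assuming the default Surgeon setup from the README - the markup is made up to match the selector from your gist:

```js
import surgeon from 'surgeon';

// Default Surgeon instance, as in the README.
const x = surgeon();

// Made-up markup matching the attribute selector from the gist.
const subject = '<div sitetranslationname="$barstate_80">OPEN</div>';

// The expression itself contains quotes, so the selector must be quoted.
const result = x(`sm '[sitetranslationname="$barstate_80"]' | rtc`, subject);

console.log(result); // the textContent of the matching element(s)
```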

DaniGuardiola commented 6 years ago

@foundAhandle ok let me help you with this :)

First, load the page, open the devtools (I'm assuming Chrome), and go to the Network tab. Filter to only show 'XHR' requests, as that's the kind of request most applications make. It usually helps to click the 'clear' button to make things easier. Now you're ready to inspect the requests.

Proceed with the action you'll be inspecting - in this case, the usage of the search engine. You will see a request appear in the devtools. Click it to get the details. You will see the URL and the query parameters being used. You just need to understand how these parameters work (use the UI to select your desired parameters and trigger the search with the UI button).

Then you parse the response, which will probably be JSON (easily parsed with JSON.parse) or HTML (use @gajus's tool for that).

That will probably give you a list that you can iterate over by changing the parameters, containing basic data and each item's URL. Then, with that list, you can proceed to scrape those pages for complete details.

This would be a very good approach; see the sketch below. I recommend MongoDB and the 'request' npm module - they will make your life way easier. And you'll be surprised how fast your scraper runs compared with the headless-browser solution.
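To make that concrete, a rough sketch with the request module - the endpoint URL, query parameters, and response shape are placeholders for whatever you find in the Network tab:

```js
// Rough sketch of the request-based flow. The endpoint and query
// parameters are placeholders; the real ones come from inspecting the
// site's XHR requests in devtools.
const request = require('request');

request({
  url: 'https://example.com/api/search', // placeholder endpoint
  qs: {
    letter: 'A', // placeholder parameters observed in devtools
    page: 1,
  },
  json: true, // parses the JSON response body automatically
}, (error, response, body) => {
  if (error) {
    throw error;
  }

  // body is a parsed object; iterate the list it contains (field name
  // is a placeholder) and queue each item URL for a second pass that
  // scrapes the detail pages.
  for (const item of body.results || []) {
    console.log(item.url);
  }
});
```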

Let me know if you have any questions / need help with anything :)

foundAhandle commented 6 years ago

@DaniGuardiola I'm familiar with parsing requests and looking at GET and POST name/value pairs. What exactly is your workflow for getting the XHR request from Chrome into the npm request module and/or Mongo? Are you using HAR files at all? How are you handling cookies, sessions, etc.?

foundAhandle commented 6 years ago

@DaniGuardiola You still there? So I've been checking out the request module and I ran into a problem: how do I execute client-side code? The two pages I've tested it on both have DOM-altering code that injects the elements that I need. What's the solution?

The links:
https://www.skadden.com/professionals?skip=1000&letter=A
https://www.gtlaw.com/en/professionals?pageNum=100&letter=A

DaniGuardiola commented 6 years ago

@foundAhandle sorry, busy days. I see you sent me an email. I'll give you a few contact options and I will try to reserve 15 minutes soon to call you and assist you if you want. It will be faster :)

foundAhandle commented 6 years ago

OK. Sounds good.

foundAhandle commented 6 years ago

Hey Dani. I got your email; unfortunately, my emails to you are being blocked. I'd like to talk - I sent you an email from prevail@nyc.rr.com with my phone number. Please call me or respond to that email. Thanks.