YusukeIwaki / puppeteer-ruby

A Ruby port of Puppeteer
Apache License 2.0

XPath attributes without evaluation() #305

Closed n1xn closed 1 year ago

n1xn commented 1 year ago

Simple description about the feature

The title is the simplest description possible.

Description

I am scraping a fairly large list of product links (> 1000) and run into puppeteer throwing cannot find context with specified id undefined. I believe this happens because I collect the nodes with XPath and then evaluate the resulting array in the next step. I would like to skip the evaluate operation after using Sx() and instead read / map the desired attribute straight into a variable. While experimenting with what Sx()[] returns, I found that the attributes already appear to be loaded, which would mean the evaluation step afterwards is not needed. My problem is that I have to strip the JSHandle: prefix from the string when accessing an attribute this way.

Puppeteer reference

Current issue

  1. Here is the first example. This is the one that actually fails after some 'hundred-ish' iterations.

    paginations.each do |pagination_step|
      xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
      link_nodes = page.Sx(xpath_links)

      link_nodes.each do |product_link|
        # evaluate will be called thousands of times.      <---------
        href = page.evaluate('e => e.href', product_link)
        # href = product_link.evaluate('e => e.href')

        product = { href:, category: }
        products.push(product)
      end
    end
  2. After realizing that this exceeds some limit of the browser / puppeteer, I tried to reduce the evaluation to a single call per pagination step that maps every node to its href.

    paginations.each do |pagination_step|
      xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
      link_nodes = page.Sx(xpath_links)

      # executing now only on each pagination step - better.       <---------
      product_links = page.evaluate('e => e.map((el) => el.href)', link_nodes)
      product_links.each do |product_link|
        # but product_link is actually empty.       <---------
        product = { href: product_link, category: }
        products.push(product)
      end
    end
  3. As the comment in code 2 says, the mapped evaluation does not return any values (I also tried el.getAttribute('href')). So I tried accessing the properties of the Sx() results directly in Ruby via property('href'). That does return the value, but prefixed with JSHandle:, which I stripped to get it working.

    paginations.each do |pagination_step|
      xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
      link_nodes = page.Sx(xpath_links)

      # do not evaluate anything - loop through nodes
      link_nodes.each do |product_link|
        # access the current nodes property and remove JSHandle: prefix.      <---------
        href = product_link.property("href").to_s.gsub('JSHandle:', '')

        product = { href:, category: }
        products.push(product)
      end
    end
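
One possible alternative to the three attempts above is to resolve the XPath inside the page itself and return only plain strings, so that a single evaluate call per pagination step is enough and no element handles are passed back and forth. The following is a minimal sketch under that assumption, not something taken from the puppeteer-ruby docs: collect_hrefs and COLLECT_HREFS_JS are hypothetical names, and page is assumed to respond to evaluate(js, *args) the way page.evaluate is used in the examples above.

```ruby
# JavaScript that runs inside the page. It resolves the XPath with the
# browser's own document.evaluate and returns an array of href strings.
COLLECT_HREFS_JS = <<~JAVASCRIPT
  (xpath) => {
    const snapshot = document.evaluate(
      xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
    );
    const hrefs = [];
    for (let i = 0; i < snapshot.snapshotLength; i++) {
      hrefs.push(snapshot.snapshotItem(i).href);
    }
    return hrefs;
  }
JAVASCRIPT

# Hypothetical helper (not part of puppeteer-ruby).
# Assumes `page` responds to evaluate(js, *args) and returns the
# serialized result of the JavaScript function.
def collect_hrefs(page, xpath)
  page.evaluate(COLLECT_HREFS_JS, xpath)
end
```

Inside the pagination loop this would be called as collect_hrefs(page, xpath_links), and each returned string could be pushed into products directly. Whether this avoids the context error depends on why the context goes away in the first place (for example a navigation between steps), so it is worth verifying against the real site.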

Usecase / Motivation

I am not sure whether I am using this correctly or have missed a concept, but as mentioned I have a problem with page.evaluate(). I would like to get attributes by XPath without the .to_s.gsub('JSHandle:', '') hack. See the code below for my suggestion.

xpath = '//expression'
xpath_nodes = page.Sx(xpath)

xpath_nodes.each do |node|
  # proposed API: read the attribute directly from the node
  href = node.attribute('href')
end
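
Until something like attribute exists, a small wrapper may come close to the suggestion above without string replacement. This is only a sketch: property_value is a hypothetical helper, and it assumes that the handle returned by property(name) responds to json_value (the Ruby counterpart of Puppeteer's jsonValue). Note that it reads the DOM property rather than the HTML attribute, which for href usually means the resolved absolute URL, and that it still talks to the browser once per node.

```ruby
# Hypothetical helper (not part of puppeteer-ruby).
# Assumes `handle` responds to property(name) and that the returned
# handle responds to json_value, yielding a plain Ruby value.
def property_value(handle, name)
  property_handle = handle.property(name)
  property_handle.json_value
end
```

With this in place, the loop body from code 3 could read href = property_value(product_link, 'href') instead of chaining to_s and gsub.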