I am scraping a fairly large list of product links (> 1000) and Puppeteer throws "cannot find context with specified id undefined". This is because I collect the nodes with XPath and evaluate the resulting array in the next step. I would like to skip the evaluate call after Sx() and read the desired attribute straight into a variable. While experimenting with the elements returned by Sx()[], I found that the attributes are actually all loaded, which means the evaluation step afterwards should not be needed. My problem is that the value comes back as a string prefixed with JSHandle:, which I have to strip off myself.
Here is the example that actually fails after a few hundred iterations.
paginations.each do |pagination_step|
  xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
  link_nodes = page.Sx(xpath_links)
  link_nodes.each do |product_link|
    # evaluate is called once per link, i.e. thousands of times. <---------
    href = page.evaluate('e => e.href', product_link)
    # href = product_link.evaluate('e => e.href')
    product = { href:, category: }
    products.push(product)
  end
end
After realizing that this exceeds some limit of the browser / Puppeteer, I tried to optimize it so that evaluate runs only once per pagination step and returns the href of every node.
paginations.each do |pagination_step|
  xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
  link_nodes = page.Sx(xpath_links)
  # executing now only on each pagination step - better. <---------
  product_links = page.evaluate('e => e.map((el) => el.href)', link_nodes)
  product_links.each do |product_link|
    # but product_link is actually empty. <---------
    product = { href: product_link, category: }
    products.push(product)
  end
end
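One way to keep a single evaluation per pagination step without passing element handles at all is to resolve the XPath inside the page and return plain strings. The following is a minimal sketch, not a confirmed fix for the context error: collect_hrefs and COLLECT_HREFS_JS are hypothetical names, and the only library call assumed is page.evaluate(function_source, *args) as already used in the snippets above. The JavaScript part relies on the standard DOM document.evaluate API.

```ruby
# JavaScript that runs inside the page. It resolves the XPath there and
# returns an array of strings, so no element handles have to be sent
# from Ruby to the browser.
COLLECT_HREFS_JS = <<~JAVASCRIPT
  (xpath) => {
    const snapshot = document.evaluate(
      xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
    );
    const hrefs = [];
    for (let i = 0; i < snapshot.snapshotLength; i++) {
      hrefs.push(snapshot.snapshotItem(i).href);
    }
    return hrefs;
  }
JAVASCRIPT

# Hypothetical helper (not part of puppeteer-ruby).
# `page` must respond to evaluate(function_source, *args).
def collect_hrefs(page, xpath)
  page.evaluate(COLLECT_HREFS_JS, xpath)
end
```

Inside the pagination loop this would be called as collect_hrefs(page, xpath_links), and each returned string could be pushed as { href:, category: } like before.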
As noted in the comment in the second snippet, the mapped evaluation does not return any values (I also tried el.getAttribute('href')). So I tried reading the property from the Sx result directly in Ruby via property('href'). That does return the value, but prefixed with JSHandle:, which I strip off to get it working.
paginations.each do |pagination_step|
  xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
  link_nodes = page.Sx(xpath_links)
  # do not evaluate anything - loop through nodes
  link_nodes.each do |product_link|
    # access the current node's property and remove the JSHandle: prefix. <---------
    href = product_link.property("href").to_s.gsub('JSHandle:', '')
    product = { href:, category: }
    products.push(product)
  end
end
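If the prefix has to be stripped for now, the workaround can be made a little safer. gsub removes every occurrence of JSHandle: in the string, including one that happens to appear inside a URL, whereas String#delete_prefix only removes it at the start. The sketch below uses a hypothetical helper name, value_from_handle. It also assumes, without having confirmed it for every puppeteer-ruby version, that the handle may offer a json_value method returning the wrapped value directly, and it falls back to prefix stripping when that method is absent.

```ruby
JS_HANDLE_PREFIX = 'JSHandle:'

# Hypothetical helper (not part of puppeteer-ruby).
def value_from_handle(property_handle)
  if property_handle.respond_to?(:json_value)
    # Assumption: json_value returns the wrapped primitive as a Ruby value.
    property_handle.json_value
  else
    # Fallback: strip the prefix only at the very start of the string.
    property_handle.to_s.delete_prefix(JS_HANDLE_PREFIX)
  end
end
```

In the loop above this would read href = value_from_handle(product_link.property("href")).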
Usecase / Motivation
I am not sure whether I am using this correctly or have missed a concept, but as mentioned I have a problem with page.evaluate(). I would like to get attributes by XPath without the .to_s.gsub('JSHandle:', '') hack.
See the code below for my suggestion.
xpath = '//expression'
xpath_nodes = page.Sx(xpath)
xpath_nodes.each do |node|
  # proposed API: return the attribute value as a plain string
  href = node.attribute('href')
end
Simple description about the feature
The title is the simplest description possible.