antchfx / htmlquery

htmlquery is a Go XPath package for querying HTML.
https://github.com/antchfx/xpath
MIT License

Slow parsing of multiple xpath expression #10

Closed: apancyborg closed this issue 5 years ago

apancyborg commented 5 years ago

For the following URL: https://victorious.fandom.com/wiki/Gallery:Victorious_Cast_in_Real_Life, a simple Find like the code below takes a few minutes to complete:

nodes := htmlquery.Find(
        doc, // the parsed HTML document
        "//div/@data-src | //link/@href | //img/@src | //img/@data-src | //img/@image-src | //div/@data-imgsrc | //object/@data | //a/@href | //span/@data-href | //img/@srcset | //source/@data-srcset | //li/@data-src | //source/@srcset")

Let me know if a pprof profile would help or if you need anything else from me. Thank you very much for the work on this library; apart from this issue, it works really well at scale.

zhengchun commented 5 years ago

@apancyborg, thanks for the feedback. I have tested your example, and it looks like unionQuery causes the performance problem.

As a temporary workaround, consider splitting the expression into multiple sub-queries:

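// Run each branch of the union as its own query.
n1 := htmlquery.Find(doc, "//div/@data-src")
n2 := htmlquery.Find(doc, "//link/@href")
// ... one Find call per remaining sub-expression (n3, ...) ...
// Concatenate the per-query results into a single slice.
nodes := append(n1, n2...)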

Elapsed before the split: 1m39.8566735s
Elapsed after the split: 4.9502ms
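The manual split above can be automated. The following is a minimal sketch, not part of htmlquery: `splitUnion` and `findEach` are hypothetical helper names introduced here for illustration. `splitUnion` breaks an expression on top-level `|` operators, skipping any `|` inside string literals, predicates, or parentheses, and `findEach` runs a caller-supplied find function once per sub-expression. The find function is passed in as a parameter so the sketch stays runnable without the library; in `main` it is replaced by a stand-in that just echoes the expression.

```go
package main

import (
	"fmt"
	"strings"
)

// splitUnion (hypothetical helper) splits an XPath expression on top-level
// '|' operators. A '|' inside a string literal, a predicate [...] or
// parentheses is left untouched.
func splitUnion(expr string) []string {
	var parts []string
	depth := 0     // nesting level of [] and ()
	var quote rune // active quote character, 0 when outside a literal
	start := 0     // byte offset where the current sub-expression begins
	for i, r := range expr {
		switch {
		case quote != 0:
			if r == quote {
				quote = 0
			}
		case r == '\'' || r == '"':
			quote = r
		case r == '[' || r == '(':
			depth++
		case r == ']' || r == ')':
			depth--
		case r == '|' && depth == 0:
			parts = append(parts, strings.TrimSpace(expr[start:i]))
			start = i + 1
		}
	}
	return append(parts, strings.TrimSpace(expr[start:]))
}

// findEach (hypothetical helper) runs find once per sub-expression and
// concatenates the results in the order the sub-expressions were given.
func findEach[T any](exprs []string, find func(string) []T) []T {
	var out []T
	for _, e := range exprs {
		out = append(out, find(e)...)
	}
	return out
}

func main() {
	exprs := splitUnion("//div/@data-src | //link/@href | //a[@title='a|b']/@href")
	// Stand-in for htmlquery.Find: it only echoes the expression it was given.
	got := findEach(exprs, func(e string) []string { return []string{e} })
	fmt.Println(len(got), got)
}
```

With the real library, the callback would presumably be `func(e string) []*html.Node { return htmlquery.Find(doc, e) }`. Note that the concatenated result is grouped by sub-expression and may contain duplicates if sub-expressions overlap, whereas an XPath union is a set, so callers that rely on ordering or uniqueness would need an extra pass.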

apancyborg commented 5 years ago

Thanks, splitting the expression into multiple sub-queries works great in production. This solves my problem nicely.