causal-agent / scraper

HTML parsing and querying with CSS selectors
https://docs.rs/scraper
ISC License

How to get the number from NodeID #129

Closed: NewsCli closed this issue 1 year ago

NewsCli commented 1 year ago
  let selector = Selector::parse("title").unwrap();
  let node = docs.select(&selector).next();
  let element = node.unwrap();
  let p = element.id();

p is NodeId(5)

I want the number 5. What should I do?

adamreichold commented 1 year ago

The "5" is an internal implementation detail of the underlying tree data structure. Why do you need to access it?

NewsCli commented 1 year ago

Got it, thanks. Have a nice day.

NewsCli commented 1 year ago

The "5" is an internal implementation detail of the underlying tree data structure. Why do you need to access it?

Actually, I want to access the position of a node within the HTML document (or DOM tree), for example <title></title> being in the 37th position of the HTML DOM Tree.

adamreichold commented 1 year ago

The underlying tree data structure is ego-tree which gives you:

How NodeId and NodeRef are implemented internally (i.e. that they are indices/references into a list of nodes) is an implementation detail which one normally does not depend on.

I guess there is a bit of a language barrier involved, but I think a straightforward translation of

in the 37th position of the HTML DOM Tree

into code would be

let node = docs.tree.nodes().nth(37).unwrap();

but it is not clear to me how to end up with "37" in the first place.

From your example code, I would guess that you want to store p itself instead of "5" and turn it back into a NodeRef via Tree::get.
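
To illustrate the suggestion above, here is a minimal sketch of why storing the handle is enough. It is a hypothetical, simplified model of an arena-backed tree, not ego-tree's actual code: the `Tree`, `NodeId`, `insert`, `get`, and `lookup_title` definitions below are all written for this example and only mirror the idea that an ID is an index into a list of nodes which callers treat as opaque.

```rust
// Hypothetical, simplified model of an arena-backed tree.
// This is NOT ego-tree's implementation; every name here is illustrative.

/// Opaque handle to a node. Callers store and compare it,
/// but are not meant to read the wrapped index.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(usize);

pub struct Tree<T> {
    nodes: Vec<T>, // all nodes live in one flat list
}

impl<T> Tree<T> {
    pub fn new() -> Self {
        Tree { nodes: Vec::new() }
    }

    /// Stores a value and returns a handle to it.
    pub fn insert(&mut self, value: T) -> NodeId {
        self.nodes.push(value);
        NodeId(self.nodes.len() - 1)
    }

    /// Turns a stored handle back into a reference,
    /// or returns None if the handle does not belong to this tree's range.
    pub fn get(&self, id: NodeId) -> Option<&T> {
        self.nodes.get(id.0)
    }
}

/// Builds a tiny tree, keeps only the handle for "title",
/// and resolves that handle again later.
pub fn lookup_title() -> Option<&'static str> {
    let mut tree = Tree::new();
    tree.insert("html");
    tree.insert("head");
    let title_id = tree.insert("title"); // store the handle itself, not a number
    tree.insert("body");
    tree.get(title_id).copied()
}
```

In this model the caller never needs the raw index: the handle can be kept in a struct or a map and resolved through `get` whenever the node is needed again, which is the same pattern as keeping `p` and calling `Tree::get` on the real tree.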

NewsCli commented 1 year ago

Based on the needs of our work, we have decided to maintain a node tree ourselves, which gives us more stable support and lets us adjust our strategy more flexibly.

Thanks!

adamreichold commented 1 year ago

I am sorry to hear that, as things tend to work out better if we collaborate on upstream projects. I still think there is an XY problem involved here, as we have not yet reached a common understanding of why "the position of the node within the DOM tree" is required in the first place.