James-LG / Skyscraper

Rust library for scraping HTML using XPath expressions
MIT License
30 stars 3 forks source link

xpath: xpath parser causes stack overflow #25

Closed ArtemGolovko closed 5 months ago

ArtemGolovko commented 5 months ago

I encountered stack overflow when parsing //head//script[not(@src) and not(contains(., "gtag"))];

I suspect that the parser might get in infinite loop. Or any long xpath expression can cause stack overflow.

James-LG commented 5 months ago

I'm not able to reproduce this. I have a test passing that looks like this

#[test]
fn parse_should_handle_multiple_nots_in_predicate() {
    // arrange
    let xpath_text = r###"//head//script[not(@src) and not(contains(., "gtag"))]"###;

    // act
    let xpath = crate::xpath::parse(xpath_text).unwrap();

    // assert
    assert_eq!(xpath.to_string(), xpath_text);
}

What function were you calling exactly to cause the stack overflow? Is the overflow when you are applying the expression maybe?

ArtemGolovko commented 5 months ago

I'm sorry I did not include bug reproduction.

Here it goes.

use skyscraper::{html, xpath::{self, XpathItemTree}};

fn main() {
    let document = html::parse(include_str!("../annupay.htm")).unwrap();
    let xpath_item_tree = XpathItemTree::from(&document);
    let xpath = xpath::parse(r###"//head//script[not(@src) and not(contains(., "gtag"))]"###).unwrap();
}

You can comment out the last line of main and see overflow go away.

I think instead of my html file (annupay.html), any other html file, of similar size, can be used. In my case my html file is 12 kb. If it is not the case, I will make sure to investigate my html file.

James-LG commented 5 months ago

Strange, I created a new rust project and used your exact code with some random html file and it works fine for me.

Does it fail for you with just the third line?

fn main() {
    let xpath =
        xpath::parse(r###"//head//script[not(@src) and not(contains(., "gtag"))]"###).unwrap();
}
ArtemGolovko commented 5 months ago

You're right, it's the last line of main, that causes the issue. The bug appears in debug build, but not in release build.

James-LG commented 5 months ago

Interesting, I think that means your right on the stack size limit and the release mode is optimizing it enough to fit in your stack. What kind of machine are you running the code on, is it a normal PC or some smaller device?

Try running the code in a thread with a higher stack size like in this stack overflow response https://stackoverflow.com/a/44042122

ArtemGolovko commented 5 months ago

Thank you, will try increasing the stack size.

And here is my stats.

Stat Value
CPU Intel Core i5-6200U
RAM 12 GB
OS Windows 10
rustc version 1.74.0

I don't think that my hardware is the problem. It might be a rustc bug.

ArtemGolovko commented 5 months ago

I check the stack size required is 1 mb 65 kb. For some reason in debug mode it fails to correctly allocate stack.

James-LG commented 5 months ago

I am able to reproduce this on Windows. I'll see what I can do.

James-LG commented 5 months ago

I think I have fixed this in #27

Try using version 0.6.1-beta.0 and let me know.

ArtemGolovko commented 5 months ago

Thank you. I test your fix and it works fine.