Wrong parsing in some sites

adamdruppe / arsd

This is a collection of modules that I've released over the years. Most of them stand alone, or have just one or two dependencies in here, so you don't have to download this whole repo.

http://arsd-official.dpldocs.info/arsd.html


Wrong parsing in some sites #50

Closed. TrueBers closed this issue 10 years ago.

TrueBers commented 10 years ago

I have encountered an error in dom.d while parsing http://www.avito.ru/rostov-na-donu . After calling parseGarbage, I tried to get some elements with querySelector, e.g. .link-count, and got null back. The same selector works fine in web browsers, and curl http://www.avito.ru/rostov-na-donu | grep 'link-count' finds the class in the page source.

adamdruppe commented 10 years ago

I'll look into this in a day or two, pretty busy this week, but I do see the issue.

adamdruppe commented 10 years ago

Sorry for the wait, I've been busy with a new job.

Anyway, the problem with this site is that it doesn't have an <html> tag!

As a temporary workaround you can parse it with this:

    document.parseGarbage("<html>"~readText("thatfile.html")~"</html>");

That just adds the html tags around the outside yourself. In the longer term, maybe dom.d can detect the missing tags and add them itself.
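A minimal sketch of what that kind of detection could look like on the caller's side, assuming a plain substring check is good enough for the page at hand. The helper name ensureHtmlTags is hypothetical and is not part of dom.d; it only adds whichever of the two tags it cannot find.

```d
import std.algorithm.searching : canFind;
import std.uni : toLower;

/// Hypothetical helper, not part of dom.d: adds the opening and/or closing
/// html tag only when a case-insensitive substring search cannot find it.
string ensureHtmlTags(string markup)
{
    // Lowercase copy used only for the search, so <HTML> is recognized too.
    auto lowered = markup.toLower;

    if (!lowered.canFind("<html"))
        markup = "<html>" ~ markup;   // opening tag missing: prepend it
    if (!lowered.canFind("</html"))
        markup = markup ~ "</html>";  // closing tag missing: append it

    return markup;
}
```

With that in place, the workaround above would become document.parseGarbage(ensureHtmlTags(readText("thatfile.html"))). The substring check is deliberately crude: text such as "<html" inside a comment or script would fool it, so treat it as a stopgap rather than a real fix.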

TrueBers commented 10 years ago

Oh, it really doesn't! It's my fault that I didn't notice such a stupid fact before creating the issue. Thank you! Sorry for distracting you from work =). P.S. It seems only the closing </html> tag is missing; the opening <html> tag is in place.