anaskhan96 / soup

Web Scraper in Go, similar to BeautifulSoup
MIT License

&nbsp; causes no text to be returned #8

Closed FM1337 closed 4 years ago

FM1337 commented 7 years ago

I'm hitting an odd issue while using soup to parse FMyLife's site for FMLs. It happens when an FML contains the &nbsp; entity:

<p class="block">
<a href="/article/today-on-the-bus-i-saw-my-ex-girlfriend-get-on-despite-several-seats-being-open-she-specifically_190836.html">
<span class="icon-piment"></span>&nbsp;
[Insert FML text here] FML
</a>
</p>

When I call Text() on it, it returns blank text and nothing else.

I usually call it using .Find("p", "class", "block").Find("a").Text(), and when the markup doesn't contain &nbsp;, the text comes back fine.

anaskhan96 commented 7 years ago

I'll look into this. Thank you.

anaskhan96 commented 7 years ago

It's not &nbsp;, it's the span tag. I ran the code myself, and the error logged to the console was "First child not a text node". That makes sense: the first child of the a tag (the span tag) is an ElementNode, not a TextNode, which causes an error to be thrown. I'll work around this in the Text() function so that it returns TextNode data even when the text nodes are siblings of ElementNodes.

FM1337 commented 7 years ago

After the update, I started getting errors:

2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
2017/06/07 00:11:34 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference

Not sure why it's happening.

anaskhan96 commented 7 years ago

It seems k's NextSibling is coming back as nil, and the code then tries to access k.Type. I've redirected this to a custom panic in the latest commit. I'll keep this issue open, though, to solve the real bug of traversing between elements in the Text() function.
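For anyone following along, here is a rough sketch of the kind of sibling traversal being discussed. It is not soup's actual implementation: the node type below is a hypothetical stand-in that only mirrors the shape of html.Node from golang.org/x/net/html, and firstText and errNoText are names made up for illustration. The point is that the loop checks for nil before reading k.Type and skips element nodes until it reaches a text node.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeType and node are hypothetical stand-ins that mirror the shape of
// html.Node from golang.org/x/net/html; they are not soup's own types.
type nodeType int

const (
	textNode nodeType = iota
	elementNode
)

type node struct {
	Type        nodeType
	Data        string
	FirstChild  *node
	NextSibling *node
}

// errNoText is an illustrative error value, not one exported by soup.
var errNoText = errors.New("no text node found")

// firstText returns the data of the first text node among n's direct
// children, skipping element nodes such as an empty <span>.
func firstText(n *node) (string, error) {
	if n == nil {
		return "", errNoText
	}
	// The loop condition guards against a nil sibling before k.Type is read.
	for k := n.FirstChild; k != nil; k = k.NextSibling {
		if k.Type == textNode {
			return k.Data, nil
		}
	}
	return "", errNoText
}

func main() {
	// Models <a><span></span>&nbsp;[text]</a>: span first, text second.
	text := &node{Type: textNode, Data: "\u00a0\n[Insert FML text here] FML\n"}
	span := &node{Type: elementNode, Data: "span", NextSibling: text}
	a := &node{Type: elementNode, Data: "a", FirstChild: span}

	s, err := firstText(a)
	fmt.Printf("%q %v\n", s, err)

	// An element whose only child is an empty element yields the error.
	td := &node{Type: elementNode, Data: "td", FirstChild: &node{Type: elementNode, Data: "span"}}
	_, err = firstText(td)
	fmt.Println(err)
}
```

Returning an error instead of panicking is a design choice made for this sketch only; the library may reasonably handle the "no text node" case differently.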

FM1337 commented 7 years ago

Applied latest update:

2017/06/07 11:33:10 Error occurred in Text() : No text node found
2017/06/07 11:33:10 Error occurred in Text() : No text node found
2017/06/07 11:33:10 Error occurred in Text() : No text node found
2017/06/07 11:33:10 Error occurred in Text() : No text node found
2017/06/07 11:33:10 Error occurred in Text() : No text node found
2017/06/07 11:33:10 Error occurred in Text() : No text node found

So yeah I'm seeing the custom error.

danilopolani commented 7 years ago

It works for me. Example code:

package main

import (
    "fmt"

    "github.com/anaskhan96/soup"
)

func main() {
    source := soup.HTMLParse(`<p class="block">
<a href="/article/today-on-the-bus-i-saw-my-ex-girlfriend-get-on-despite-several-seats-being-open-she-specifically_190836.html">
<span class="icon-piment"></span>&nbsp;
[Insert FML text here] FML
</a>
</p>`)

    soup.SetDebug(true)
    block := source.Find("p", "class", "block")
    fmt.Println(block.Find("a").Text())
}

Output:

 
[Insert FML text here] FML

How can I reproduce the error? I can't reproduce it with the Get() method on a real page either.

arma7x commented 6 years ago

In my case, the error reproduces when the <span> has no text inside, like this: <span></span>. My guess is that the error is not related to &nbsp;. You can check for &nbsp; simply by comparing with "\u00A0".

devarsh commented 6 years ago

I've faced a similar issue when any node is empty, i.e. <span></span> or <td></td>.

sinramyeon commented 6 years ago

package main

import (
    "fmt"

    "github.com/anaskhan96/soup"
)

// test is the markup from the original report.
const test = `
<p class="block">
<a href="/article/today-on-the-bus-i-saw-my-ex-girlfriend-get-on-despite-several-seats-being-open-she-specifically_190836.html">
<span class="icon-piment"></span>&nbsp;
[Insert FML text here] FML
</a>
</p>
`

func main() {
    // Same call chain as in the report.
    actual := soup.HTMLParse(test).Find("p", "class", "block").Find("a").Text()
    fmt.Println(actual)
}

It also returns [Insert FML text here] FML.

anaskhan96 commented 4 years ago

It's been a little over 3 years since this issue was opened and well over 2 since it went stale. Closing this, will reopen if the discussion/issue arises again.