Source not returned - Githubissues

connorjoleary / DeepCite

Traversing links to find the deep source of information

GNU General Public License v3.0

69 stars, 7 forks

Open connorjoleary opened 3 years ago

connorjoleary commented 3 years ago

Describe the bug: The true source is not given as an option.

To Reproduce: Steps to reproduce the behavior: enter the full quote with https://www.reddit.com/r/todayilearned/comments/n9evzh/til_theres_roughly_100_firefighter_arsonists/ as the link.

Expected behavior: The quote appears literally in the source, so how did DeepCite miss it?

connorjoleary commented 3 years ago

Looks like that part of the page doesn't have a paragraph tag

connorjoleary commented 3 years ago

Oh man, Idk how to properly grab text in this situation.


```python
texts = soup.findAll()
static = list(filter(self.tag_visible, texts))
```
returns duplicates of each section
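
A likely reason for the duplicates, sketched on a made-up snippet (the HTML below is hypothetical, not the real page): `find_all()` with no arguments returns every tag, nested ones included, so the text of an inner tag shows up again in each of its ancestors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = '<div><span>Roughly 100 arsonists, <a href="/study">per one study</a>.</span></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all() with no filter yields the div, the span and the a tag.
# Each tag's text includes the text of everything nested inside it,
# so the same words are reported once per nesting level.
texts = [tag.get_text() for tag in soup.find_all()]
for text in texts:
    print(repr(text))
```

Here the `div` and the `span` report the same full sentence, and the `a` tag repeats part of it a third time.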

`soup.get_text()` returns the text only once, but it returns only the text, without the hrefs.
`[d.text for d in soup.findAll() if not d.find() and d.text]` seems to be the closest, but it doesn't return the line that started this whole ticket (that line has children or descendants, idk which term is correct, so `not d.find()` filters it out)
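
One possible way around this, as a rough sketch and not DeepCite's actual fix: keep only the outermost tags that have a non-blank text node as a direct child, and collect the hrefs of any links nested inside them. The helper names `owns_text` and `text_blocks` and the sample HTML are made up for illustration. It assumes text mostly lives in leaf-ish containers; a page with stray text directly under `body` would collapse into one big block.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def owns_text(tag):
    """True if the tag has a non-blank, non-comment text node as a direct child."""
    return any(
        isinstance(child, NavigableString)
        and not isinstance(child, Comment)
        and child.strip()
        for child in tag.children
    )


def text_blocks(soup):
    """Return (text, hrefs) for each outermost tag that directly owns text."""
    blocks = []
    chosen_ids = set()  # ids of tags already emitted as a block
    for tag in soup.find_all():  # document order: ancestors come first
        if tag.name in ("script", "style"):
            continue
        if any(id(parent) in chosen_ids for parent in tag.parents):
            continue  # already covered by an ancestor's block
        if not owns_text(tag):
            continue
        chosen_ids.add(id(tag))
        text = " ".join(tag.get_text().split())  # normalize whitespace
        hrefs = [a["href"] for a in tag.find_all("a", href=True)]
        if tag.name == "a" and tag.has_attr("href"):
            hrefs.insert(0, tag["href"])
        blocks.append((text, hrefs))
    return blocks


# Hypothetical page: one normal paragraph and one quote with no <p> tag.
html = """
<div>
  <p>Plain paragraph.</p>
  <div>Roughly 100 arsonists, <a href="/study">per one study</a>.</div>
</div>
"""
blocks = text_blocks(BeautifulSoup(html, "html.parser"))
for text, hrefs in blocks:
    print(text, hrefs)
```

Unlike the `not d.find()` filter, this keeps a line that mixes text with a nested link, and each sentence appears only once because nested tags are skipped once an ancestor has been emitted.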

connorjoleary commented 2 years ago