Closed idMysteries closed 2 months ago
<p>Next</p>
and <p>Boy Next Door</p>
You can pass a regex there to be safe. But the more patterns you pass, the slower your crawler will be.
If a paragraph contains the pattern, the entire paragraph will be discarded.
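As a rough illustration of why an anchored regex is the safer choice here, the sketch below compares a loose pattern with an anchored one. The helper `is_blacklisted` is hypothetical and only mimics the "discard the whole paragraph on a match" behavior described above; it is not the crawler's actual code.

```python
import re


def is_blacklisted(paragraph, patterns):
    """Return True when any pattern is found anywhere in the paragraph text."""
    return any(re.search(pattern, paragraph) for pattern in patterns)


# A loose pattern matches every paragraph that merely contains the word.
loose = [r"Next"]
# An anchored pattern matches only a paragraph that is exactly "Next".
strict = [r"^Next$"]

for text in ["Next", "Boy Next Door"]:
    print(text, is_blacklisted(text, loose), is_blacklisted(text, strict))
```

With the loose pattern both paragraphs would be discarded; with the anchored one only the bare "Next" paragraph is.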
nav_tags = contents.find_all("a", string="Table of Contents")
for nav in nav_tags:
    nav.parent.extract()
Is it better???
Yes, it looks better.
I mean there are a lot of places with blacklist_patterns in the code right now
Sometimes, when texts are separated by <br> tags, you cannot select elements like that. That is what made me introduce blacklist_patterns.
return "\n".join(
    [
        "<p>" + x + "</p>"
        for x in body.split(LINE_SEP)
        if not self.is_in_blacklist(x.strip())
    ]
)
Check this part. This is the only place where blacklist_patterns is applied. When building a paragraph, it checks whether the paragraph contains a blacklisted text, and skips the paragraph if it does.
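The filtering step described above can be tried in isolation with a small standalone sketch. The `LINE_SEP` value, the `build_paragraphs` function, and the way the blacklist is passed in are all assumptions made for illustration; the real cleaner keeps this logic in a method and may differ in detail.

```python
import re

# Assumed separator; the actual LINE_SEP value is not shown in this thread.
LINE_SEP = "<br>"


def build_paragraphs(body, blacklist_patterns):
    """Wrap each line in <p> tags, skipping lines that match the blacklist."""
    compiled = [re.compile(pattern) for pattern in blacklist_patterns]

    def is_in_blacklist(text):
        # A single match anywhere in the line rejects the whole line.
        return any(regex.search(text) for regex in compiled)

    return "\n".join(
        "<p>" + line + "</p>"
        for line in body.split(LINE_SEP)
        if not is_in_blacklist(line.strip())
    )


print(build_paragraphs("Hello<br>Next<br>World", [r"^Next$"]))
```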
The paragraph extraction works like this: the p_block_tags are treated as line breaks, and the plain_text_tags are treated as part of the same paragraph. I split paragraphs at p_block_tags, and join lines together into a single paragraph when I find plain_text_tags.
So...
Yeah, it will delete paragraphs containing any of these texts. As far as we checked, these texts do not appear in the content body. If by any chance they do appear in a paragraph, that paragraph will be deleted.
def download_chapter_body(self, chapter):
    soup = self.get_soup(chapter["url"])
    contents = soup.select_one(".entry-content")
    nav_tags = contents.find_all("a", string="Table of Contents")
    for nav in nav_tags:
        nav.parent.extract()
    self.cleaner.clean_contents(contents)
    return str(contents)
Is it OK?
It seemed more logical to me to make not only deletion by text, but also by "tag with text".
Yes, it is okay. You can go with it.
What if the text is not wrapped in a tag? Many sources have unwanted text like that.
blacklist_patterns :)
blacklist_patterns for text and blacklist_tag_patterns.update([["a", "Next"]])
find_all(blacklist_tag_patterns[0][0], string=re.compile(blacklist_tag_patterns[0][1]))
Could we use a tuple? ('a', 'Next')
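To make the "tag with text" idea concrete, here is a simplified sketch that works on an HTML string with plain regular expressions instead of BeautifulSoup's `find_all`, so that it runs without extra dependencies. The function `remove_tags_with_text` and the pair list are hypothetical, and the regex only handles simple, non-nested elements; a real cleaner would operate on the parsed tree as in the `find_all` line above.

```python
import re

# Hypothetical (tag name, text pattern) pairs; tuples work as well as lists.
blacklist_tag_patterns = [("a", r"Next"), ("a", r"Table of Contents")]


def remove_tags_with_text(html, pairs):
    """Delete elements whose tag name and whole inner text match a pair."""
    for tag, text_pattern in pairs:
        element = re.compile(
            rf"<{tag}\b[^>]*>\s*(?:{text_pattern})\s*</{tag}>"
        )
        html = element.sub("", html)
    return html


sample = '<p>Boy Next Door</p><a href="/2">Next</a>'
print(remove_tags_with_text(sample, blacklist_tag_patterns))
```

Because the tag name is part of the match, the paragraph that merely contains the word "Next" is left alone.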
In general, I just realized that someone, like me, can write a bad blacklist_patterns entry. And then that someone will wonder why a paragraph is missing.
I'm sorry, I demand a lot from you. Even I got a little nervous. :disappointed:
It is a good idea. I will change the cleaner.
blacklist_patterns are dangerous. I carefully review all the code that uses them. The optimal behavior is probably to remove only the text matching these patterns.
Hey, that's okay. If we don't share and discuss our ideas, learning and progress won't happen.
I made some changes to the blacklist_patterns behavior. I will not know how it fares until it is tested.
https://github.com/dipu-bd/lightnovel-crawler/commit/a99680bc3d7f9b9f9e12bc8b660bd330ab6f5b77
Will it produce empty links?
<a...>Next</a>
-> <a...></a>
?
Yes... I think it will work incorrectly. ^Translator:.*$
Yes, it can produce an empty link.
I plan to revise all the sources that use this bad_text_regex.
https://github.com/dipu-bd/lightnovel-crawler/commit/617dac993466cefa4d087576269b2be92f484c16
I think it should remove empty tags now. Need to test.
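A minimal sketch of the two-step behavior being discussed, written against a raw HTML string with the standard `re` module. It is not the code from the linked commits; the function name `remove_text_then_empty_tags` is made up here, and a real cleaner would work on parsed text nodes rather than on the markup itself.

```python
import re


def remove_text_then_empty_tags(html, bad_text_pattern):
    """Strip text matching the pattern, then drop elements left empty."""
    # Step 1: removing only the text can leave an empty element behind,
    # for example <a href="/2"></a>.
    html = re.sub(bad_text_pattern, "", html)
    # Step 2: remove simple elements that now have no content at all.
    return re.sub(r"<(\w+)\b[^>]*>\s*</\1>", "", html)


sample = '<p>Hello</p><a href="/2">Next</a>'
print(remove_text_then_empty_tags(sample, r"Next"))
```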
Alas, even so, I don't trust this method for removing navigation buttons. It can remove text from a paragraph just because of the word "Next". I would rather know for certain that I am deleting the link.
I don't want to leave it to chance. Although the chance of such a paragraph is very small.
<p>Bla bla bla... bla bla</p>
<p>Next</p>
<p>Bla bla bla... bla bla</p>
|>
<p>Bla bla bla... bla bla</p>
<p>Bla bla bla... bla bla</p>
The probability of this is small, but not zero.
For the rest of the cases, this is normal. But navigation buttons can be removed in a different way.
Maybe I'm being paranoid.
Yes, you are right. The tag based text cleaner should be added.
Deleting tags with text is the right way. As a result, there are now two methods: deleting only the text by regex, or deleting a tag together with its text. You're a genius))
And you decided to delete the whole paragraph with the text anyway... I liked the idea of deleting only part of the paragraph better.
It will break existing crawlers.
If we need to delete the entire paragraph, then we can use deletion with the "p" tag. Else delete only part of the paragraph.
You can use a temporary fix: replace the deletion in all crawlers with deletion of the p tag.
From now on let's use bad_tag_text_pairs. After cleaning all bad_text_regex, we can remove the logic.
Ahahahaha I saw heart
touch issue. lol 😛
Let's keep this ticket open until all bad_text_regex usages have been cleaned up.
good night.
Don't remove! The logic of deleting a part of a paragraph will be useful for such cases. And there are quite a lot of them actually. Many sites insert ads inside the paragraph.
<p>
Bla bla bla. Text text... <SITE NAME FOLLOW US ASS>Bla bla bla text text
</p>
|> cleaner |>
<p>
Bla bla bla. Text text... Bla bla bla text text
</p>
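The case above, where only part of a paragraph should go, can be sketched as follows. The function `strip_bad_text` and the example pattern are hypothetical and stand in for whatever the cleaner does with `bad_text_regex`; the whitespace clean-up at the end is an extra assumption.

```python
import re


def strip_bad_text(paragraph, bad_text_patterns):
    """Remove only the matching fragments and keep the rest of the paragraph."""
    for pattern in bad_text_patterns:
        paragraph = re.sub(pattern, "", paragraph)
    # Collapse the gap left behind by the removed fragment.
    return re.sub(r"\s{2,}", " ", paragraph).strip()


text = "Bla bla bla. Text text... Follow us on EXAMPLE Bla bla bla text text"
print(strip_bad_text(text, [r"Follow us on \w+"]))
```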
I won't remove the logic, just the usage of it. Upon checking the existing sources, I saw that it has been misused in too many places. I will fix those.
self.cleaner.bad_text_regex.update(
{
'a': r"""(PREVIOUS CHAPTER)
|(CHAPTER LIST)
|(NEXT CHAPTER)""",
'p': r"""(FOLLOW / LIKE / SUBSCRIBE)
|(FOLLOW AND LIKE THIS BLOG)
|(SUBSCRIBE and LIKE)
|(SUBSCRIBE AND LIKE)
|(Please donate any amount to support our group!)
|(Please donate to support our group!)"""
}
)
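One thing worth checking in a snippet like this: a triple-quoted pattern keeps its line breaks, so unless the pattern is compiled in a way that ignores them, every alternative except the last ends with a literal newline. The sketch below shows the effect with Python's `re` module and one possible alternative, joining the labels with `|`. The `matches` helper and the `labels` list are illustrative only, and how the cleaner actually compiles these patterns is not shown in this thread.

```python
import re

# Labels taken from the snippet above.
labels = ["PREVIOUS CHAPTER", "CHAPTER LIST", "NEXT CHAPTER"]

# The line breaks inside the triple quotes are part of the pattern.
multiline = r"""(PREVIOUS CHAPTER)
|(CHAPTER LIST)
|(NEXT CHAPTER)"""

# Joining escaped labels with "|" avoids the stray newlines.
joined = "|".join(re.escape(label) for label in labels)


def matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


for label in labels:
    print(label, matches(multiline, label), matches(joined, label))
```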
https://ancientheartloss.wordpress.com/i-am-my-wife-chapters/chapter-1-turbulent-hours/
DICT IS TOO HARD!!!!! :sob:
Do I understand correctly that if there is a "Next" in the text, then it will delete this text? It's a terrible thing.