dipu-bd / lightnovel-crawler

Generate and download e-books from online sources.
https://pypi.org/project/lightnovel-crawler/
GNU General Public License v3.0
1.42k stars, 279 forks

Cleaner blacklist_patterns #1646

Closed. idMysteries closed this issue 2 weeks ago

idMysteries commented 1 year ago
self.cleaner.blacklist_patterns.update([
    "Prev", "ToC", "Next"
])

Do I understand correctly that if the text contains "Next", that text will be deleted? That would be a terrible thing.

idMysteries commented 1 year ago

<p>Next</p> and <p>Boy Next Door</p>

dipu-bd commented 1 year ago

You can pass a regex there to be safe. But the more patterns you pass, the slower your crawler will be.
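
A minimal sketch of why an anchored regex is safer, assuming the patterns are applied with Python's `re.search` (the thread does not show how the cleaner compiles them, and the sample paragraphs are made up):

```python
import re

paragraphs = ["Next", "Boy Next Door", "Prev | ToC | Next"]

# A bare "Next" pattern matches anywhere inside the text.
loose = re.compile(r"Next")
# Anchoring limits the match to paragraphs that consist solely of the word.
strict = re.compile(r"^(Prev|ToC|Next)$")

loose_hits = [p for p in paragraphs if loose.search(p)]
strict_hits = [p for p in paragraphs if strict.search(p)]

print(loose_hits)
print(strict_hits)
```

With the loose pattern all three paragraphs are hit, including "Boy Next Door"; the anchored pattern only hits the paragraph that is exactly "Next".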

idMysteries commented 1 year ago

[image]

dipu-bd commented 1 year ago

If a paragraph contains the pattern, the entire paragraph will be discarded.

idMysteries commented 1 year ago
nav_tags = contents.find_all("a", string="Table of Contents")
for nav in nav_tags:
    nav.parent.extract()

Is it better???

idMysteries commented 1 year ago

[image]

dipu-bd commented 1 year ago

Yes, it looks better.

idMysteries commented 1 year ago

I mean, there are a lot of places in the code that use blacklist_patterns right now.

dipu-bd commented 1 year ago

Sometimes, when texts are separated by <br> tags, you cannot select elements like that. That is what made me introduce blacklist_patterns.

dipu-bd commented 1 year ago
return "\n".join(
    [
        "<p>" + x + "</p>"
        for x in body.split(LINE_SEP)
        if not self.is_in_blacklist(x.strip())
    ]
)

Check this part. This is the only place where blacklist_patterns is applied. While building each paragraph, it checks whether the paragraph contains a blacklisted text and, if so, skips it.
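
The filtering step can be reproduced in isolation. This is a sketch under assumptions: `LINE_SEP` is set to `"<br>"` here and `is_in_blacklist` is a hypothetical stand-in that uses `re.search`; neither definition is shown in the thread.

```python
import re

LINE_SEP = "<br>"  # assumed separator; the real constant is not shown in the thread


def is_in_blacklist(text, patterns):
    """Hypothetical helper: True if any pattern occurs anywhere in the text."""
    return any(re.search(p, text) for p in patterns)


def to_paragraphs(body, patterns):
    # Same shape as the quoted snippet: split, filter, wrap each line in <p>.
    return "\n".join(
        "<p>" + x + "</p>"
        for x in body.split(LINE_SEP)
        if not is_in_blacklist(x.strip(), patterns)
    )


body = "First line<br>Next<br>Boy Next Door<br>Last line"
loose_result = to_paragraphs(body, ["Next"])
strict_result = to_paragraphs(body, [r"^Next$"])
print(loose_result)
print(strict_result)
```

The loose pattern drops both "Next" and "Boy Next Door"; the anchored one drops only the standalone "Next" line.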

dipu-bd commented 1 year ago

Paragraph extraction works like this: the p_block_tags are treated as line breaks, and the plain_text_tags are treated as part of the same paragraph. I split paragraphs at p_block_tags, and join lines together into a single paragraph when I find plain_text_tags.
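
A rough sketch of that idea, not the project's implementation. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without dependencies, and `P_BLOCK_TAGS` is an assumed, abbreviated set (the real lists are not shown in the thread). Every other tag is treated as inline text, like plain_text_tags.

```python
from html.parser import HTMLParser

# Assumed, abbreviated set of block-level tags that end a paragraph.
P_BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3"}


class ParagraphSplitter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = []

    def _flush(self):
        # Join the collected pieces and normalize whitespace.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in P_BLOCK_TAGS:
            self._flush()  # a block tag acts as a line break

    def handle_endtag(self, tag):
        if tag in P_BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Text inside inline tags stays in the current paragraph.
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def split_paragraphs(html):
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


paragraphs = split_paragraphs(
    "<div>Hello <b>bold</b> world<br>Second line<p>Third <a href='#'>link</a></p></div>"
)
print(paragraphs)
```

Because text separated only by `<br>` never gets its own element, a selector-based cleanup cannot target it, which is the case blacklist_patterns was introduced for.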

idMysteries commented 1 year ago

[image] So...

dipu-bd commented 1 year ago

Yeah, it will delete paragraphs containing any of these texts. As far as we checked, these texts do not appear in the content body. If by any chance they do appear in a paragraph, that paragraph will get deleted.

idMysteries commented 1 year ago
    def download_chapter_body(self, chapter):
        soup = self.get_soup(chapter["url"])
        contents = soup.select_one(".entry-content")

        nav_tags = contents.find_all("a", string="Table of Contents")
        for nav in nav_tags:
            nav.parent.extract()

        self.cleaner.clean_contents(contents)

        return str(contents)

Is it OK?

idMysteries commented 1 year ago

It seemed more logical to me to support deletion not only by text, but also by "tag with text".

dipu-bd commented 1 year ago

Yes, it is okay. You can go with it.

> It seemed more logical to me to support deletion not only by text, but also by tag along with the text.

What if the text is not wrapped in a tag? Many sources have unwanted text like that.

idMysteries commented 1 year ago

> What if the text is not wrapped in a tag? Many sources have unwanted text like that.

blacklist_patterns :)

idMysteries commented 1 year ago

blacklist_patterns for plain text, and blacklist_tag_patterns.update([["a", "Next"]]) for tags.

idMysteries commented 1 year ago

find_all(blacklist_tag_patterns[0][0], string=re.compile(blacklist_tag_patterns[0][1]))

idMysteries commented 1 year ago

Can we use a tuple? ('a', 'Next')
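
A sketch of how such (tag, text) pairs could be applied. `blacklist_tag_patterns` is the name proposed above, not an existing attribute, and the helper names are hypothetical. It uses the standard library's `xml.etree.ElementTree` on a well-formed snippet rather than BeautifulSoup, so it runs without dependencies.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical configuration: (tag name, regex that the tag's text must match).
blacklist_tag_patterns = [("a", r"^(Prev|ToC|Next)$")]


def remove_keeping_tail(parent, el):
    """Detach el from parent but keep the text that followed it."""
    siblings = list(parent)
    idx = siblings.index(el)
    tail = el.tail or ""
    if idx == 0:
        parent.text = (parent.text or "") + tail
    else:
        siblings[idx - 1].tail = (siblings[idx - 1].tail or "") + tail
    parent.remove(el)


def remove_tag_text_pairs(root, pairs):
    compiled = [(tag, re.compile(rx)) for tag, rx in pairs]
    # ElementTree has no parent pointers, so map each child to its parent first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for el in list(root.iter()):
        text = "".join(el.itertext()).strip()
        if el in parents and any(el.tag == t and rx.search(text) for t, rx in compiled):
            remove_keeping_tail(parents[el], el)


root = ET.fromstring(
    "<div><p>Boy Next Door</p>"
    "<p><a href='#'>Prev</a> | <a href='#'>Next</a></p>"
    "<p>Story text</p></div>"
)
remove_tag_text_pairs(root, blacklist_tag_patterns)
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

Only the links are removed; the paragraph "Boy Next Door" is untouched. The " | " separator between the links is left behind, which is one reason the earlier snippet removes `nav.parent` instead of the link itself.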

idMysteries commented 1 year ago

In general, I just realized that someone, like me, can write a bad blacklist_patterns entry. And then that someone will wonder why the paragraph is missing.

idMysteries commented 1 year ago

I'm sorry, I demand a lot from you. Even I got a little nervous. :disappointed:

dipu-bd commented 1 year ago

It is a good idea. I will change the cleaner.

dipu-bd commented 1 year ago

> In general, I just realized that someone, like me, can write a bad blacklist_patterns entry. And then that someone will wonder why the paragraph is missing.

blacklist_patterns are dangerous. I deeply review all code that uses them. The optimal behavior is probably to remove only the text matching these patterns.

dipu-bd commented 1 year ago

> I'm sorry, I demand a lot from you. Even I got a little nervous. 😞

Hey, that's okay. If we don't share and discuss our ideas, learning and progress won't happen.

dipu-bd commented 1 year ago

I made some changes to the blacklist_patterns behavior. I won't know how it fares until I test it.

https://github.com/dipu-bd/lightnovel-crawler/commit/a99680bc3d7f9b9f9e12bc8b660bd330ab6f5b77

idMysteries commented 1 year ago

Will it create empty links? <a...>Next</a> -> <a...></a>?

idMysteries commented 1 year ago

[image] Yes... I think it will work incorrectly. ^Translator:.*$

dipu-bd commented 1 year ago

> Will it create empty links? <a...>Next</a> -> <a...></a>?

Yes, it can create an empty link.

dipu-bd commented 1 year ago

> [image] Yes... I think it will work incorrectly. ^Translator:.*$

I plan to revise all the sources that use this bad_text_regex.

dipu-bd commented 1 year ago

https://github.com/dipu-bd/lightnovel-crawler/commit/617dac993466cefa4d087576269b2be92f484c16

I think it should remove empty tags now. It needs testing.
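
A sketch of the two-step behavior discussed here: remove only the matching text, then drop tags that end up empty. This is not the code from the linked commits; the function names are hypothetical, and it uses the standard library's `xml.etree.ElementTree` on a well-formed snippet.

```python
import re
import xml.etree.ElementTree as ET

VOID_TAGS = {"br", "hr", "img"}  # legitimately empty, so never dropped


def strip_bad_text(root, pattern):
    """Delete only the matching text, leaving the surrounding tags in place."""
    rx = re.compile(pattern)
    for el in root.iter():
        if el.text:
            el.text = rx.sub("", el.text)
        if el.tail:
            el.tail = rx.sub("", el.tail)


def drop_empty_tags(root):
    """Remove childless tags with no text; repeat so emptied parents go too."""
    changed = True
    while changed:
        changed = False
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag in VOID_TAGS or len(child) or (child.text or "").strip():
                    continue
                # Keep any text that followed the removed tag.
                idx = list(parent).index(child)
                tail = child.tail or ""
                if idx == 0:
                    parent.text = (parent.text or "") + tail
                else:
                    prev = parent[idx - 1]
                    prev.tail = (prev.tail or "") + tail
                parent.remove(child)
                changed = True


root = ET.fromstring(
    "<div><p>Some text</p><p><a href='#'>Next</a></p><p>Boy Next Door</p></div>"
)
strip_bad_text(root, r"Next")
drop_empty_tags(root)
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

The empty link and its now-empty paragraph disappear, but an unanchored pattern also turns "Boy Next Door" into "Boy  Door", so text-level removal still needs carefully written patterns.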

idMysteries commented 1 year ago

Alas, even so, I don't trust this method to remove navigation buttons. It can remove text from a paragraph that merely contains the word "Next". I want to know for certain that what I am deleting is the link.

I don't want to leave it to chance. Although the chance of such a paragraph is very small.

idMysteries commented 1 year ago
<p>Bla bla bla... bla bla</p>
<p>Next</p>
<p>Bla bla bla... bla bla</p>

|>

<p>Bla bla bla... bla bla</p>
<p>Bla bla bla... bla bla</p>

The probability of this is small, but not zero.

idMysteries commented 1 year ago

For the rest of the cases, this is normal. But navigation buttons can be removed in a different way.

Maybe I'm being paranoid.

dipu-bd commented 1 year ago

Yes, you are right. A tag-based text cleaner should be added.

idMysteries commented 1 year ago

Deleting tags with text is the right way. As a result, there are now two methods: deleting only the text by regex, or deleting a tag together with its text. You're a genius))

idMysteries commented 1 year ago

And you decided to delete the whole paragraph with the text anyway... I liked the idea of deleting only part of the paragraph better.

idMysteries commented 1 year ago

[image] https://github.com/dipu-bd/lightnovel-crawler/blob/c2d113aac647f9af37f2b414405284b0676c1675/sources/zh/uukanshu.py#L97

dipu-bd commented 1 year ago

> I liked the idea of deleting only part of the paragraph better.

It will break existing crawlers.

idMysteries commented 1 year ago

If we need to delete the entire paragraph, we can use tag-based deletion with the "p" tag. Otherwise, delete only part of the paragraph.

idMysteries commented 1 year ago

> It will break existing crawlers.

You can use a temporary fix: replace the deletion in all crawlers with deletion of the p tag.

idMysteries commented 1 year ago

> [image] Yes... I think it will work incorrectly. ^Translator:.*$

> I plan to revise all the sources that use this bad_text_regex.

dipu-bd commented 1 year ago

From now on, let's use bad_tag_text_pairs. After cleaning up all bad_text_regex usages, we can remove the logic.

idMysteries commented 1 year ago

Ahahahaha I saw heart

dipu-bd commented 1 year ago

Ahahahaha I saw heart

touch issue. lol 😛

dipu-bd commented 1 year ago

Let's keep this ticket open until all bad_text_regex usages have been cleaned up.

Good night.

idMysteries commented 1 year ago

> From now on, let's use bad_tag_text_pairs. After cleaning up all bad_text_regex usages, we can remove the logic.

Don't remove it! The logic for deleting part of a paragraph will be useful for cases like this, and there are actually quite a lot of them. Many sites insert ads inside the paragraph. [image]

<p>
Bla bla bla. Text text... <SITE NAME FOLLOW US ASS>Bla bla bla text text
</p>

|> cleaner |>

<p>
Bla bla bla. Text text... Bla bla bla text text
</p>
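
The transformation above amounts to a regex substitution on the paragraph text. A minimal sketch, where the ad wording and the `remove_inline_text` helper are made up for illustration:

```python
import re

# Hypothetical ad marker; real sites use their own wording.
bad_text_regex = [r"\s*Read the latest chapters at example\.com\s*"]


def remove_inline_text(paragraph, patterns):
    """Cut matching text out of a paragraph and keep the rest."""
    for pattern in patterns:
        # Replace the match and its surrounding whitespace with a single space.
        paragraph = re.sub(pattern, " ", paragraph)
    return paragraph.strip()


text = (
    "Bla bla bla. Text text... "
    "Read the latest chapters at example.com "
    "Bla bla bla text text"
)
cleaned = remove_inline_text(text, bad_text_regex)
print(cleaned)
```

Unlike paragraph-level blacklisting, this keeps the story text on both sides of the inserted ad.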

dipu-bd commented 1 year ago

I won't remove the logic, just the usage of it. Upon checking the existing sources, I saw there are too many places where it has been misused. I will fix those.

idMysteries commented 1 year ago
        self.cleaner.bad_text_regex.update(
            {
                # Adjacent string literals keep stray newlines and indentation out of the pattern.
                'a': r"(PREVIOUS CHAPTER)"
                r"|(CHAPTER LIST)"
                r"|(NEXT CHAPTER)",
                'p': r"(FOLLOW / LIKE / SUBSCRIBE)"
                r"|(FOLLOW AND LIKE THIS BLOG)"
                r"|(SUBSCRIBE and LIKE)"
                r"|(SUBSCRIBE AND LIKE)"
                r"|(Please donate any amount to support our group!)"
                r"|(Please donate to support our group!)",
            }
        )

[image] https://ancientheartloss.wordpress.com/i-am-my-wife-chapters/chapter-1-turbulent-hours/
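
A long alternation can also be built from a plain list of phrases, which avoids hand-writing the regex. This is only a sketch: `build_exact_pattern` is a hypothetical helper, not part of the cleaner, and the phrase lists are abbreviated from the snippet above.

```python
import re

# Phrases taken from the snippet above (abbreviated).
NAV_PHRASES = ["PREVIOUS CHAPTER", "CHAPTER LIST", "NEXT CHAPTER"]
AD_PHRASES = [
    "FOLLOW / LIKE / SUBSCRIBE",
    "Please donate to support our group!",
]


def build_exact_pattern(phrases):
    """Match a text only if it is exactly one of the phrases (case-insensitive)."""
    # re.escape protects characters such as "/" and "!" from being read as regex syntax.
    body = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"^\s*(?:" + body + r")\s*$", re.IGNORECASE)


nav_pattern = build_exact_pattern(NAV_PHRASES)
ad_pattern = build_exact_pattern(AD_PHRASES)

print(bool(nav_pattern.search("Next Chapter")))
print(bool(nav_pattern.search("The NEXT CHAPTER begins")))
```

Anchoring with `^` and `$` means a story sentence that merely mentions one of the phrases is left alone, and the case-insensitive flag removes the need to list both "SUBSCRIBE and LIKE" and "SUBSCRIBE AND LIKE".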

idMysteries commented 1 year ago

DICT IS TOO HARD!!!!! :sob: