Cleaner blacklist_patterns

idMysteries commented 2 years ago

self.cleaner.blacklist_patterns.update([
    "Prev", "ToC", "Next"
])

Do I understand correctly that if "Next" appears anywhere in the text, that text will be deleted? That would be terrible.

idMysteries commented 2 years ago

For example: "Next" and "Boy Next Door".

dipu-bd commented 2 years ago

You can pass a regex there to be safe. But the more patterns you pass, the slower your crawler will be.
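
A minimal sketch of why an anchored regex is safer than a bare word, using only Python's `re` module. The helper `is_blacklisted` and the sample strings are hypothetical and not part of the crawler; the sketch assumes each pattern is searched for in a whole stripped line.

```python
import re

# Hypothetical patterns: a bare word versus an anchored alternative.
loose = re.compile(r"Next")
strict = re.compile(r"^(Prev|ToC|Next)$")


def is_blacklisted(pattern: "re.Pattern", line: str) -> bool:
    """Return True when the pattern is found in the stripped line."""
    return pattern.search(line.strip()) is not None


print(is_blacklisted(loose, "Boy Next Door"))   # the bare word also hits story text
print(is_blacklisted(strict, "Boy Next Door"))  # the anchored pattern does not
print(is_blacklisted(strict, " Next "))         # a lone navigation label still matches
```

Anchoring with `^` and `$` limits the match to lines that consist of the navigation label alone.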

idMysteries commented 2 years ago

dipu-bd commented 2 years ago

If a paragraph contains the pattern, the entire paragraph will be discarded.

idMysteries commented 2 years ago

nav_tags = contents.find_all("a", string="Table of Contents")
for nav in nav_tags:
    nav.parent.extract()

Is this better?

idMysteries commented 2 years ago

dipu-bd commented 2 years ago

Yes, it looks better.

idMysteries commented 2 years ago

I mean, there are a lot of places in the code that use blacklist_patterns right now.

dipu-bd commented 2 years ago

Sometimes, when texts are separated by tags rather than wrapped in them, you cannot select elements like that. That is what made me introduce blacklist_patterns.

dipu-bd commented 2 years ago

return "\n".join(
    [
        "<p>" + x + "</p>"
        for x in body.split(LINE_SEP)
        if not self.is_in_blacklist(x.strip())
    ]
)

Check this part. This is the only place where blacklist_patterns is applied. When building a paragraph, it checks whether blacklisted text exists in that paragraph and, if so, skips it.
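
A self-contained sketch of that filter, for illustration only. The value of `LINE_SEP`, the pattern set, and the way the patterns are combined into one regex are assumptions; the real cleaner may differ.

```python
import re

LINE_SEP = "<br>"  # assumed separator; the real constant may differ
blacklist_patterns = {r"^Prev$", r"^ToC$", r"^Next$"}  # assumed sample patterns

# Combine all patterns into a single alternation so each line is scanned once.
_blacklist = re.compile("|".join(f"({p})" for p in sorted(blacklist_patterns)))


def is_in_blacklist(text: str) -> bool:
    """True for empty text or text matching any blacklist pattern."""
    return not text or _blacklist.search(text) is not None


def to_paragraphs(body: str) -> str:
    """Wrap each surviving line in <p> tags, skipping blacklisted lines."""
    return "\n".join(
        "<p>" + x.strip() + "</p>"
        for x in body.split(LINE_SEP)
        if not is_in_blacklist(x.strip())
    )


print(to_paragraphs("First line<br>Next<br>Boy Next Door"))
```

With anchored patterns, the lone "Next" line is dropped while "Boy Next Door" survives; with an unanchored "Next", both would be discarded.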

dipu-bd commented 2 years ago

Paragraph extraction works like this: p_block_tags are treated as line breaks, and plain_text_tags are treated as part of the same paragraph. I split the content into paragraphs at p_block_tags, and join lines together into a single paragraph when I find plain_text_tags.
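
A rough sketch of that splitting idea, written with the standard library's `html.parser` so it runs without dependencies. The tag set and the class `ParagraphSplitter` are assumptions for illustration and are not the crawler's actual code; here every tag outside `P_BLOCK_TAGS` is simply treated as inline.

```python
from html.parser import HTMLParser
from typing import List

# Assumed block-level tags; the real p_block_tags list is likely longer.
P_BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3"}


class ParagraphSplitter(HTMLParser):
    """Collect text, starting a new paragraph at every block-level tag."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._current: List[str] = []

    def _flush(self) -> None:
        text = "".join(self._current).strip()
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in P_BLOCK_TAGS:
            self._flush()  # block tags act as line breaks

    def handle_endtag(self, tag):
        if tag in P_BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)  # inline tags contribute only their text

    def close(self):
        super().close()
        self._flush()


def split_paragraphs(html: str) -> List[str]:
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


print(split_paragraphs("<div>Hello <b>bold</b> world<br>Second line</div>"))
```

The `<b>` tag does not break the line, while `<br>` and `<div>` do, so the input yields two paragraphs.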

idMysteries commented 2 years ago

So...

dipu-bd commented 2 years ago

Yeah, it will delete paragraphs containing any of these texts. As far as we have checked, these texts do not appear in the content body. If by any chance they do appear in a paragraph, that paragraph will get deleted.

idMysteries commented 2 years ago

    def download_chapter_body(self, chapter):
        soup = self.get_soup(chapter["url"])
        contents = soup.select_one(".entry-content")

        nav_tags = contents.find_all("a", string="Table of Contents")
        for nav in nav_tags:
            nav.parent.extract()

        self.cleaner.clean_contents(contents)

        return str(contents)

Is it OK?

idMysteries commented 2 years ago

It seems more logical to me to support deletion not only by text, but also by "tag with text".

dipu-bd commented 2 years ago

Yes, it is okay. You can go with it.

> It seemed more logical to me to make not only deletion by text, but also by tag along with the text.

What if the text is not wrapped by a tag? Many sources have unwanted text like that.

idMysteries commented 2 years ago

> What if the text is not wrapped by a tag? Many sources have unwanted text like that.

blacklist_patterns :)

idMysteries commented 2 years ago

Keep blacklist_patterns for text, and add blacklist_tag_patterns.update([["a", "Next"]]) for tags.

idMysteries commented 2 years ago

find_all(blacklist_tag_patterns[0][0], string=re.compile(blacklist_tag_patterns[0][1]))

idMysteries commented 2 years ago

Can we use a tuple? ('a', 'Next')
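
A hedged sketch of the tag-plus-text idea with tuples. It uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup so that it runs without dependencies, and it assumes well-formed markup. The helpers `detach` and `remove_blacklisted_tags` are hypothetical, as is the assumption that `blacklist_tag_patterns` is a set.

```python
import re
import xml.etree.ElementTree as ET

# Tuples are hashable, so they fit in a set; a list like ["a", "Next"] would not.
blacklist_tag_patterns = {("a", r"Prev|ToC|Next")}


def detach(parent: ET.Element, child: ET.Element) -> None:
    """Remove child from parent but keep the text that follows it."""
    index = list(parent).index(child)
    if child.tail:
        if index == 0:
            parent.text = (parent.text or "") + child.tail
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + child.tail
    parent.remove(child)


def remove_blacklisted_tags(root: ET.Element) -> None:
    """Drop elements whose tag matches and whose whole text matches the pattern."""
    for parent in list(root.iter()):
        for child in list(parent):
            text = (child.text or "").strip()
            if any(
                child.tag == tag and re.fullmatch(pattern, text)
                for tag, pattern in blacklist_tag_patterns
            ):
                detach(parent, child)


root = ET.fromstring(
    "<div><p>Boy Next Door</p><a href='/2'>Next</a><p>The end</p></div>"
)
remove_blacklisted_tags(root)
print(ET.tostring(root, encoding="unicode"))
```

Because the tag must be `a` and the whole text must match, the paragraph "Boy Next Door" is left untouched while the navigation link is removed.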

idMysteries commented 2 years ago

In general, I just realized that someone, like me, can write a bad blacklist_patterns entry. And then that someone will wonder why a paragraph is missing.

idMysteries commented 2 years ago

I'm sorry, I demand a lot from you. Even I got a little nervous. 😞

dipu-bd commented 2 years ago

It is a good idea. I will change the cleaner.

dipu-bd commented 2 years ago

> In general, I just realized that someone, like me, can write a bad blacklist_patterns And then that someone will wonder why the paragraph is missing.

blacklist_patterns are dangerous. I closely review all code that uses them. The optimal behavior is perhaps to remove only the text matching these patterns.

dipu-bd commented 2 years ago

> I'm sorry, I demand a lot from you. Even I got a little nervous. 😞

Hey, that's okay. If we don't share and discuss our ideas, learning and progress won't happen.

dipu-bd commented 2 years ago

I made some changes to the blacklist_patterns behavior. I will not know how it fares until it is tested.

https://github.com/dipu-bd/lightnovel-crawler/commit/a99680bc3d7f9b9f9e12bc8b660bd330ab6f5b77

idMysteries commented 2 years ago

Will it leave empty links? <a...>Next</a> -> <a...></a>?

idMysteries commented 2 years ago

Yes... I think it will work incorrectly. ^Translator:.*$

dipu-bd commented 2 years ago

> Will it leave empty links? <a...>Next</a> -> <a...></a>?

Yes, it can leave an empty link.

dipu-bd commented 2 years ago

> Yes... I think it will work incorrectly. ^Translator:.*$

I plan to revise all the sources that use this bad_text_regex.

dipu-bd commented 2 years ago

https://github.com/dipu-bd/lightnovel-crawler/commit/617dac993466cefa4d087576269b2be92f484c16

I think it should remove empty tags now. Need to test.
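
A sketch of what such a two-step cleanup could look like: first blank out matching text, then drop tags that are left empty. This is not the code from the commit above; it is an illustration using the standard library's `xml.etree.ElementTree`, and the names `bad_text`, `VOID_TAGS`, and `clean` are assumptions.

```python
import re
import xml.etree.ElementTree as ET

bad_text = re.compile(r"^(Prev|ToC|Next)$")  # assumed sample pattern
VOID_TAGS = {"br", "img", "hr"}  # tags that are legitimately empty


def clean(root: ET.Element) -> None:
    # Step 1: blank out text that matches the pattern.
    for element in root.iter():
        if element.text and bad_text.search(element.text.strip()):
            element.text = None
    # Step 2: walk deepest elements first and drop the ones left empty.
    for parent in reversed(list(root.iter())):
        for child in list(parent):
            is_empty = len(child) == 0 and not (child.text or "").strip()
            has_tail = bool((child.tail or "").strip())
            if is_empty and not has_tail and child.tag not in VOID_TAGS:
                parent.remove(child)


root = ET.fromstring(
    "<div><p>Story text</p><p><a href='/2'>Next</a></p></div>"
)
clean(root)
print(ET.tostring(root, encoding="unicode"))
```

Walking the tree from the deepest elements upward lets the wrapping `<p>` be removed once its only child, the emptied link, is gone.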

idMysteries commented 2 years ago

Alas, even so, I don't trust this method to remove navigation buttons. It can remove text from a paragraph just because it contains the word "Next". I want to know for certain that I am deleting the link.

I don't want to leave it to chance. Although the chance of such a paragraph is very small.

idMysteries commented 2 years ago

<p>Bla bla bla... bla bla</p>
<p>Next</p>
<p>Bla bla bla... bla bla</p>

|>

<p>Bla bla bla... bla bla</p>
<p>Bla bla bla... bla bla</p>

The probability of this is small, but not zero.

idMysteries commented 2 years ago

For the rest of the cases, this is normal. But navigation buttons can be removed in a different way.

Maybe I'm being paranoid.

dipu-bd commented 2 years ago

Yes, you are right. The tag based text cleaner should be added.

idMysteries commented 2 years ago

Deleting tags with text is the right way. So now there are two methods: deleting only the text by regex, or deleting a tag together with its text. You're a genius))

idMysteries commented 2 years ago

And you decided to delete the whole paragraph with the text anyway... I liked the idea of deleting only part of the paragraph better.

idMysteries commented 2 years ago

https://github.com/dipu-bd/lightnovel-crawler/blob/c2d113aac647f9af37f2b414405284b0676c1675/sources/zh/uukanshu.py#L97

dipu-bd commented 2 years ago

> I liked the idea of deleting only part of the paragraph better.

It will break existing crawlers.

idMysteries commented 2 years ago

If we need to delete the entire paragraph, we can use deletion with the "p" tag. Otherwise, delete only part of the paragraph.

idMysteries commented 2 years ago

> It will break existing crawlers.

You can use a temporary fix: replace the text deletion in all crawlers with deletion of the p tag.

idMysteries commented 2 years ago

> Yes... I think it will work incorrectly. ^Translator:.*$
>
> I plan to revise all the sources that use this bad_text_regex.

dipu-bd commented 2 years ago

From now on let's use bad_tag_text_pairs. After cleaning all bad_text_regex, we can remove the logic.

idMysteries commented 2 years ago

Ahahahaha I saw heart

dipu-bd commented 2 years ago

> Ahahahaha I saw heart

touch issue. lol 😛

dipu-bd commented 2 years ago

Let's keep this ticket open until all bad_text_regex usages have been cleaned up.

good night.

idMysteries commented 2 years ago

> From now on let's use bad_tag_text_pairs. After cleaning all bad_text_regex, we can remove the logic.

Don't remove! The logic of deleting a part of a paragraph will be useful for such cases. And there are quite a lot of them actually. Many sites insert ads inside the paragraph.

<p>
Bla bla bla. Text text... <SITE NAME FOLLOW US ASS>Bla bla bla text text
</p>

|> cleaner |>

<p>
Bla bla bla. Text text... Bla bla bla text text
</p>
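
A minimal sketch of removing only the unwanted part of a paragraph with `re.sub`. The ad string and the helper `strip_bad_text` are hypothetical; real sources insert different text.

```python
import re

# Hypothetical inline ad; surrounding whitespace is consumed with it.
bad_text = re.compile(r"\s*Follow us on ExampleSite!\s*")


def strip_bad_text(paragraph: str) -> str:
    """Remove only the matching text and keep the rest of the paragraph."""
    return bad_text.sub(" ", paragraph).strip()


print(strip_bad_text(
    "Bla bla bla. Text text... Follow us on ExampleSite! Bla bla bla text text"
))
```

A paragraph that consists of nothing but the ad comes back as an empty string, which a later step could then drop.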

dipu-bd commented 2 years ago

I won't remove the logic, just the usage of it. Upon checking the existing sources, I saw that there are too many places where it has been misused. I will fix those.

idMysteries commented 2 years ago

        self.cleaner.bad_text_regex.update(
            {
                # Adjacent string literals keep newlines and indentation out of the regex.
                'a': r"(PREVIOUS CHAPTER)"
                r"|(CHAPTER LIST)"
                r"|(NEXT CHAPTER)",
                'p': r"(FOLLOW / LIKE / SUBSCRIBE)"
                r"|(FOLLOW AND LIKE THIS BLOG)"
                r"|(SUBSCRIBE and LIKE)"
                r"|(SUBSCRIBE AND LIKE)"
                r"|(Please donate any amount to support our group!)"
                r"|(Please donate to support our group!)",
            }
        )

https://ancientheartloss.wordpress.com/i-am-my-wife-chapters/chapter-1-turbulent-hours/

idMysteries commented 2 years ago

DICT IS TOO HARD!!!!! 😭

dipu-bd / lightnovel-crawler

Cleaner blacklist_patterns #1646