alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
MIT License
6.16k stars, 648 forks

Defining large block of text as wanted list #40

Closed: ohidurbappy closed this issue 3 years ago

ohidurbappy commented 3 years ago

When the target value is a large block of text, the script becomes messy. Could a feature be added that lets us specify such text in a shortened form?

For example: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum

can be defined as: Lorem ipsum(...)est laborum

alirezamika commented 3 years ago

Hey, messy in what way? You mean the results won't be the expected ones?

ohidurbappy commented 3 years ago

> Hey, messy in what way? You mean the results won't be the expected ones?

@alirezamika Suppose we have 3 blocks of text, each 400 words long. Imagine having to put 3×400 words into a script!

alirezamika commented 3 years ago

I see. Adding support for regular expressions would be nice. For now, you can also keep the long texts in a separate file and load them from there.
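As a rough sketch of the "separate file" workaround, the long texts could live in a JSON file and be loaded at runtime. The file name `wanted.json` and the helper `load_wanted_list` are hypothetical names chosen for illustration; they are not part of autoscraper.

```python
import json
import tempfile
from pathlib import Path


def load_wanted_list(path):
    """Read wanted items from a JSON file that holds a list of strings."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError("expected a JSON list of strings")
    return items


# Demo: write a small sample file, then load it back.
# In practice the file would hold the real 400-word blocks.
with tempfile.TemporaryDirectory() as tmp:
    wanted_file = Path(tmp) / "wanted.json"  # hypothetical file name
    wanted_file.write_text(
        json.dumps(["First long block of text", "Second long block of text"]),
        encoding="utf-8",
    )
    wanted_list = load_wanted_list(wanted_file)

print(len(wanted_list))

# The loaded list can then be passed to the scraper as usual, e.g.:
# scraper.build(url, wanted_list=wanted_list)
```

This keeps the script short while the texts themselves stay exact, so matching behaves the same as with inline strings.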

anonscrape commented 3 years ago

Hello, I have the same problem, I'm getting a block of text by innerText and sometimes this does not get matched.

alirezamika commented 3 years ago

> Hello, I have the same problem, I'm getting a block of text by innerText and sometimes this does not get matched.

I'm not sure what your problem is exactly, but you may want to adjust the text_fuzz_ratio argument when calling the build method.
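To illustrate why a fuzz ratio below 1.0 can help with text taken from innerText, here is a small standard-library sketch. It uses `difflib.SequenceMatcher` only as an illustration of ratio-based matching; whether autoscraper compares text in exactly this way is an assumption, and the sample strings are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


wanted = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
# Text read from a page often differs slightly in whitespace.
scraped = "Lorem ipsum dolor sit amet,  consectetur adipiscing elit\n"

ratio = similarity(wanted, scraped)
print(wanted == scraped)   # exact comparison fails
print(round(ratio, 3))     # but the texts are still very similar

# With autoscraper installed, a lower threshold can be passed to build():
# from autoscraper import AutoScraper
# scraper = AutoScraper()
# scraper.build(url, wanted_list=[wanted], text_fuzz_ratio=0.9)
```

A threshold of 0.9 is only an example value; lowering it too far may produce unwanted matches.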

alirezamika commented 3 years ago

In the latest version (v1.1.10) you can use regular expressions as wanted items:

import re
wanted_list = [re.compile('Lorem ipsum.+est laborum')]
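One detail to watch with such a pattern: by default `.` does not match newlines, so a block that spans several lines will not match unless `re.DOTALL` is used. The sketch below checks this with the standard library only. Using `fullmatch` is an assumption about how the pattern is applied to the text, and the sample block is abbreviated for illustration.

```python
import re

block = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit,\n"
    "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n"
    "Excepteur sint occaecat cupidatat non proident, sunt in culpa qui "
    "officia deserunt mollit anim id est laborum"
)

single_line = re.compile("Lorem ipsum.+est laborum")
multi_line = re.compile("Lorem ipsum.+est laborum", re.DOTALL)

# '.' stops at the first newline, so the plain pattern cannot span the block.
print(bool(single_line.fullmatch(block)))  # False
# With DOTALL, '.' also matches newlines and the whole block matches.
print(bool(multi_line.fullmatch(block)))   # True

# If the text has its whitespace collapsed, the plain pattern is enough.
flat = " ".join(block.split())
print(bool(single_line.fullmatch(flat)))   # True
```

If the target text on the page contains line breaks, `re.compile('Lorem ipsum.+est laborum', re.DOTALL)` is the safer choice for the wanted item.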