alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
MIT License
6.16k stars 648 forks

Possible to to try to extract main article from a page? #86

Open vzeazy opened 1 year ago

vzeazy commented 1 year ago

Given the examples, struggle to understand how this may be utilized to extract the main article from a page. In this case, the sample would be the article content itself. Would be great if it could use several samples from other websites and then develop a generalized pattern for additional pages. Guessing this my be out of scope for this project.

entrptaher commented 1 year ago

The following worked for me:

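from autoscraper import AutoScraper

scraper = AutoScraper()

# Sample strings are copied verbatim from the training page (typos included),
# because they have to match the page text exactly.
wanted_dict = {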
    "title": ["Possible to to try to extract main article from a page?"],
    "meta": ["vzeazy"],
    "content": ['Given the examples, struggle to understand how this may be utilized to extract the main article from a page. In this case, the sample would be the article content itself. Would be great if it could use several samples from other websites and then develop a generalized pattern for additional pages. Guessing this my be out of scope for this project.']
}

# Learn the scraping rules from a saved copy of the page, then persist them.
with open('sample/train.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
result = scraper.build(html=source_code, wanted_dict=wanted_dict)
scraper.save('github')

# Apply the learned rules to a different saved page with the same layout.
with open('sample/test.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
result = scraper.get_result_exact(html=source_code)