Open vzeazy opened 1 year ago
The following worked for me:
from autoscraper import AutoScraper

scraper = AutoScraper()

# Map each alias to text copied as-is from the training page (typos included).
wanted_dict = {
    "title": ["Possible to to try to extract main article from a page?"],
    "meta": ["vzeazy"],
    "content": ['Given the examples, struggle to understand how this may be utilized to extract the main article from a page. In this case, the sample would be the article content itself. Would be great if it could use several samples from other websites and then develop a generalized pattern for additional pages. Guessing this my be out of scope for this project.']
}

# Learn extraction rules from the saved training page, then persist them.
with open('sample/train.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
result = scraper.build(html=source_code, wanted_dict=wanted_dict)
scraper.save('github')

# Apply the learned rules to another saved page.
with open('sample/test.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
result = scraper.get_result_exact(html=source_code)
Given the examples, struggle to understand how this may be utilized to extract the main article from a page. In this case, the sample would be the article content itself. Would be great if it could use several samples from other websites and then develop a generalized pattern for additional pages. Guessing this my be out of scope for this project.
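To make the "several samples, then a generalized pattern" idea concrete, here is a minimal standard-library sketch. It is not AutoScraper's implementation, only a toy illustration under my own assumptions: the names `PathTextParser`, `path_texts`, `learn_paths` and `extract`, and the inline sample pages, are all hypothetical. It records the tag/class path that held the wanted text in each sample and reuses those paths on an unseen page.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collect (path, text) pairs, where path is the chain of open tag.class names."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, outermost first
        self.found = []  # (path tuple, stripped text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class") or ""
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def path_texts(html):
    """Return every non-empty text node together with its element path."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.found


def learn_paths(samples):
    """samples: list of (html, wanted_text). Return the paths that held the wanted text."""
    paths = set()
    for html, wanted in samples:
        paths.update(path for path, text in path_texts(html) if text == wanted)
    return paths


def extract(html, paths):
    """Return the text found at any of the learned paths."""
    return [text for path, text in path_texts(html) if path in paths]


# Two hypothetical training pages with different layouts.
train_a = '<html><body><div class="post"><p>First article body.</p></div><div class="nav">Home</div></body></html>'
train_b = '<html><body><article><p>Second article body.</p></article><footer>About</footer></body></html>'
rules = learn_paths([(train_a, "First article body."), (train_b, "Second article body.")])

# An unseen page that shares the layout of the first sample.
test_page = '<html><body><div class="post"><p>Unseen article body.</p></div><div class="nav">Home</div></body></html>'
print(extract(test_page, rules))
```

In this toy version the learned paths only help on pages that share a layout with one of the samples, so a page from a site with an unseen structure returns nothing. As far as I can tell, AutoScraper's `build` also takes an `update` argument intended to keep earlier rules when building on another sample, which may be the closest built-in equivalent.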