currentslab / extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
https://pypi.org/project/extractnet
MIT License

Does not parse the page vk.com #4

Open Vponed opened 2 years ago

Vponed commented 2 years ago
import requests
from extractnet import Extractor

raw_html = requests.get('https://vk.com/neurosciencenews').text
results = Extractor().extract(raw_html)

It returns almost nothing. Why might that be? It works great with other sites. I would also like to know more about how to work with the extractor. In particular, is it possible to obtain not only the extracted data, but also how it was extracted?

theblackcat102 commented 2 years ago

My guess is that this page is generated client-side, so its content is loaded only after the initial page load. Using requests returns only a nearly empty page (the content has not been loaded yet). You might need to render the page first and then try again.

You can view these two files to understand how the extraction works
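
The suggestion to render the page before extracting could be sketched roughly as follows. This is a minimal illustration, not part of extractnet: `visible_text_length`, `looks_client_rendered`, and `render_html` are hypothetical helper names, the 200-character threshold is an arbitrary assumption, and Playwright is just one possible headless browser, assumed to be installed separately.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text found outside <script>, <style>, and <noscript> elements."""

    _SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._depth = 0   # nesting depth inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth == 0 and text:
            self.chunks.append(text)


def visible_text_length(html):
    """Count characters of text that are not inside script/style/noscript."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)


def looks_client_rendered(html, min_chars=200):
    """Hypothetical heuristic: very little static text hints at a JS-rendered page.

    The threshold is an assumption and will misjudge some pages.
    """
    return visible_text_length(html) < min_chars


def render_html(url):
    """Return the page's HTML after its scripts have run (assumes Playwright)."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for network activity to settle
        html = page.content()
        browser.close()
    return html
```

If `looks_client_rendered(raw_html)` is true for the `requests` response, the output of `render_html(url)` can be passed to `Extractor().extract(...)` instead. Whether that is enough for vk.com in particular is untested here; the site may also require a login or block headless browsers.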