alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

WIP Rosencrantz/#153 media monitoring #167

Closed Rosencrantz closed 3 years ago

Rosencrantz commented 3 years ago

VERY WIP. This is certainly not ready for merging yet; I may regret opening a pull request this early, but I wanted to raise awareness and garner feedback.

@pudo, @sunu I've made a start on adding the ability to parse articles using memorious and the newspaper3k library. So far, these changes provide:

I've also added a small debugger to the dev setup (debugpy, installed via worker.sh), which allows a bit of remote debugging.
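
For context, a minimal sketch of how debugpy is usually wired into a worker process; the port and the gating environment variable here are illustrative, not necessarily what this branch does:

```python
import os

import debugpy

# Only open the debug listener when explicitly requested (hypothetical env var).
if os.environ.get("MEMORIOUS_DEBUG"):
    # Listen on all interfaces so an IDE outside the container can attach.
    debugpy.listen(("0.0.0.0", 5678))
    # Optionally block until a client attaches before the worker starts.
    debugpy.wait_for_client()
```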

Things I still want to do:

Addendum:

It's worth being mindful that the newspaper/newspaper3k library does not seem to have had any major updates in a while. It would be problematic to start relying on it only to find it holding future development back.

In my own fiddling with this problem (how to extract the useful information from an article and strip away the extra cruft), I've had it in mind to attempt using ML: render each page as an image and train a model to handle the information extraction. That would probably cope with a wider variety of news outlets without per-site configuration, with the downside that it won't always get things right. Something to think about.

pudo commented 3 years ago

Thanks for keeping this going, very cool feature add. Some misc feedback:

```python
from followthemoney.types import registry

# Normalise human-readable values to standard codes via the FtM type registry:
country_code = registry.country.clean('Russia')  # -> 'ru'
lang_code = registry.language.clean('ru')
```

p.s. Regarding line breaks: it would be awesome to set up black and adopt its defaults for other linters (like flake8) as well :)

Rosencrantz commented 3 years ago

Possibly, although a document is an entity. I wonder whether something like aleph_ingest might work? Either way, we'd need to deprecate the old name if it's something our users rely on in their own Aleph instances. Doubling up in setup.py would be a good idea!

I'm still playing with parsing data. In the latest commit I've changed things around: you can now specify an XPath to use to extract information from an article (or whatever type you want to parse). There is now a separate article.py file in the example folder that contains the parse and parse_article methods. parse extracts content using newspaper3k; parse_article looks at the YAML file, and if an XPath is defined for a property it uses that, otherwise it falls back to newspaper3k.

I did this because newspaper3k doesn't always succeed in extracting all the data, so I wanted to be able to configure it manually. As it stands, the setup could be used across multiple news sites and configured individually for each. It's also not restricted to articles: in theory you could parse the data into any Aleph type; you'd simply need to create a file that defines the schema and puts data into the appropriate fields.
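
Roughly, the fallback could look like the sketch below. The method name mirrors the description above, but the data keys and YAML layout are illustrative rather than the exact code in this branch:

```python
import lxml.html
from newspaper import Article  # newspaper3k

def parse_article(context, data):
    # Parse the fetched page so configured XPaths can run against it.
    doc = lxml.html.fromstring(data["html"])  # hypothetical data key
    # Let newspaper3k take a first pass at the article.
    article = Article(data["url"])
    article.download(input_html=data["html"])
    article.parse()
    properties = {"title": article.title, "bodyText": article.text}
    # Any property with an explicit XPath in the stage YAML overrides newspaper3k.
    for prop, options in context.params.get("properties", {}).items():
        xpath = options.get("xpath")
        if xpath:
            values = doc.xpath(xpath)
            if values:
                properties[prop] = str(values[0]).strip()
    context.emit(data={**data, "properties": properties})
```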

I've added an aleph_entity method to aleph.py and abstracted some of the common code out into a separate method.

Rosencrantz commented 3 years ago

Small update: you can now create arbitrary entities via aleph_entity. At the moment you need to specify a schema in the YAML file, along with the properties, which take XPath values for their information. That data is then pushed into Aleph using write_entities.
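
To make that concrete, a hypothetical stage configuration might look like the following; the key names (schema, properties, xpath) follow the description above but aren't guaranteed to match this branch exactly, and the Article schema with its title/author/publishedAt properties comes from followthemoney:

```yaml
parse:
  method: example.article:parse_article
  params:
    schema: Article            # any followthemoney schema should work
    properties:
      title:
        xpath: '//h1[@class="headline"]/text()'
      author:
        xpath: '//span[@class="byline"]/text()'
      publishedAt:
        xpath: '//meta[@property="article:published_time"]/@content'
  handle:
    pass: store
```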