alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

WIP Rosencrantz/#153 media monitoring #167

Closed Rosencrantz closed 3 years ago

Rosencrantz commented 3 years ago

VERY WIP. This is certainly not ready for merging yet; I may regret opening a pull request this early, but I wanted to raise awareness and garner feedback.

@pudo, @sunu I've made a start on adding the ability to parse articles using memorious and the newspaper3k library. So far, these changes provide:

I've also added a small debugger to the dev setup (debugpy, installed via worker.sh), which allows a bit of remote debugging.
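
For context, a minimal sketch of how debugpy is usually wired into a worker process; the port and the gating environment variable here are illustrative, not necessarily what this branch does:

```python
import os

import debugpy

# Only open the debug listener when explicitly requested (hypothetical env var).
if os.environ.get("MEMORIOUS_DEBUG"):
    # Listen on all interfaces so an IDE outside the container can attach.
    debugpy.listen(("0.0.0.0", 5678))
    # Optionally block until a client attaches before the worker starts.
    debugpy.wait_for_client()
```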

Things I still want to do:

Addendum:

It's worth being mindful that the newspaper/newspaper3k library does not seem to have had any major updates in a while. It would be problematic to start relying on it only to find it holding future development back.

In my own fiddling with this problem (how to extract the useful information from an article and strip away the extra cruft), I've had it in mind to attempt using ML: render each page as an image and train a model to handle the information extraction. That would probably cope with a wider variety of news outlets without per-site configuration, with the downside that it won't always get things right. Something to think about.

pudo commented 3 years ago

Thanks for keeping this going, very cool feature add. Some misc feedback:

```python
from followthemoney.types import registry

# Normalise human-readable values to standard codes via the FtM type registry:
country_code = registry.country.clean('Russia')  # -> 'ru'
lang_code = registry.language.clean('ru')
```

p.s. Regarding line breaks: it would be awesome to set up black and adopt its defaults for other linters (like flake8) as well :)

Rosencrantz commented 3 years ago

Possibly, although a document is an entity. I wonder whether something like aleph_ingest might work? Either way, we'd need to deprecate the old name if it's something our users rely on in their own Aleph instances. Doubling up in setup.py would be a good idea!

I'm still playing with parsing data. In the latest commit I've changed things around: you can now specify an XPath to use to extract information from an article (or whatever type you want to parse). There is now a separate article.py file in the example folder that contains the parse and parse_article methods. parse extracts content using newspaper3k; parse_article looks at the YAML file, and if an XPath is defined for a property it uses that, otherwise it falls back to newspaper3k.

I did this because newspaper3k doesn't always succeed in extracting all the data, so I wanted to be able to configure it manually. As it stands, the setup could be used across multiple news sites and configured individually for each. It's also not restricted to articles: in theory you could parse the data into any Aleph type; you'd simply need to create a file that defines the schema and puts data into the appropriate fields.
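
Roughly, the fallback could look like the sketch below. The method name mirrors the description above, but the data keys and YAML layout are illustrative rather than the exact code in this branch:

```python
import lxml.html
from newspaper import Article  # newspaper3k

def parse_article(context, data):
    # Parse the fetched page so configured XPaths can run against it.
    doc = lxml.html.fromstring(data["html"])  # hypothetical data key
    # Let newspaper3k take a first pass at the article.
    article = Article(data["url"])
    article.download(input_html=data["html"])
    article.parse()
    properties = {"title": article.title, "bodyText": article.text}
    # Any property with an explicit XPath in the stage YAML overrides newspaper3k.
    for prop, options in context.params.get("properties", {}).items():
        xpath = options.get("xpath")
        if xpath:
            values = doc.xpath(xpath)
            if values:
                properties[prop] = str(values[0]).strip()
    context.emit(data={**data, "properties": properties})
```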

I've added an aleph_entity method to aleph.py and abstracted some of the common code out into a separate method.

Rosencrantz commented 3 years ago

Small update: you can now create arbitrary entities via aleph_entity. At the moment you need to specify a schema in the YAML file, along with the properties, which take XPath values for their information. That data is then pushed into Aleph using write_entities.
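
To make that concrete, a hypothetical stage configuration might look like the following; the key names (schema, properties, xpath) follow the description above but aren't guaranteed to match this branch exactly, and the Article schema with its title/author/publishedAt properties comes from followthemoney:

```yaml
parse:
  method: example.article:parse_article
  params:
    schema: Article            # any followthemoney schema should work
    properties:
      title:
        xpath: '//h1[@class="headline"]/text()'
      author:
        xpath: '//span[@class="byline"]/text()'
      publishedAt:
        xpath: '//meta[@property="article:published_time"]/@content'
  handle:
    pass: store
```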