alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License
311 stars 59 forks source link

Rosencrantz/#153 media monitoring #176

Closed Rosencrantz closed 3 years ago

Rosencrantz commented 3 years ago

From @pudo:

Thanks for keeping this going, very cool feature add. Some misc feedback:

I wonder if we should rename aleph_emit to aleph_document internally for clarity (we could still support aleph_emit as an alias in setup.py). I'm not sure about the split of responsibilities between parse_article and aleph_entity: I feel parse_article should handle emitting properties that are assembled, and those just get pushed wholesale to the API in aleph_entity. Regarding languages and countries: on the Aleph side, these are validated against followthemoney. Since memorious now also depends on that, we could very easily do a prelim normalisation/validation there and produce some good logging. I assume that's also what include_languages tries to deal with: from followthmoney.types import registry

country_code = registry.country.clean('Russia') lang_code = registry.language.clean('ru') p.s. Regarding line breaks: would be awesome to set up black and adopt it's defaults for other linters (like flake8) as well :)

Rosencrantz commented 3 years ago

@kjacks @sunu I want to merge this. Yes, I'm aware it does not currently have any NER capabilities but if we leave it un-merged then it'll simply end up going stale and we'll never use it. Unless there are any really good reasons to the contrary I'd like to put this onto master.

Rosencrantz commented 3 years ago

@Rosencrantz I have left some comments regarding the aleph_entity method.

Also naming wise, I think aleph_emit_entity and aleph_emit_document would be better names. And after a while we can remove aleph_emit in favour of aleph_emit_document. What do you think?

Good idea's I will create aleph_emit_entity and aleph_emit_document. For the time being we will have to retain aleph_emit, as there will be people (including ourselves) using it. We should publish a deprecation notice with these changes signalling our intent to remove it.