cltk / cltk_api

RESTful API for the CLTK
MIT License

Add ability to identify and respond with a list of entities given an input string of a classical text #24

Open lukehollis opened 8 years ago

lukehollis commented 8 years ago

For a given input string, we need to identify and respond with a list of entities (perhaps also their positions in the string).

A working function:

We have a little bit of work done here that I used on an earlier project: https://github.com/cltk/cltk_api/tree/master/metadata/entities We should only keep what is useful of this and delete every line of code that is not. I will filter through the existing files and be more judicious in removing what's not applicable and documenting what is.

lukehollis commented 8 years ago

Okay, I've cleaned up the entities metadata module a little bit more to retrieve data from wikipedia that we can render to the frontend: https://github.com/cltk/cltk_api/blob/feature/expose_cltk_core/metadata/entities/entity.py

There should be separate URIs for retrieving just the information from the core CLTK module and for retrieving more data/metadata from different sources.

manu-chroma commented 8 years ago

@lukehollis I would like to work on this if anyone else is not assigned already.

Imran31 commented 8 years ago

I am working on this (ref https://github.com/cltk/cltk_api/issues/20#issuecomment-198595178).

Imran31 commented 8 years ago

Hi @manu-chroma, I no longer have time to work on this since I'll be cleaning up https://github.com/cltk/cltk_api/pull/27 and preparing my proposal.

Please feel free to take this up!

manu-chroma commented 8 years ago

@lukehollis @kylepjohnson I believe that if we have a string as a parameter, a google search would reveal much more relevant results. We could either scrape the wikipedia link from the results page (if any), or keep a list of trusted/useful websites and, if any of them come up in the results, scrape those sites or just record their URLs. Accordingly, we can serve the JSON output as the URLs of the relevant websites, or scrape summarized data from them and serve that. What do you all think?

lukehollis commented 8 years ago

I was thinking about this python module: https://pypi.python.org/pypi/wikipedia

Something like this: https://github.com/cltk/cltk_api/blob/feature/expose_cltk_core/metadata/entities/wikipedia.py#L18
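
For reference, here is a minimal sketch of the kind of lookup that module makes possible, using the `wikipedia` package's search/page/summary calls. This is illustrative only, not the actual code in wikipedia.py:

```python
# Illustrative sketch only -- not the code in metadata/entities/wikipedia.py.
# Requires: pip install wikipedia
import wikipedia

def fetch_entity_summary(name):
    """Return a small dict of Wikipedia data for an entity name, or None if no hit."""
    results = wikipedia.search(name)
    if not results:
        return None
    page = wikipedia.page(results[0])  # take the top hit; manual review can fix bad picks later
    return {
        "title": page.title,
        "url": page.url,
        "summary": wikipedia.summary(page.title, sentences=2),
    }

print(fetch_entity_summary("Aeneas"))
```

The page object also exposes `images` (and, where present, `coordinates`), which is relevant to the reading-experience use case discussed below.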

I think mining search engine results or other sources sounds great as well, but I believe that wikipedia will be an important organization to link to and align with given our project's mission. VIAF (https://viaf.org/) seems useful too.

Ultimately, I'm not as concerned about what external data sources we aggregate for each entity as long as we have a diversified viewpoint when possible to regularize mistakes in disambiguations.

Entities like Io, especially, have been hard for me to work with in the past.

kylepjohnson commented 8 years ago

Hey guys, I like the way things are developing.

A few things:

manu-chroma commented 8 years ago

For example: if a user enters the query "Aeneas", we can easily construct a wiki URL, scrape data from the page, and return the relevant data. (As mentioned in https://github.com/cltk/cltk_api/blob/feature/expose_cltk_core/metadata/entities/wikipedia.py#L8) But if another user looking for the same data queries "Trojan hero Aeneas" or "Aeneas greek" or "Aeneas roman", I think the program will return incorrect results. This is my main concern with this approach.

lukehollis commented 8 years ago

I think in the initial stages we have to balance rendering incorrect results against manual review of the entities that are ingested. To my mind, the benefits outweigh the drawbacks here, but I can be talked down from linking wikipedia data.

The advantages to the end user are what I'm thinking about. Mostly I think that best advantage would be this user story:

As a reader of an ancient text, I want to know the geographical coordinates / biography / media related to this place/person/etc. that I haven't heard of so that I can actually understand what I'm reading.

What's the best option for this? I don't think there's any perfect answer, so maybe a multiplicity of resources is best--viz. "in a wealth of counselors there is wisdom"?

manu-chroma commented 8 years ago

I think wikipedia is a great idea. The JSON format mentioned in https://github.com/cltk/cltk_api/blob/feature/expose_cltk_core/metadata/entities/wikipedia.py#L8 is pretty decent.

But obtaining the correct wikipedia link is the problem. Parsing the string to extract the actual meaningful query is the real challenge here. I mentioned google because, no matter how mixed up our query is, google would provide the correct wiki link to extract information from. But as @kylepjohnson said, google is a bad idea, so I think we should explore other approaches.

Maybe use basic NLTK to tokenise the query and extract the meaningful info from it? @lukehollis What do you think?

lukehollis commented 8 years ago

I think @ferthalangur has the best perspective here. To my mind, the simplest first iteration of this should offer the first wikipedia result that can be manually revisited by a content administrator on the frontend when necessary.

If we want to leave VIAF/Pleiades, etc. here, let's delete their high-level stubs as soon as possible.

manu-chroma commented 8 years ago

@lukehollis What high-level stubs are you referring to?

ferthalangur commented 8 years ago

I think what @manu-chroma is describing is a problem of context. If your term is something like "Asp", you could get a link to a fish, a snake, an orchid, a lake in Minnesota, a pistol, or a German Goth band ... just to name a few examples. To be useful, it is necessary to link an entity to a disambiguated Wikipedia entry. If you had a few more words around the term, you might have enough context to disambiguate it. If you mixed in some "meta" metadata (e.g., the line is from a text by Vergil, so it probably isn't the pistol, the lake or the band in this particular Universe) it would help to get to the correct entry in Wikipedia. However, it seems like you might have to submit queries to Wikipedia, recognize that they are disambiguation pages, and then parse the various pages with the known metadata in order to get to the correct one.
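
To make that concrete, here is a rough sketch of how a disambiguation page surfaces through the `wikipedia` package and how known "meta" metadata could narrow the options. The keyword filter is a hypothetical stand-in for a smarter heuristic, not part of the existing module:

```python
# Sketch: catch a disambiguation page and narrow the candidates with context keywords.
# The filtering heuristic here is hypothetical, not part of the existing module.
import wikipedia
from wikipedia.exceptions import DisambiguationError

def resolve_entity(term, context_keywords=()):
    try:
        return wikipedia.page(term, auto_suggest=False).title
    except DisambiguationError as err:
        # err.options lists the candidate article titles from the disambiguation page
        for option in err.options:
            if any(kw.lower() in option.lower() for kw in context_keywords):
                return option
        return None  # unresolved; leave for manual review

print(resolve_entity("Asp", context_keywords=["snake", "reptile"]))
```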

How insane would it be to pre-process texts as they are added to the system to store meaningful associations, and relevant links, for terms that would be entities? Or maybe even terms that could be entities. Or is that what we are talking about doing here?

I am thinking that at some point there is going to have to be an element of human intervention to ensure that the associated entities are the same, and that the contextual materials attached to them (images, links to more information, etc) are relevant. It might make sense then, to include in the metadata about an entity a confidence value parameter for that association.

There will need to be some disambiguation of entities with the same textual name too. I am not well-read in the Classics, so the only example I can think of off the top of my head would be "Pliny" (the Elder and the Younger) ... So you have a text that makes reference to "Gaius Plinius" ... and let's assume that this could refer to either Pliny ... in your initial entity identification, you'll have to either "guess" [again, "Meta" metadata might help here], or someone will have to attach an association to which entity is being referenced. Initially (the first time the text is encountered) there will be no stored metadata for that chunk of text. If the API supports an 'automated search' with some back-end processing, this could then be attached with a confidence value of "guess" or "automated" or "${insert name of algorithm used}". {{The reason for being quite specific in how that association was created is because at some point in the future, there may be better pre-processors that can read texts and do a much better job of making those associations}} At some later date, perhaps the data will be cleaned up by a human being and the association would have a higher confidence value.

I did mention to @lukehollis that VIAF seems to have a lot of gaps, but that is true of most canonical data sources unless they have been specifically prepared, and their quality deteriorates geometrically as their scope increases. Take a look sometime at how inconsistent the subject or author indexes of your favorite academic library catalog are.

Hmmm ... looking back at what Luke said ... there could be a "MIGHT BE" value for the confidence that can actually link to two or more ambiguous associations until the ambiguity has been resolved.
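
Something like the following could capture the association, its provenance, and the "MIGHT BE" case where more than one candidate stays attached until a human resolves it. All field names here are hypothetical, just to pin down the shape being discussed:

```python
# Hypothetical shape for an entity-association record; field names are invented for illustration.
pliny_association = {
    "surface_form": "Gaius Plinius",
    "candidates": [
        {"label": "Pliny the Elder", "wikipedia_url": "https://en.wikipedia.org/wiki/Pliny_the_Elder"},
        {"label": "Pliny the Younger", "wikipedia_url": "https://en.wikipedia.org/wiki/Pliny_the_Younger"},
    ],
    "confidence": "MIGHT_BE",   # or "guess", "automated", "<algorithm name>", "human-verified"
    "created_by": "automated",  # record how the association was made, so better pre-processors can redo it later
}
```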

manu-chroma commented 8 years ago

@lukehollis I've made minor changes to remove any incomplete output: PR https://github.com/cltk/cltk_api/pull/31. The Wikipedia package is a little buggy but pretty convenient; I'll look for ways to optimize results using it. I tried the suggest feature and it's not very helpful. I was thinking of exposing the existing wikipedia summary functionality through the entities route:

GET /metadata/entities/wiki/define?q=<word>

What do you say?
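
Roughly, the route could look like this. This is a hedged Flask sketch; the actual app wiring in the repo and the PR may differ:

```python
# Rough sketch of the proposed endpoint; the app wiring here is hypothetical.
from flask import Flask, jsonify, request
import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

app = Flask(__name__)

@app.route("/metadata/entities/wiki/define")
def wiki_define():
    word = request.args.get("q", "")
    if not word:
        return jsonify({"error": "missing q parameter"}), 400
    try:
        summary = wikipedia.summary(word, sentences=2)
    except (DisambiguationError, PageError):
        summary = None  # ambiguous or missing page; let the client decide what to do
    return jsonify({"word": word, "definition": summary})

if __name__ == "__main__":
    app.run()
```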

kylepjohnson commented 8 years ago

Hey guys, I fell away from this repo for a little bit. @lukehollis do you want to follow up on this, since you have a better grasp of the frontend's needs?

lukehollis commented 8 years ago

Hey @manu-chroma and @kylepjohnson, yes, that sounds good. Thanks for your patience!

I also found the suggest feature to be pretty lacking for our needs. I think what you've done with the define endpoint looks good, @manu-chroma. For the needs of the frontend, we just need something functional instead of perfect. I imagine that the frontend will save data from your work here, @manu-chroma, for every entity that is mined and returned from the CLTK NER. We can check a list of regularized entities (and their declined forms) in our database for each entity that we receive from the other entity API endpoint, and if it isn't found, we can try at least to fetch the wikipedia information via our API.
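
In other words, something along these lines on the consuming side; the table and endpoint here are placeholders, just to pin down the lookup order:

```python
# Sketch of the lookup order: local regularized-entity store first, API fallback second.
# KNOWN_ENTITIES and API_BASE are placeholders, not actual frontend code.
import requests

KNOWN_ENTITIES = {}  # stand-in for the frontend's table of regularized entities and their declined forms
API_BASE = "http://localhost:5000"

def entity_metadata(name):
    cached = KNOWN_ENTITIES.get(name.lower())
    if cached is not None:
        return cached
    resp = requests.get(API_BASE + "/metadata/entities/wiki/define", params={"q": name})
    return resp.json() if resp.ok else None
```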

The biggest enhancement for the reading experience from this actually, I would argue, is the wikipedia images because they make all the weird names and patronymics that much more memorable.

Reviewing the documentation on the CLTK core NER, I think our biggest challenge will be regularizing naming--but that doesn't have to do with the functionality of this issue.

manu-chroma commented 8 years ago

@lukehollis I really like your idea of using CLTK NER to build a good list of words which people might actually search for, and putting them in the frontend database ahead of time.

I was thinking of another way of integrating CLTK NER: identifying keywords before sending them as a query to the wikipedia module. Maybe that will help increase the accuracy of the search. What do you think?
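
Concretely, something like this is what I have in mind. It assumes the `cltk.tag.ner.tag_ner` interface in the CLTK core; the exact signature may differ across versions:

```python
# Sketch: run CLTK NER over the raw query, then look up only the recognized names.
# Assumes cltk.tag.ner.tag_ner('latin', input_text=..., output_type=list); verify against the core.
import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError
from cltk.tag import ner

def summaries_for_query(query, language="latin"):
    tagged = ner.tag_ner(language, input_text=query, output_type=list)
    names = [token[0] for token in tagged if len(token) > 1]  # entities come back as ('Word', 'Entity')
    summaries = {}
    for name in names:
        try:
            summaries[name] = wikipedia.summary(name, sentences=1)
        except (DisambiguationError, PageError):
            summaries[name] = None  # ambiguous or missing; leave for manual review
    return summaries
```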

Also, I'll expose this search functionality through the above mentioned URL and write required tests for the same. :+1:

lukehollis commented 8 years ago

Apologies for the delay! I think increasing the accuracy of the search via identifying keywords (maybe some amount of regularization for lemmatization of names..?) would be really awesome. That would be groundbreaking for us as we try to pull this data into the frontend.

manu-chroma commented 8 years ago

@lukehollis Can you suggest any modules I could try for this?

lukehollis commented 8 years ago

I think I would start by just using the CLTK core to identify entities, and then we can test the results and refine on the frontend. It appears that for the API we only need to expose the CLTK NER functionality with something like:

GET /core/ner/<string:sentence>
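
As a sketch, that route might look like the following; the Flask wiring is hypothetical and the `tag_ner` call assumes the same CLTK core interface as above:

```python
# Rough sketch of exposing CLTK NER at the route above; not the repo's actual handler.
from flask import Flask, jsonify
from cltk.tag import ner

app = Flask(__name__)

@app.route("/core/ner/<string:sentence>")
def ner_route(sentence):
    tagged = ner.tag_ner("latin", input_text=sentence, output_type=list)
    entities = [token[0] for token in tagged if len(token) > 1]
    return jsonify({"sentence": sentence, "entities": entities})
```
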
manu-chroma commented 8 years ago

@lukehollis I've posted some implementation details on gitter channel of this repo. Could you follow up on that?