gdifazio / google-refine

Automatically exported from code.google.com/p/google-refine

Ability to extract 'Name-like' strings from a comparatively larger string #52

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I'm not sure to what extent this is possible; I've only looked into this kind
of thing very briefly, but here goes.

Sometimes you'll have a blob of text with a few names in it. It'd be useful to
be able to automatically extract these names using some kind of natural
language processing.

I don't anticipate this being a perfect system, but I imagine that once you've 
got a list of potential 
names from all your rows you could use the clustering tools to confirm/reject 
individual names.
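
One way to picture the confirm/reject step is key-collision grouping of the candidate strings gathered from all rows. The following is only a minimal sketch, assuming a simplified fingerprint key (lowercase, strip punctuation, sort unique tokens). The function names and sample names are made up for illustration, and this is not Gridworks' actual clustering code.

```python
import re
from collections import defaultdict

def fingerprint(name):
    """Simplified key: lowercase, drop punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster_candidates(candidates):
    """Group candidate strings whose fingerprints collide."""
    groups = defaultdict(list)
    for name in candidates:
        groups[fingerprint(name)].append(name)
    return dict(groups)

if __name__ == "__main__":
    # Hypothetical candidates collected from several rows.
    rows = ["Jane Doe", "Doe, Jane", "jane doe.", "John Smith"]
    for key, members in cluster_candidates(rows).items():
        print(key, "->", members)
```

Each group could then be accepted or rejected as a whole, rather than reviewing every occurrence separately.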

Original issue reported on code.google.com by AndrewOf...@gmail.com on 20 May 2010 at 3:17

GoogleCodeExporter commented 9 years ago
Can you define "name-like" a little more rigorously? If a human can't
understand what you want, it's really difficult to imagine a computer program
doing better :-)

Original comment by stefano.mazzocchi@gmail.com on 20 May 2010 at 4:16

GoogleCodeExporter commented 9 years ago
I'll try, but I think this is the kind of thing that needs either an existing
algorithm or someone with natural language processing expertise.

A quick look around turned up this:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.6337

Basically a combination of spotting capitalised words and other keywords such 
as 'by', 'from', 'said' would be 
used to identify potential names, then the user would be responsible for 
accepting or rejecting specific 
results. Obviously the rules would vary from language to language, making 
things even more complicated. But 
English would be a good starting point.
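
As a rough illustration of the heuristic described above (capitalised words that follow a keyword), here is a minimal sketch. It assumes English text and uses only the three trigger words mentioned in this comment. The regular expression, the function name and the sample sentence are illustrative assumptions, not an established algorithm.

```python
import re

# Trigger words taken from the comment above; the list is illustrative only.
TRIGGERS = ("by", "from", "said")

# A run of capitalised tokens, allowing middle initials such as "Q."
CAP_RUN = r"[A-Z][a-z]+(?:\s+(?:[A-Z]\.|[A-Z][a-z]+))*"

PATTERN = re.compile(r"\b(?:%s)\s+(%s)" % ("|".join(TRIGGERS), CAP_RUN))

def candidate_names(text):
    """Return capitalised runs that directly follow a trigger word."""
    return [m.group(1) for m in PATTERN.finditer(text)]

if __name__ == "__main__":
    # Made-up sentence for demonstration.
    sample = "The report was written by Jane Doe, said John Q. Public from Acme."
    print(candidate_names(sample))
```

On the made-up sentence the sketch also returns "Acme", which is not a person. That is exactly the kind of result the user would reject in the review step.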

I'm under no illusion that this is a small ask and I understand that this would 
need some pretty serious 
research. So I guess for now I'm just asking if in theory this kind of 
functionality matches the broader vision of 
Gridworks and is worth looking at in greater detail.

Original comment by AndrewOf...@gmail.com on 20 May 2010 at 4:34

GoogleCodeExporter commented 9 years ago
Let's turn this around: instead of describing the solution you think is best
for a problem we don't know, why don't you describe (in detail) your problem
(maybe even show us a fragment of your data) and we can all brainstorm together
about what's the best solution?

Original comment by stefano.mazzocchi@gmail.com on 20 May 2010 at 4:47

GoogleCodeExporter commented 9 years ago
Thad was asking for something similar, too. This is called Named Entity
Extraction. I think the method used depends a lot on how long your text is. If
it's a short snippet, e.g., "Secretary of State Hillary Clinton", it'd be hard
to pick out the name alone. If it's at least one paragraph, capitalization
detection almost works; or we can use Reuters' OpenCalais, but that requires a
lot of HTTP calls.

Or we can use my poor man's NER:
http://d19.ner.dfhuynh.user.dev.freebaseapps.com/ner?text=secretary%20of%20state%20Hillary%20Clinton%20visit

Original comment by dfhu...@gmail.com on 20 May 2010 at 5:29

GoogleCodeExporter commented 9 years ago
@dfhuynh: I've also been looking into this kind of thing lately. My use case is
that I have screen-scraped thousands of articles from popular news sites and
now want to know what the articles are about. So for a long article about
Gridworks I'd like to get the keywords: "Gridworks", "Clustering algorithm",
"Release 1".

What's poor about your NER implementation?

Original comment by EmilStenstrom on 21 May 2010 at 6:20

GoogleCodeExporter commented 9 years ago
@stefano, here's an example of some data I've been playing with:

"Written by Geoff Johns & Peter J. Tomasi Art by Ivan Reis, Patrick Gleason, 
Ardian Syaf, Scott Clark & Joe Prado Cover by David Finch & 
Scott Williams 1:25 variant cover by Ivan Reis Deadman discovers the truth 
behind the formation of the White Lantern and what it 
means to the twelve returnees and the rest of the DC Universe. Plus, Aquaman, 
Martian Manhunter, Hawkman, Hawkgirl and Firestorm 
discover the price for their resurrections...and why they may be doing more 
harm than good to the world. Retailers please note: This 
issue will ship with two covers. Please see the Previews Order Form for more 
information. On sale JULY 21 32 pg, FC, $2.99 US'"

Ideally I'd like to be able to extract the names of the artists/writers as well 
as the names of the characters such as Aquaman.

Original comment by AndrewOf...@gmail.com on 21 May 2010 at 9:31

GoogleCodeExporter commented 9 years ago
Entity extraction seems orthogonal given that Gridworks is fundamentally a 
rectangular 
data thing, not a text thing.

Here's a recent overview of several of the available APIs which may be useful.
http://faganm.com/blog/2010/01/02/1009/
Be sure to check the comments for mentions of services that they overlooked.

Original comment by tfmorris on 21 May 2010 at 3:11

GoogleCodeExporter commented 9 years ago
@AndrewOfPie, thanks for the example.

While I agree with @tfmorris that entity extraction is a different game from
rectangular data operations (in fact, it's normally what you do to begin the
structuring of unstructured text), it tickles me that Gridworks's contexts
might, in fact, help out in entity extraction processes.

For example, if you can somehow specify that the names you're looking for are 
names of a particular Freebase 
type (here, Comic Book Author)... you can imagine being able to perform a 
simple NLP extraction of things 
that can be names plus a Freebase search to make sure they actually are. You 
might need to do multiple 
passes if you expect multiple types to be named in there, but it shouldn't be 
too bad to do that.
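
The idea above (a simple extraction of things that can be names, followed by a type-restricted lookup, one pass per type) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the in-memory KNOWN table is a hypothetical stand-in for a Freebase search, the type labels are placeholders rather than real Freebase type IDs, and the capitalisation pattern is only one possible candidate detector.

```python
import re

# Hypothetical stand-in for a type-restricted Freebase search.
# Type labels and entries are placeholders for illustration.
KNOWN = {
    "comic book author": {"Geoff Johns", "Peter J. Tomasi"},
    "comic book character": {"Aquaman", "Hawkman"},
}

# Runs of capitalised words, allowing middle initials such as "J."
CAP_RUN = re.compile(r"[A-Z][a-z]+(?:\s+(?:[A-Z]\.|[A-Z][a-z]+))*")

def extract_typed(text, types):
    """One pass per type: keep capitalised runs that the lookup confirms."""
    candidates = set(CAP_RUN.findall(text))
    return {t: sorted(candidates & KNOWN.get(t, set())) for t in types}

if __name__ == "__main__":
    text = ("Written by Geoff Johns & Peter J. Tomasi. "
            "Plus, Aquaman and Hawkman discover the price.")
    for type_name, names in extract_typed(text, list(KNOWN)).items():
        print(type_name, "->", names)
```

Capitalised non-names such as "Written" and "Plus" are picked up as candidates but dropped by the lookup, which is the point of combining the two steps.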

Now, does anybody know of a good open source NLP library (possibly in Java)?

Original comment by stefano.mazzocchi@gmail.com on 21 May 2010 at 4:56

GoogleCodeExporter commented 9 years ago
@EmilStenstrom: it's poor because it only looks for capitalized words for name
candidates. Then it searches Freebase for those names. This is the result for
your sample paragraph:

http://tinyurl.com/296a3aw

Not very good, and hell slow.

@stefano: Stanford NLP Parser, but that's GPL. Grr. An extension framework
would probably isolate the GPL, right? Besides, that parser is hefty and
shouldn't be shipped with the core anyway.

Original comment by dfhu...@gmail.com on 21 May 2010 at 5:17

GoogleCodeExporter commented 9 years ago
@david Sounds like a great argument for a real extension framework ;-) The 
alternative is for us to provide a web 
service that wraps that... but the performance would be horrible and 
scalability of that service might be an issue.

Original comment by stefano.mazzocchi@gmail.com on 21 May 2010 at 5:30

GoogleCodeExporter commented 9 years ago
@stefano.mazzocchi: I've heard good things about NLTK (Natural Language
ToolKit), but I've never used it myself. Might be worth a look:
http://code.google.com/p/nltk/

Original comment by EmilStenstrom on 21 May 2010 at 5:46

GoogleCodeExporter commented 9 years ago
> Now, does anybody know of a good NLP open source library (possibly in java?)

NLTK - Apache 2.0, my preferred NLP framework, but it's Python: http://www.nltk.org/
UIMA - Apache 2.0, looks good and it's Java, but I've not tried it before: http://uima.apache.org/
GATE - LGPL: http://en.wikipedia.org/wiki/General_Architecture_for_Text_Engineering
RapidMiner has a text vector tool - Affero GPL: http://en.wikipedia.org/wiki/RapidMiner

Original comment by iainsproat on 21 May 2010 at 5:58

GoogleCodeExporter commented 9 years ago
I think most of the Java-based NLP stuff plugs into UIMA (I'm pretty sure GATE
does, for example).

Any type of entity extraction will be greatly improved by domain knowledge
(i.e., if you know it's comic book authors, illustrators and superheroes). I'm
sure there are NLP researchers using Wikipedia and/or Freebase as a source for
this type of domain knowledge, but I'm not sure how many of them are talking
about it publicly.

For the comic book domain, I bet that if you dumped all comic book publishers,
writers, illustrators, fictional universes, and characters, you'd end up with a
pretty manageably sized dictionary that you could use for simple dictionary
lookups of N-grams.
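
A dictionary lookup of N-grams along these lines might look like the sketch below. It is only a sketch: the tokenizer, the longest-match-first scan, the max_n default and the tiny dictionary are all assumptions made for illustration, and the dictionary is not an actual Freebase dump.

```python
import re

def tokenize(s):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", s.lower())

def ngram_matches(text, dictionary, max_n=3):
    """Scan word n-grams (longest first) and return matching entries."""
    lookup = {tuple(tokenize(entry)): entry for entry in dictionary}
    words = tokenize(text)
    found, i = [], 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 0, -1):
            entry = lookup.get(tuple(words[i:i + n]))
            if entry is not None:
                found.append(entry)
                i += n  # skip past the matched n-gram
                break
        else:
            i += 1  # no dictionary entry starts at this word
    return found

if __name__ == "__main__":
    # Illustrative dictionary; a real one would be dumped from a data source.
    comics = ["Geoff Johns", "Peter J. Tomasi", "Aquaman",
              "Martian Manhunter", "DC Universe"]
    text = ("Written by Geoff Johns & Peter J. Tomasi. Plus, Aquaman, "
            "Martian Manhunter and the rest of the DC Universe.")
    print(ngram_matches(text, comics))
```

Because both the text and the dictionary entries go through the same tokenizer, punctuation and case differences do not prevent a match. The cost is that entries differing only in punctuation collapse to the same key.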

Still nothing to do with Gridworks though, in my opinion.

Original comment by tfmorris on 21 May 2010 at 6:10

GoogleCodeExporter commented 9 years ago
I believe the above feature would be best suited to an extension.

Original comment by iainsproat on 14 Oct 2010 at 9:32

GoogleCodeExporter commented 9 years ago
RelEx, a narrow-AI component of OpenCog, is an English-language semantic
dependency relationship extractor, built on the Carnegie Mellon Link Grammar
parser. It can identify subject, object, indirect object and many other
syntactic dependency relationships between words in a sentence.

http://wiki.opencog.org/w/RelEx_Semantic_Relationship_Extractor

Original comment by thadguidry on 8 Nov 2010 at 9:01

GoogleCodeExporter commented 9 years ago
I have a part-of-speech tagger (API) running on App Engine using NLTK's default
tagger. Some notes here if anyone is interested:
http://www.google.com/buzz/sharunsanthosh/9E7UfxdVqgx

Original comment by sharunsa...@gmail.com on 13 Dec 2010 at 5:10