Open GoogleCodeExporter opened 9 years ago
Can you define "name-like" a little more rigorously? If a human can't understand what you want, it's really difficult to imagine a computer program doing better :-)
Original comment by stefano.mazzocchi@gmail.com
on 20 May 2010 at 4:16
I'll try, but I think this is the kind of thing that would need an existing algorithm or someone with natural language processing expertise.
A quick look around turned up this:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.6337
Basically, a combination of spotting capitalised words and other keywords such as 'by', 'from', 'said' would be used to identify potential names; the user would then be responsible for accepting or rejecting specific results. Obviously the rules would vary from language to language, making things even more complicated. But English would be a good starting point.
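A minimal sketch of the heuristic described above, assuming English text. The cue-word list, the regular expression, and the function name are illustrative assumptions and do not come from the linked paper or from Gridworks. It only proposes candidates and flags the ones that follow a cue word; accepting or rejecting them is left to the user.

```python
import re

# Cue words that often precede a name in English text (illustrative list).
CUE_WORDS = {"by", "from", "said"}

# A run of capitalised tokens, allowing single-letter initials such as "J."
NAME_RUN = re.compile(
    r"(?:[A-Z][a-z]+|[A-Z]\.)(?:\s+(?:[A-Z][a-z]+|[A-Z]\.))*"
)


def candidate_names(text):
    """Return (candidate, follows_cue) pairs for a user to accept or reject."""
    results = []
    for match in NAME_RUN.finditer(text):
        # Look at the token immediately before the capitalised run.
        preceding = text[:match.start()].split()
        follows_cue = bool(preceding) and preceding[-1].lower() in CUE_WORDS
        results.append((match.group(), follows_cue))
    return results


sample = "Written by Geoff Johns & Peter J. Tomasi Art by Ivan Reis"
for candidate, follows_cue in candidate_names(sample):
    print(candidate, follows_cue)
```

On this sample the run "Peter J. Tomasi Art" comes back as a single candidate, because "Art" is capitalised too. That kind of over-matching is why the review step matters.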
I'm under no illusion that this is a small ask and I understand that this would need some pretty serious research. So I guess for now I'm just asking if in theory this kind of functionality matches the broader vision of Gridworks and is worth looking at in greater detail.
Original comment by AndrewOf...@gmail.com
on 20 May 2010 at 4:34
Let's turn this around: instead of describing the solution you think is best for a problem we don't know, why don't you describe (in detail) your problem (maybe even show us a fragment of your data) and we can all brainstorm together about what's the best solution?
Original comment by stefano.mazzocchi@gmail.com
on 20 May 2010 at 4:47
Thad was asking for something similar, too. This is called Named Entity Extraction. I think the method used depends a lot on how long your text is. If it's a short snippet, e.g., "Secretary of State Hillary Clinton", it'd be hard to pick out the name alone. If it's at least one paragraph, capitalization detection almost works; or we can use Reuters' OpenCalais, but that requires a lot of HTTP calls.
Or we can use my poor man's NER:
http://d19.ner.dfhuynh.user.dev.freebaseapps.com/ner?text=secretary%20of%20state%20Hillary%20Clinton%20visit
Original comment by dfhu...@gmail.com
on 20 May 2010 at 5:29
@dfhuynh: I've also been looking into this kind of thing lately. My use case is that I have screen-scraped thousands of articles from popular news sites, and now want to know what the articles are about. So for a long article about Gridworks I'd like to get the keywords: "Gridworks", "Clustering algorithm", "Release 1".
What's poor about your NER implementation?
Original comment by EmilStenstrom
on 21 May 2010 at 6:20
@stefano, here's an example of some data I've been playing with:
"Written by Geoff Johns & Peter J. Tomasi Art by Ivan Reis, Patrick Gleason, Ardian Syaf, Scott Clark & Joe Prado Cover by David Finch & Scott Williams 1:25 variant cover by Ivan Reis Deadman discovers the truth behind the formation of the White Lantern and what it means to the twelve returnees and the rest of the DC Universe. Plus, Aquaman, Martian Manhunter, Hawkman, Hawkgirl and Firestorm discover the price for their resurrections...and why they may be doing more harm than good to the world. Retailers please note: This issue will ship with two covers. Please see the Previews Order Form for more information. On sale JULY 21 32 pg, FC, $2.99 US'"
Ideally I'd like to be able to extract the names of the artists/writers as well as the names of the characters such as Aquaman.
Original comment by AndrewOf...@gmail.com
on 21 May 2010 at 9:31
Entity extraction seems orthogonal given that Gridworks is fundamentally a rectangular data thing, not a text thing.
Here's a recent overview of several of the available APIs which may be useful.
http://faganm.com/blog/2010/01/02/1009/
Be sure to check the comments for mentions of services that they overlooked.
Original comment by tfmorris
on 21 May 2010 at 3:11
@AndrewOfPie, thanks for the example.
While I agree with @tfmorris that entity extraction is a different game from rectangular data operations (in fact, it's normally what you do to begin the structuring of unstructured text), it tickles me that Gridworks's contexts might, in fact, help out in entity extraction processes.
For example, if you can somehow specify that the names you're looking for are names of a particular Freebase type (here, Comic Book Author)... you can imagine being able to perform a simple NLP extraction of things that can be names plus a Freebase search to make sure they actually are. You might need to do multiple passes if you expect multiple types to be named in there, but it shouldn't be too bad to do that.
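A rough sketch of that multi-pass idea, with one verification pass per expected type. Everything here is hypothetical: `lookup` stands in for a Freebase search (no real Freebase API is called), `KNOWN` is a tiny hand-made table, and the "Comic Book Character" type name is assumed for illustration.

```python
def verify_candidates(candidates, type_name, lookup):
    """Keep the candidates that `lookup` confirms as instances of `type_name`."""
    return [name for name in candidates if lookup(name, type_name)]


def extract_by_types(candidates, type_names, lookup):
    """Run one verification pass per expected type."""
    return {t: verify_candidates(candidates, t, lookup) for t in type_names}


# Hypothetical stand-in for a typed search service: a hand-made set of
# (name, type) pairs. A real lookup would query an external service instead.
KNOWN = {
    ("Geoff Johns", "Comic Book Author"),
    ("Aquaman", "Comic Book Character"),
}


def stub_lookup(name, type_name):
    """Return True if the (name, type) pair is in the hand-made table."""
    return (name, type_name) in KNOWN


candidates = ["Written", "Geoff Johns", "Aquaman"]
types = ["Comic Book Author", "Comic Book Character"]
print(extract_by_types(candidates, types, stub_lookup))
```

Passing the lookup in as a function keeps the extraction step independent of whichever service ends up doing the verification.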
Now, does anybody know of a good open source NLP library (possibly in Java)?
Original comment by stefano.mazzocchi@gmail.com
on 21 May 2010 at 4:56
@EmilStenstrom: it's poor because it only looks for capitalized words for name candidates. Then it searches Freebase for those names. This is the result for your sample paragraph:
http://tinyurl.com/296a3aw
Not very good, and hell slow.
@stefano: Stanford NLP Parser, but that's GPL. Grr. An extension framework would probably isolate the GPL, right? Besides, that parser is hefty and shouldn't be shipped with the core anyway.
Original comment by dfhu...@gmail.com
on 21 May 2010 at 5:17
@david Sounds like a great argument for a real extension framework ;-) The alternative is for us to provide a web service that wraps that... but the performance would be horrible and scalability of that service might be an issue.
Original comment by stefano.mazzocchi@gmail.com
on 21 May 2010 at 5:30
@stefano.mazzocchi: I've heard good things about NLTK (Natural Language ToolKit), but never used it myself. Might be worth a look: http://code.google.com/p/nltk/
Original comment by EmilStenstrom
on 21 May 2010 at 5:46
> Now, does anybody know of a good NLP open source library (possibly in java?)
NLTK - Apache 2.0, my preferred NLP framework, but it's Python
http://www.nltk.org/
UIMA - Apache 2.0, looks good and it's Java, but I've not tried it before
http://uima.apache.org/
GATE - LGPL
http://en.wikipedia.org/wiki/General_Architecture_for_Text_Engineering
RapidMiner has a text vector tool - Affero GPL
http://en.wikipedia.org/wiki/RapidMiner
Original comment by iainsproat
on 21 May 2010 at 5:58
I think most of the Java-based NLP stuff plugs into UIMA (I'm pretty sure GATE does, for example).
Any type of entity extraction will be greatly improved by domain knowledge (i.e. if you know it's comic book authors, illustrators and superheroes). I'm sure there are NLP researchers using Wikipedia and/or Freebase as a source for this type of domain knowledge, but I'm not sure how many of them are talking about it publicly.
For the comic book domain, I bet that if you dumped all comic book publishers, writers, illustrators, fictional universes, and characters, you'd end up with a pretty manageably sized dictionary that you could use for simple dictionary lookups of N-grams.
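A small sketch of what such a dictionary lookup of N-grams could look like. The dictionary below is a hand-made, hypothetical stand-in for a real dump of publishers, writers, and characters; the function name and the greedy longest-match strategy are assumptions made for illustration.

```python
# Hypothetical domain dictionary: lowercase name -> kind of entity.
COMICS_DICTIONARY = {
    "geoff johns": "writer",
    "aquaman": "character",
    "martian manhunter": "character",
    "dc universe": "fictional universe",
}


def dictionary_matches(text, dictionary, max_n=3):
    """Greedy longest-match lookup of word N-grams in `dictionary`."""
    # Crude tokenisation: split on whitespace and trim surrounding punctuation.
    tokens = [t.strip(".,;:!?\"'") for t in text.split()]
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest N-gram starting at position i first.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower() in dictionary:
                matches.append((phrase, dictionary[phrase.lower()]))
                i += n
                break
        else:
            # No N-gram starting here is in the dictionary; move on one token.
            i += 1
    return matches


print(dictionary_matches("Plus, Aquaman, Martian Manhunter, Hawkman",
                         COMICS_DICTIONARY))
```

Because punctuation is trimmed before the N-grams are built, a phrase can match across a comma; a more careful tokeniser would treat punctuation as a boundary.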
Still nothing to do with Gridworks though, in my opinion.
Original comment by tfmorris
on 21 May 2010 at 6:10
I believe the above feature would be best suited to an extension.
Original comment by iainsproat
on 14 Oct 2010 at 9:32
RelEx, a narrow-AI component of OpenCog, is an English-language semantic dependency relationship extractor, built on the Carnegie-Mellon Link Grammar parser. It can identify subject, object, indirect object and many other syntactic dependency relationships between words in a sentence.
http://wiki.opencog.org/w/RelEx_Semantic_Relationship_Extractor
Original comment by thadguidry
on 8 Nov 2010 at 9:01
I have a part-of-speech tagger (API) running on App Engine using NLTK's default tagger. Some notes here if anyone is interested:
http://www.google.com/buzz/sharunsanthosh/9E7UfxdVqgx
Original comment by sharunsa...@gmail.com
on 13 Dec 2010 at 5:10
Original issue reported on code.google.com by AndrewOf...@gmail.com
on 20 May 2010 at 3:17