Closed Henni closed 7 years ago
Algorithms I found so far:
Dual Iterative Pattern Relation Expansion Paper: Sergey Brin
Paper: Eugene Agichtein, Luis Gravano based on DIPRE looks like it doesn't generate any ranking Python implementation (maybe interesting to compare our approach or generate testdata)
Distant Supervision Approach, DIRT (Lin & Pantel 2003), TextRunner (Banko et al. 2007) Demo: http://openie.allenai.org/, "Improved Dirt" (Yao et al. 2012)
Mostly taken from: http://courses.cs.washington.edu/courses/cse517/13wi/slides/cse517wi13-RelationExtraction.pdf http://courses.cs.washington.edu/courses/cse517/13wi/slides/cse517wi13-RelationExtractionII.pdf
@vviro thoughts on this?
I would also check out these projects:
http://reverb.cs.washington.edu/ http://nlp.stanford.edu/software/relationExtractor.html https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/patty/ https://github.com/U-Alberta/exemplar https://github.com/patverga/torch-relation-extraction
... and probably more interesting repos on GitHub. It doesn't have to be just one algorithm from the start - could be worth trying out several and see if good results are coming out of any of them.
You might just use similarity / distance measures to get started with a baseline…
Afterwards you can maybe experiment with more advanced algorithms and compare them against the baseline
If you represent the features as numerical vectors in js I might even have an out of the box solution for simple distance and similarity metrics (and I happen to know the developer of the tool :D :D )
https://www.npmjs.com/package/disi + https://github.com/sacdallago/disi
@RBirkeland do you have any news on this? @krishenk have you also worked on this task?
Ollie Information Extraction: https://homes.cs.washington.edu/~mausam/papers/emnlp12a.pdf
https://github.com/knowitall/ollie
Sentence: Johann Sebastian Bach was born in Eisenach, the capital of the duchy of Saxe-Eisenach, in present-day Germany, on 21 March 1685 O.S. (31 March 1685 N.S.)
Output: Johann Sebastian Bach was born in Eisenach, the capital of the duchy of Saxe-Eisenach, in present-day Germany, on 21 March 1685 O.S. (31 March 1685 N.S.) 0.9: (Johann Sebastian Bach; was born in; Eisenach) 0.832: (Johann Sebastian Bach; was born in; present-day Germany) 0.794: (Johann Sebastian Bach; was born on; 21) 0.575: (Eisenach; be the capital of; the duchy of Saxe-Eisenach) 0.555: (Johann Sebastian Bach; was born at; Eisenach) 0.541: (Eisenach; was born in; the capital of the duchy of Saxe-Eisenach) 0.407: (Johann Sebastian Bach; was born at; present-day Germany)
https://github.com/mit-nlp/MITIE
Sentence: Johann Sebastian Bach was born in Eisenach, the capital of the duchy of Saxe-Eisenach, in present-day Germany, on 21 March 1685 O.S. (31 March 1685 N.S.)
Output: Tags output by this NER model are: PERSON LOCATION ORGANIZATION MISC Number of entities found: 6 Entity tag: PERSON Entity text: Johann Sebastian Bach Entity tag: LOCATION Entity text: Eisenach Entity tag: LOCATION Entity text: Saxe-Eisenach Entity tag: LOCATION Entity text: Germany Entity tag: LOCATION Entity text: O.S. Entity tag: LOCATION Entity text: N.S. Relation detector type: people.person.place_of_birth Johann Sebastian Bach BORN IN Eisenach
https://github.com/machinalis/iepy
Not yet tested
Building on @ansjin we will need to find the semantic similarity of relationships, f.ex: composed and wrote, played and performed.
Clustering of semantically similar words can be solved by brown's clustering, but luckily his theory has already been developed in WordNet.
Web interface to play around with WordNet: Link WordNet can be downloaded here: Link
In my testing Lin's relatedness measurements seems to work best and here is some of the results. composed & wrote = 1 composed & traveled = 0.25 played & performed = 0.966 went & travelled = 1 teach & learn = 1 teach & fruit = 0.2
@RBirkeland I still see a few issues with WordNet we have to think about:
@Henni Ye, this is a weakness with WordNet because it bases its estimate on the training set, so therefor it will also not be able to rank completely dissimilar words such as learn & car, however this should be fine as long as we classify null as 0.
@Henni and @RBirkeland : I also tried teach and study, and got the same result. I think we need to have a way to include prepositions in relation, like studied under or taught by (to include the direction of relationship in some sense), as taught by and taught mean just the opposite.
I ran OPEN IE(https://github.com/allenai/openie-standalone) algorithm and results seems to be quite interesting and a little bit matching to what we manually did today. Check the following sentences and there results:
His father probably taught him to play the violin and harpsichord, and his brother Johann Christoph Bach taught him the clavichord and exposed him to much contemporary music
Instance: 0.49 (His father; probably taught; him; to play the violin and harpsichord)
Instance: 0.45 (him; to play; the violin and harpsichord)
Instance: 0.45 (his brother; taught; him; the clavichord)
Instance: 0.45 (his brother; exposed; him; to much contemporary music)
Instance: 0.39 (Johann Christoph Bach; [is] brother [of]; him)
In January 1703, shortly after graduating from St. Michael's and being turned down for the post of organist at Sangerhausen,[21] Bach was appointed court musician in the chapel of Duke Johann Ernst III in Weimar
Instance: 0.95 (Bach; was appointed; court musician in the chapel of Duke Johann Ernst in Weimar; L:In January 1703, shortly after graduating from St. Michael's and being turned down for the post of organist at Sangerhausen)
Instance: 0.39 (Sangerhausen; [is] Bach was appointed court musician in; the chapel)
Mozart's most famous pupil, whom the Mozarts took into their Vienna home for two years as a child, was probably Johann Nepomuk Hummel, a transitional figure between Classical and Romantic eras.
Instance: 0.90 (Mozart's most famous pupil; took; into their Vienna home; T:for two years as a child)
Instance: 0.94 (Mozart's most famous pupil; was probably; Johann Nepomuk Hummel)
Instance: 0.38 (Johann Nepomuk Hummel; [is] a transitional figure between; Classical)
In the course of 1782 and 1783, Mozart became intimately acquainted with the work of Johann Sebastian Bach and George Frideric Handel as a result of the influence of Gottfried van Swieten, who owned many manuscripts of the Baroque masters.
Instance: 0.98 (Mozart; became; intimately acquainted with the work of Johann Sebastian Bach and George Frideric Handel as a result of the influence of Gottfried van Swieten; T:In the course of 1782 and 1783)
Instance: 0.95 (Mozart; intimately acquainted; with the work of Johann Sebastian Bach and George Frideric Handel as a result of the influence of Gottfried van Swieten)
Instance: 0.93 (Gottfried van Swieten; owned; many manuscripts of the Baroque masters)
Wagner wrote both the libretto and the music for each of his stage works.
Instance: 0.91 (Wagner; wrote; both the libretto and the music for each of his stage works)
At the end, critical reactions ranged between that of the Norwegian composer Edvard Grieg, who thought the work was divinely composed, and that of the French newspaper Le Figaro, which called the music the dream of a lunatic.
Instance: 0.92 (critical reactions; ranged; between that of the Norwegian composer Edvard Grieg, who thought the work was divinely composed, and that of the French newspaper; T:At the end)
Instance: 0.90 (the French newspaper; called; the music; the dream of a lunatic)
Instance: 0.87 (the Norwegian composer; thought; the work was divinely composed)
Instance: 0.68 Context(the Norwegian composer thought):(the work; was divinely composed; )
Instance: 0.39 (Edvard Grieg; [is] composer [from]; Norway)
At the age of 21 he moved to Vienna, where he began studying composition with Joseph Haydn, and gained a reputation as a virtuoso pianist
Instance: 0.64 (he; moved; to Vienna; T:At the age of 21)
Instance: 0.55 (he; gained; a reputation as a virtuoso pianist; T:At the age of 21)
Instance: 0.43 (he; began; studying composition with Joseph Haydn)
Instance: 0.39 Context(he began):(he; began studying; composition)
Cons:
Note: This is successor of Ollie Algorithm
@ansjin @RBirkeland @krishenk so should we go with OPEN IE then? Is this feasible to implement in your opinion?
I would also love to hear @vviro's thoughts on this decision.
Would it be possible to run at least a small benchmark on more than a couple of sentences and compare the run times and the memory footprint of the processes? We don't know yet how many sentences we are going to end up processing, but knowing how the algorithms perform on, say, 1000 or 10000 sentences would give us a better idea of the computational requirements and the feasibility of the various options. That being said, OPEN IE results look quite nice.
I will try to run it on large set of sentences and get the run time and memory foot prints!
Cool!
I ran the algorithm for different number of sentences. Follwing are the results:
Used Memory: 2049
Free Memory: 468
Total Memory: 2518
Max Memory: 2731
Total Time: 132 s
Used Memory: 2409
Free Memory: 534
Total Memory: 2944
Max Memory: 2944
Total Time: 219 s
Used Memory: 2470
Free Memory: 473
Total Memory: 2944
Max Memory: 2944
Total Time: 1308 s
Sentences Used Complete Wiki page of Johann_Sebastian_Bach(https://en.wikipedia.org/wiki/Johann_Sebastian_Bach) and Wolfgang Amadeus Mozart(https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart) with some parsing(converting the texts to sentences and Unicode encoding).
Note: As the complete wiki page was used so some of the sentences were also the reference links and some useless stuff.
Laptop Config Intel Core i5 @2.67GHz RAM: 8GB
Note: No parallelism was involved (single thread, single core)
We have to check how fast it can run by running the tasks parallel.
Attached files of relationships. result_open_ie_619_sentences.txt result_open_ie_1619sentences.txt result_open_ie_11327_sentences.txt
2GBs of memory for one operation and it took a couple of minutes to compute? Am I seeing this correctly :question: :grey_question: :question:
I'm also super confused by the numbers here. How was that extracted?
@kordianbruck why you so much dislike my cool comment 😢
@sacdallago use the emojis. "cool" is like posting "+1" :stuck_out_tongue_closed_eyes:
+1
AHAHAHA
Sorry for spamming -- we can go back to being serious
@krishenk With your experience in scala, can we possible run OpenIE on spark natively? This might be able to improve the memory bottleneck?
@RBirkeland I will take a look at it.
Actually that algorithm is using Scala, so the memory used is the complete JVM heap (which I restricted to 3GB) and its for the complete process. I used runtime library of scala to figure out the memory.
This algorithm uses a lot of memory as also mentioned on their github page.
I guess we can close this as we have decided to use OpenIE and Ollie for now. This might still be a good resource for future extension.
Implementation continues in #18
We need to evaluate which relationship extraction algorithm we want to use.
Preferably it should work with given entities and also generate a rank for the relationship.
Let's start with gathering possible algorithms and afterwards evaluating which one we should go with.