Henni commented 7 years ago

We need to evaluate which relationship extraction algorithm we want to use.

Preferably it should work with given entities and also generate a rank for the relationship.

Let's start with gathering possible algorithms and afterwards evaluating which one we should go with.

Henni commented 7 years ago

Algorithms I found so far:

DIPRE

Dual Iterative Pattern Relation Expansion Paper: Sergey Brin

Snowball

Paper: Eugene Agichtein, Luis Gravano based on DIPRE looks like it doesn't generate any ranking Python implementation (maybe interesting to compare our approach or generate testdata)

Other stuff to review

Distant Supervision Approach, DIRT (Lin & Pantel 2003), TextRunner (Banko et al. 2007) Demo: http://openie.allenai.org/, "Improved Dirt" (Yao et al. 2012)

Mostly taken from: http://courses.cs.washington.edu/courses/cse517/13wi/slides/cse517wi13-RelationExtraction.pdf http://courses.cs.washington.edu/courses/cse517/13wi/slides/cse517wi13-RelationExtractionII.pdf

sacdallago commented 7 years ago

@vviro thoughts on this?

vviro commented 7 years ago

I would also check out these projects:

http://reverb.cs.washington.edu/ http://nlp.stanford.edu/software/relationExtractor.html https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/patty/ https://github.com/U-Alberta/exemplar https://github.com/patverga/torch-relation-extraction

... and probably more interesting repos on GitHub. It doesn't have to be just one algorithm from the start - could be worth trying out several and see if good results are coming out of any of them.

pfent commented 7 years ago

You might just use similarity / distance measures to get started with a baseline…

Afterwards you can maybe experiment with more advanced algorithms and compare them against the baseline

sacdallago commented 7 years ago

If you represent the features as numerical vectors in js I might even have an out of the box solution for simple distance and similarity metrics (and I happen to know the developer of the tool :D :D )

https://www.npmjs.com/package/disi + https://github.com/sacdallago/disi

Henni commented 7 years ago

@RBirkeland do you have any news on this? @krishenk have you also worked on this task?

Henni commented 7 years ago

Ollie Information Extraction: https://homes.cs.washington.edu/~mausam/papers/emnlp12a.pdf

ansjin commented 7 years ago

Ollie Extraction Algorithm:

https://github.com/knowitall/ollie

Sentence: Johann Sebastian Bach was born in Eisenach, the capital of the duchy of Saxe-Eisenach, in present-day Germany, on 21 March 1685 O.S. (31 March 1685 N.S.)

Output: Johann Sebastian Bach was born in Eisenach, the capital of the duchy of Saxe-Eisenach, in present-day Germany, on 21 March 1685 O.S. (31 March 1685 N.S.) 0.9: (Johann Sebastian Bach; was born in; Eisenach) 0.832: (Johann Sebastian Bach; was born in; present-day Germany) 0.794: (Johann Sebastian Bach; was born on; 21) 0.575: (Eisenach; be the capital of; the duchy of Saxe-Eisenach) 0.555: (Johann Sebastian Bach; was born at; Eisenach) 0.541: (Eisenach; was born in; the capital of the duchy of Saxe-Eisenach) 0.407: (Johann Sebastian Bach; was born at; present-day Germany)

MITIE Algorithm:

https://github.com/mit-nlp/MITIE

Sentence: Johann Sebastian Bach was born in Eisenach, the capital of the duchy of Saxe-Eisenach, in present-day Germany, on 21 March 1685 O.S. (31 March 1685 N.S.)

Output: Tags output by this NER model are: PERSON LOCATION ORGANIZATION MISC Number of entities found: 6 Entity tag: PERSON Entity text: Johann Sebastian Bach Entity tag: LOCATION Entity text: Eisenach Entity tag: LOCATION Entity text: Saxe-Eisenach Entity tag: LOCATION Entity text: Germany Entity tag: LOCATION Entity text: O.S. Entity tag: LOCATION Entity text: N.S. Relation detector type: people.person.place_of_birth Johann Sebastian Bach BORN IN Eisenach

IEPY

https://github.com/machinalis/iepy

Not yet tested

RBirkeland commented 7 years ago

Semantic similarity

Building on @ansjin we will need to find the semantic similarity of relationships, f.ex: composed and wrote, played and performed.

Clustering of semantically similar words can be solved by brown's clustering, but luckily his theory has already been developed in WordNet.

Web interface to play around with WordNet: Link WordNet can be downloaded here: Link

In my testing Lin's relatedness measurements seems to work best and here is some of the results. composed & wrote = 1 composed & traveled = 0.25 played & performed = 0.966 went & travelled = 1 teach & learn = 1 teach & fruit = 0.2

Henni commented 7 years ago

@RBirkeland I still see a few issues with WordNet we have to think about:

We have to detect, that 'teach' and 'learn' have something to do with each other but are contrary in their meaning.
Also my quick experiment showed that while teach&learn=1 and learn&study=1, teach&study=0.2

RBirkeland commented 7 years ago

@Henni Ye, this is a weakness with WordNet because it bases its estimate on the training set, so therefor it will also not be able to rank completely dissimilar words such as learn & car, however this should be fine as long as we classify null as 0.

krishenk commented 7 years ago

@Henni and @RBirkeland : I also tried teach and study, and got the same result. I think we need to have a way to include prepositions in relation, like studied under or taught by (to include the direction of relationship in some sense), as taught by and taught mean just the opposite.

ansjin commented 7 years ago

I ran OPEN IE(https://github.com/allenai/openie-standalone) algorithm and results seems to be quite interesting and a little bit matching to what we manually did today. Check the following sentences and there results:

His father probably taught him to play the violin and harpsichord, and his brother Johann Christoph Bach taught him the clavichord and exposed him to much contemporary music

Instance: 0.49 (His father; probably taught; him; to play the violin and harpsichord)
Instance: 0.45 (him; to play; the violin and harpsichord)
Instance: 0.45 (his brother; taught; him; the clavichord)
Instance: 0.45 (his brother; exposed; him; to much contemporary music)
Instance: 0.39 (Johann Christoph Bach; [is] brother [of]; him)

In January 1703, shortly after graduating from St. Michael's and being turned down for the post of organist at Sangerhausen,[21] Bach was appointed court musician in the chapel of Duke Johann Ernst III in Weimar


Instance: 0.95 (Bach; was appointed; court musician in the chapel of Duke Johann Ernst in Weimar; L:In January 1703, shortly after graduating from St. Michael's and being turned down for the post of organist at Sangerhausen)
Instance: 0.39 (Sangerhausen; [is] Bach was appointed court musician in; the chapel)

Mozart's most famous pupil, whom the Mozarts took into their Vienna home for two years as a child, was probably Johann Nepomuk Hummel, a transitional figure between Classical and Romantic eras.


Instance: 0.90 (Mozart's most famous pupil; took; into their Vienna home; T:for two years as a child)
Instance: 0.94 (Mozart's most famous pupil; was probably; Johann Nepomuk Hummel)
Instance: 0.38 (Johann Nepomuk Hummel; [is] a transitional figure between; Classical)

In the course of 1782 and 1783, Mozart became intimately acquainted with the work of Johann Sebastian Bach and George Frideric Handel as a result of the influence of Gottfried van Swieten, who owned many manuscripts of the Baroque masters.

Instance: 0.98 (Mozart; became; intimately acquainted with the work of Johann Sebastian Bach and George Frideric Handel as a result of the influence of Gottfried van Swieten; T:In the course of 1782 and 1783)
Instance: 0.95 (Mozart; intimately acquainted; with the work of Johann Sebastian Bach and George Frideric Handel as a result of the influence of Gottfried van Swieten)
Instance: 0.93 (Gottfried van Swieten; owned; many manuscripts of the Baroque masters)

Wagner wrote both the libretto and the music for each of his stage works.

Instance: 0.91 (Wagner; wrote; both the libretto and the music for each of his stage works)

At the end, critical reactions ranged between that of the Norwegian composer Edvard Grieg, who thought the work was divinely composed, and that of the French newspaper Le Figaro, which called the music the dream of a lunatic.


Instance: 0.92 (critical reactions; ranged; between that of the Norwegian composer Edvard Grieg, who thought the work was divinely composed, and that of the French newspaper; T:At the end)
Instance: 0.90 (the French newspaper; called; the music; the dream of a lunatic)
Instance: 0.87 (the Norwegian composer; thought; the work was divinely composed)
Instance: 0.68 Context(the Norwegian composer thought):(the work; was divinely composed; )
Instance: 0.39 (Edvard Grieg; [is] composer [from]; Norway)

At the age of 21 he moved to Vienna, where he began studying composition with Joseph Haydn, and gained a reputation as a virtuoso pianist

Instance: 0.64 (he; moved; to Vienna; T:At the age of 21)
Instance: 0.55 (he; gained; a reputation as a virtuoso pianist; T:At the age of 21)
Instance: 0.43 (he; began; studying composition with Joseph Haydn)
Instance: 0.39 Context(he began):(he; began studying; composition)

Cons:

Takes a lot of memory.

Note: This is successor of Ollie Algorithm

Henni commented 7 years ago

@ansjin @RBirkeland @krishenk so should we go with OPEN IE then? Is this feasible to implement in your opinion?

I would also love to hear @vviro's thoughts on this decision.

vviro commented 7 years ago

Would it be possible to run at least a small benchmark on more than a couple of sentences and compare the run times and the memory footprint of the processes? We don't know yet how many sentences we are going to end up processing, but knowing how the algorithms perform on, say, 1000 or 10000 sentences would give us a better idea of the computational requirements and the feasibility of the various options. That being said, OPEN IE results look quite nice.

ansjin commented 7 years ago

I will try to run it on large set of sentences and get the run time and memory foot prints!

sacdallago commented 7 years ago

Cool!

ansjin commented 7 years ago

I ran the algorithm for different number of sentences. Follwing are the results:

Number of Sentences: 619

Used Memory:  2049
Free Memory:  468
Total Memory: 2518
Max Memory:   2731
Total Time: 132 s

Number of Sentences: 1619

Used Memory:  2409
Free Memory:  534
Total Memory: 2944
Max Memory:   2944
Total Time: 219 s

Number of Sentences: 11327

Used Memory:  2470
Free Memory:  473
Total Memory: 2944
Max Memory:   2944
Total Time: 1308 s

Test Environment

Sentences Used Complete Wiki page of Johann_Sebastian_Bach(https://en.wikipedia.org/wiki/Johann_Sebastian_Bach) and Wolfgang Amadeus Mozart(https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart) with some parsing(converting the texts to sentences and Unicode encoding).

Note: As the complete wiki page was used so some of the sentences were also the reference links and some useless stuff.

Laptop Config Intel Core i5 @2.67GHz RAM: 8GB

Note: No parallelism was involved (single thread, single core)

We have to check how fast it can run by running the tasks parallel.

Attached files of relationships. result_open_ie_619_sentences.txt result_open_ie_1619sentences.txt result_open_ie_11327_sentences.txt

kordianbruck commented 7 years ago

2GBs of memory for one operation and it took a couple of minutes to compute? Am I seeing this correctly :question: :grey_question: :question:

sacdallago commented 7 years ago

I'm also super confused by the numbers here. How was that extracted?

@kordianbruck why you so much dislike my cool comment 😢

kordianbruck commented 7 years ago

@sacdallago use the emojis. "cool" is like posting "+1" :stuck_out_tongue_closed_eyes:

sacdallago commented 7 years ago

+1

AHAHAHA

Sorry for spamming -- we can go back to being serious

RBirkeland commented 7 years ago

@krishenk With your experience in scala, can we possible run OpenIE on spark natively? This might be able to improve the memory bottleneck?

krishenk commented 7 years ago

@RBirkeland I will take a look at it.

ansjin commented 7 years ago

Actually that algorithm is using Scala, so the memory used is the complete JVM heap (which I restricted to 3GB) and its for the complete process. I used runtime library of scala to figure out the memory.

This algorithm uses a lot of memory as also mentioned on their github page.

Henni commented 7 years ago

I guess we can close this as we have decided to use OpenIE and Ollie for now. This might still be a good resource for future extension.

Implementation continues in #18

MusicConnectionMachine / RelationshipsG3

Choose Relationship Extraction Algorithm #3

DIPRE

Snowball

Other stuff to review

Ollie Extraction Algorithm:

MITIE Algorithm:

IEPY

Semantic similarity

Number of Sentences: 619

Number of Sentences: 1619

Number of Sentences: 11327

Test Environment