cltk / cltk

The Classical Language Toolkit
http://cltk.org
MIT License
835 stars · 328 forks

Add Levenshtein distance enhancements #130

Closed kylepjohnson closed 8 years ago

kylepjohnson commented 8 years ago

We have a good implementation now by @lukehollis . Now just a matter of thinking a little more about interface. The following, from my comment to Luke's PR, are things for us to think about for next iteration:

kylepjohnson commented 8 years ago

Two thoughts about Greek to add to this:

lukehollis commented 8 years ago

Great thoughts here--this sounds like it has a lot of potential. I can add to the docs and write tests.

Yes to the language-agnostic version: I think a good example of this is people who want to compare Latin and Greek.

I was dreaming at one point about converting Latin and Greek into some kind of phonetic metalanguage, but settled for trying romanize: https://github.com/gschizas/RomanizePython/blob/master/romanize/__init__.py There are a lot of good resources there, but it wasn't quite what was needed:

from romanize import romanize
from fuzzywuzzy import fuzz
latin = "Pleiadas, Hyadas, claramque Lycaonis Arcton"
greek = "Πληιάδες θ᾽ Ὑάδες τε τό τε σθένος Ὠαρίωνος"
romanized = romanize(greek)
# >>> 'Pliiades th᾽ Ὑades te to te sthenos Ὠarionos'
fuzz.ratio(latin, romanized)
# >>> 37
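For anyone reading along, here is a rough, standard-library-only sketch of what that romanize-then-compare step does. The transliteration table and function names below are my own toy stand-ins for romanize, and difflib.SequenceMatcher stands in for fuzzywuzzy's Levenshtein-based ratio:

```python
import unicodedata
from difflib import SequenceMatcher

# Minimal Greek-to-Latin transliteration table (illustrative, not complete).
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "i", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "o",
}

def transliterate(text):
    """Lowercase, strip diacritics, then map Greek letters to Latin."""
    stripped = "".join(
        c for c in unicodedata.normalize("NFD", text.lower())
        if not unicodedata.combining(c)
    )
    return "".join(GREEK_TO_LATIN.get(c, c) for c in stripped)

latin = "Pleiadas, Hyadas, claramque Lycaonis Arcton"
greek = "Πληιάδες θ᾽ Ὑάδες τε τό τε σθένος Ὠαρίωνος"
ratio = SequenceMatcher(None, latin.lower(), transliterate(greek)).ratio()
```

The low ratio even after transliteration is the point: surface string comparison across languages loses most of the phonetic resemblance a reader hears.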

Of all the varied types of intertextuality, three stood out to me: phonetic, semantic, and thematic. I was mostly influenced at this point by Richard Thomas's book "Reading Virgil and His Texts" and Joseph Farrell's "Vergil's Georgics and the Traditions of Ancient Epic".

It seemed that fuzzy string matching with Levenshtein distance was the best (though far from perfect) match to some combination of semantic and phonetic comparison. I don't have either of those books with me right now, but if I dig them out of the PAs again, I'll look for good examples where Vergil and Hesiod both used interesting alliteration, or where lines sounded very close even when they weren't directly using the same words.

Otherwise, another next step would be to lemmatize or stem words before comparison.
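A minimal sketch of what stem-before-compare might look like. The suffix list and crude_stem are deliberately naive illustrations, not a real Latin stemmer, and difflib again stands in for fuzzywuzzy:

```python
from difflib import SequenceMatcher

# A deliberately naive Latin suffix stripper -- a real pipeline would use a
# proper lemmatizer; this only illustrates normalizing before comparison.
SUFFIXES = ("ibus", "arum", "orum", "que", "am", "as", "ae", "is", "os",
            "us", "um", "em", "es", "a", "e", "i", "o", "u", "m", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stemmed_ratio(line_a, line_b):
    """Compare two lines after lowercasing, de-punctuating, and stemming."""
    stems_a = " ".join(crude_stem(w.lower().strip(",.")) for w in line_a.split())
    stems_b = " ".join(crude_stem(w.lower().strip(",.")) for w in line_b.split())
    return SequenceMatcher(None, stems_a, stems_b).ratio()
```

Stemming collapses inflectional endings, so two passages sharing vocabulary in different cases score closer than a raw character comparison would suggest.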

kylepjohnson commented 8 years ago

So much here to discuss.

This is all rather hand-wavy, I know. Just thinking out loud: it'd be a lot of fun to share these with Thomas and Farrell once we have a few more working prototypes.

lukehollis commented 8 years ago

Continued this in https://github.com/cltk/cltk/pull/147. I moved the sentence and sliding-window methods to a separate TextReuse class, where we might implement string comparison methods other than a Levenshtein distance calculation in the future. Is the Levenshtein class worth keeping? I'm torn. It only wraps fuzzywuzzy's ratio method (which, if I remember correctly, wraps the same method from python-Levenshtein). Right now, the Levenshtein.ratio method is language agnostic.

I think at some point it will definitely be important to flag proper nouns and treat them differently. Great idea. I guess that bleeds into the named-entity recognition ideas we've talked about elsewhere.

Still no cross language capabilities yet.

The alliteration code looks great, @diyclassics. Any chance that some amount of that might be implemented for studying text reuse across classical authors? Would love to discuss this more.

kylepjohnson commented 8 years ago

Small update: I moved Luke's docs to the Multilingual section of the docs (commit here). (Note: if you want a new top-level page in the docs, you have to add the page name to index.rst.)

@lukehollis, update the docs here as you like, or I can sometime.

> Is the Levenshtein class worth keeping? I'm torn. It only wraps fuzzywuzzy's ratio method (which, if I remember correctly, wraps the same method from python-Levenshtein). Right now, the Levenshtein.ratio method is language agnostic.

Just thinking out loud… In the name of lowering the bar to NLP for Classicists, I think we would do well to keep it, at least for now. And since you're anticipating using algorithms other than Levenshtein, we'll need an abstract "ratio" (or similar) function to which you could pass the algorithm as a parameter.
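Something like the following is what I have in mind for the abstract ratio function, sketched with stdlib difflib as the only registered backend. The names (ratio, ALGORITHMS) are placeholders, not existing CLTK API; fuzzywuzzy's fuzz.ratio could be registered as another backend:

```python
from difflib import SequenceMatcher

def _difflib_ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

# Registry of comparison algorithms; Levenshtein-based backends such as
# fuzzywuzzy's fuzz.ratio could be added here once available.
ALGORITHMS = {"difflib": _difflib_ratio}

def ratio(string_a, string_b, algo="difflib"):
    """Compare two strings with the named algorithm, returning 0.0-1.0."""
    try:
        compare = ALGORITHMS[algo]
    except KeyError:
        raise ValueError("unknown comparison algorithm: %s" % algo)
    return compare(string_a, string_b)
```

A registry like this keeps the public interface stable while letting new comparison methods slot in behind one function.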

DrBronzeAge commented 8 years ago

This looks really cool, and a lot more streamlined than what I built. Two enhancements I might suggest, based on doing this for my own research and doing it freelance for others:

1) Adding a way to easily inspect possible matches. No one is going to publish research based only on the output of an algorithm; they'll want to verify it by hand. Making that process as painless as possible is important. I did it by writing out an HTML table [Sentence 1, Sentence 2, Words in common] and bolding the commonalities in S1 and S2. It makes it easy to evaluate quickly and at a glance.

2) You might want to add some way to visualize the distribution of intertextualities. That was the ask I got when I offered to help colleagues with this. A good visualization makes it easier for them to interpret and easier for them to explain their work to others.
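For what it's worth, the inspection table from point 1) can be sketched in a few lines of stdlib Python. The function names are hypothetical, and I'm assuming matches arrive as (sentence, sentence, common-words) tuples:

```python
import html

def bold_common(sentence, common):
    """Wrap words that appear in the common set in <b> tags."""
    words = []
    for w in sentence.split():
        key = w.lower().strip(",.;:")
        token = html.escape(w)
        words.append("<b>%s</b>" % token if key in common else token)
    return " ".join(words)

def inspection_table(matches):
    """matches: iterable of (sentence_a, sentence_b, common_words) tuples."""
    rows = []
    for s1, s2, common in matches:
        common = {w.lower() for w in common}
        rows.append(
            "<tr><td>%s</td><td>%s</td><td>%s</td></tr>"
            % (bold_common(s1, common), bold_common(s2, common),
               html.escape(", ".join(sorted(common))))
        )
    return ("<table><tr><th>Sentence 1</th><th>Sentence 2</th>"
            "<th>Words in common</th></tr>%s</table>" % "".join(rows))
```

Opening the resulting HTML in a browser gives the at-a-glance verification described above.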

kylepjohnson commented 8 years ago

@DrBronzeAge Your ideas are very welcome here.

re 1): I love it. In the name of simplicity, I recommend adding a flag csv_output=True/False

re 2): So far, CLTK has shied away from visual output. But I would be interested in hearing from you what a (say) .csv export would need to contain to enable good data inspection. We can talk about another CLTK sub-project, in its own repo, which allows for inspecting data visually.
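To make the question concrete, here is a sketch of what such a CSV export could look like, using only the stdlib csv module. The field names, and the author/work/locus values in the demo, are illustrative suggestions only, not a fixed schema:

```python
import csv
import io

def export_matches_csv(fileobj, matches):
    """Write comparison results, with passage metadata, as CSV rows.

    Each match is a dict keyed by the suggested field names below.
    """
    fields = ["author_a", "work_a", "locus_a", "text_a",
              "author_b", "work_b", "locus_b", "text_b", "ratio"]
    writer = csv.DictWriter(fileobj, fieldnames=fields)
    writer.writeheader()
    for match in matches:
        writer.writerow(match)

# Demo with made-up metadata values.
buf = io.StringIO()
export_matches_csv(buf, [{
    "author_a": "Vergil", "work_a": "Georgics", "locus_a": "1.138",
    "text_a": "Pleiadas, Hyadas, claramque Lycaonis Arcton",
    "author_b": "Homer", "work_b": "Iliad", "locus_b": "18.486",
    "text_b": "Πληιάδες θ᾽ Ὑάδες τε τό τε σθένος Ὠαρίωνος", "ratio": 0.37,
}])
```

Rows like these carry enough context (author, work, locus, both texts, score) for hand verification or downstream visualization without any extra lookup.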

Thanks again, please help us keep up the momentum on this. Is this the repo you did this work in? https://github.com/DrBronzeAge/LatinIntertextFinder_Alpha

lukehollis commented 8 years ago

Great feedback, thanks! I think exporting a CSV file seems like a good lowest common denominator that we can rely on for saving comparisons. I wonder if we could include totals for comparisons with a high similarity ratio for each author involved--I'm unsure how much should go into the module versus relying on the developer using the CLTK to write his/her own scripts.

For a first step toward adding the csv_output flag, I think we should give the programmer the opportunity to add a little metadata about the input strings (string_a is Homer, Iliad, Book 1; string_b is Hesiod, Theogony, etc.). Should we add this at the method level or the class level? I vote method but can see it both ways.

Otherwise, if we wanted to use the tools we're building ourselves, I think visualizing comparison data from documents in the CLTK repos might be a really interesting project to include with the Meteor frontend. I think a chord diagram from d3 might be appropriate for this task--something like http://bl.ocks.org/mbostock/1046712. We wouldn't even need very good metadata for comparing documents--we could work simply at the author-to-author level..? @DrBronzeAge, would you be able to add the HTML visualization to a page that we create at https://github.com/cltk/cltk_frontend if we had our own basic dataset to experiment with? Also, thoughts on a group of a few Latin authors we might start with? Eventually, it'd be great to extend this graph to include the authors of all the documents available through the frontend reading interface.
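For reference, d3's chord layout consumes a square matrix where cell [i][j] is the flow from entity i to entity j. A sketch of building one from author-pair counts (the counts below are made up, and chord_matrix is a hypothetical helper):

```python
import json

def chord_matrix(authors, scores):
    """Build a square matrix for d3's chord layout.

    scores: dict mapping (author_a, author_b) pairs to a reuse count;
    matrix[i][j] is the count from authors[i] to authors[j].
    """
    index = {name: i for i, name in enumerate(authors)}
    matrix = [[0] * len(authors) for _ in authors]
    for (a, b), count in scores.items():
        matrix[index[a]][index[b]] = count
    return matrix

# Made-up demo counts, serialized for the frontend.
authors = ["Vergil", "Hesiod", "Homer"]
scores = {("Vergil", "Hesiod"): 12, ("Vergil", "Homer"): 7}
payload = json.dumps(chord_matrix(authors, scores))
```

Working at the author-to-author level, as suggested above, means the matrix stays small even as the corpus grows.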

diyclassics commented 8 years ago

Luke, can’t wait to get a chance to experiment with this.

As for Latin authors, you might consider running some benchmark tests with Virgil and Lucan (esp. book 1)--this is the data that the Tesserae project (http://tesserae.caset.buffalo.edu/) has used for its tests. It could be useful to have comparative data from a project with similar aims. I can email you a list of articles if you're interested. --PJB

On Mar 3, 2016, at 12:02 AM, Luke Hollis notifications@github.com wrote:

> Also, thoughts on a group of a few Latin authors that we might start with? Eventually, it'd be great to extend this graph to include the authors of all the documents available through the frontend reading interface.

lukehollis commented 8 years ago

Pat, that sounds great! Would love the list of articles. Here's what I've got so far: https://github.com/lukehollis/augustan-era-intertext/blob/master/intertext.py

This is somewhat how I imagine the text_reuse module might be used by a programmer interested in studying intertextuality between classical authors. In the Intertext class, I load a bunch of documents from a few authors and then make a new instantiation of the TextReuse class for each document-to-document comparison. Then I save the resulting comparison objects into a Mongo collection so I can query them later--but only if the comparison has a Levenshtein distance ratio above 0.6; when I tried saving everything in the past, I generated tons of data that slowed down my queries. 0.6 is an arbitrary cutoff and should probably be lower once we figure out how to better compare across languages.
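The keep-only-above-threshold step might look roughly like this, with stdlib difflib standing in for the Levenshtein ratio (the function name and dict keys are mine, not the module's actual code):

```python
from difflib import SequenceMatcher

THRESHOLD = 0.6  # arbitrary cutoff, per the discussion above

def high_similarity_pairs(sentences_a, sentences_b, threshold=THRESHOLD):
    """Yield only the sentence pairs whose ratio clears the threshold."""
    for i, s1 in enumerate(sentences_a):
        for j, s2 in enumerate(sentences_b):
            r = SequenceMatcher(None, s1, s2).ratio()
            if r >= threshold:
                yield {"index_a": i, "index_b": j,
                       "text_a": s1, "text_b": s2, "ratio": r}
```

Filtering before persisting is what keeps the stored collection (Mongo or otherwise) small enough to query quickly.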

Otherwise, since string comparison is CPU-intensive, I imagine programmers will be much happier using simple multiprocessing. I'll let the intertext module linked above run overnight as-is and post results when it finishes.
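A minimal sketch of the multiprocessing idea with the stdlib Pool; difflib again stands in for the Levenshtein ratio, and the function names are illustrative rather than the module's actual API:

```python
from difflib import SequenceMatcher
from multiprocessing import Pool

def pair_ratio(pair):
    """Worker: compute a similarity ratio for one (sentence, sentence) pair."""
    s1, s2 = pair
    return s1, s2, SequenceMatcher(None, s1, s2).ratio()

def compare_all(sentences_a, sentences_b, processes=4):
    """Fan all cross-pairs out over a process pool; order is preserved."""
    pairs = [(s1, s2) for s1 in sentences_a for s2 in sentences_b]
    with Pool(processes=processes) as pool:
        return pool.map(pair_ratio, pairs)

if __name__ == "__main__":
    results = compare_all(["arma virumque cano"], ["arma virumque cano"])
```

Since each pairwise comparison is independent, the work parallelizes embarrassingly well across cores.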

For the text_reuse module, how much should we include in cltk core for data export / visualization versus relying on programmers to write their own scripts? I can see a lot of possibilities both ways.

kylepjohnson commented 8 years ago

Haven't followed all the details of this thread, but chiming in on one point:

> For the text_reuse module, how much should we include in cltk core for data export / visualization versus relying on programmers to write their own scripts? I can see a lot of possibilities both ways.

I am not dogmatic about it, but I am hesitant to put visualization code in the core software. In the website, I'll leave it for Luke to judge. My only word of advice would be to try one's best to make this flexible for other corpora and languages.

DrBronzeAge commented 8 years ago

Sorry I dropped out of sight there for the weekend. Going back up the chain: @kylepjohnson, that (https://github.com/DrBronzeAge/LatinIntertextFinder_Alpha) is the repo where I dumped this stuff, but it really was just a code dump--more notebook than module. That project went in multiple directions, and the intertext finder was only one part of it. I'm still cleaning it up and refactoring it to be halfway reusable, but it probably won't make anyone's eyes bleed now.

@lukehollis As for the visualization stuff, I'm not entirely sure I understand how the chord diagram would work in this case. Would we be trying to follow a phrase (or phrases) through a host of different authors and works?

A quick googling shows me that there is at least one implementation of hive plots for d3. They hew a little closer to how I think about these questions (we read things from beginning to end, so arranging sentences on a linear axis makes a lot of sense).

I've never used Meteor (nor even looked at the CLTK front end, if I'm being honest), but I could make a d3 template that takes an edge list of the kind I made for R and spits out a nice-ish hive plot, if that's what you're asking.
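Producing that kind of flat edge list is trivial with the stdlib csv module; a sketch, where the column names and the demo row (loci, weight) are only a suggestion:

```python
import csv
import io

def edge_list_csv(matches):
    """matches: iterable of (source_locus, target_locus, weight) rows.

    Emits the kind of flat edge list d3 (or R) can ingest directly.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["source", "target", "weight"])
    for source, target, weight in matches:
        writer.writerow([source, target, weight])
    return buf.getvalue()
```

One row per detected reuse, weighted by similarity, is enough for either a hive plot or the chord diagram discussed earlier.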

lukehollis commented 8 years ago

@DrBronzeAge, wow, that looks excellent! I'll check out the data format included on that example and get some example data looking like that. Do you have any authors you're interested in that I should include? Our text_reuse module is running much faster (~300x) thanks to @ferthalangur, so we can start adding authors iteratively here without too much trouble.

kylepjohnson commented 8 years ago

@ferthalangur 300x! :open_mouth:

lukehollis commented 8 years ago

Should've been 300%--my bad!

ferthalangur commented 8 years ago

I thought 300x would look a lot better on my annual performance review. :)

kylepjohnson commented 8 years ago

I think we've moved on from this issue, but we aren't likely to forget the concept, so I'm going to close this for now. Please re-open or make a new, more specific ticket as you see fit.