chnm / serendipomatic

http://serendipomatic.org/

Dealing with stop words and NER in multilingual texts #114

Open · mialondon opened this issue 11 years ago

mialondon commented 11 years ago

Is the current workflow: 'detect language, apply appropriate stopwords' or 'apply generic multilingual stopwords'? If it's the former, can we detect multiple languages and apply the appropriate lists of stopwords?

As this conversation hints (https://twitter.com/wilkohardenberg/status/363677752391516161), many scholars work in two or more languages, so ideally we could cope with returning entities and tokens for at least two languages and also apply the appropriate stop words.

The trickiness of dealing with this might also be a call for more randomness in the way query terms are mixed so people can refresh the results and see different terms applied.

rlskoeser commented 11 years ago

The current workflow is to detect the language using the Python guess-language package and then select the appropriate stopwords if it's a language NLTK has stopwords for. I hadn't thought about mixed languages, though. It might be helpful to have some sample mixed-language text so we can see what guess-language thinks of it and write some tests.
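As a minimal sketch of that workflow (the ISO-code-to-NLTK-name mapping is illustrative, not the project's actual table):

```python
# Sketch of the current workflow: guess the language, then strip stopwords
# if NLTK has a list for it. The mapping below is illustrative only.
from guess_language import guessLanguage
from nltk.corpus import stopwords

ISO_TO_NLTK = {'en': 'english', 'fr': 'french', 'de': 'german', 'it': 'italian'}

def remove_stopwords(text):
    lang = guessLanguage(text)          # ISO 639-1 code, or 'UNKNOWN'
    tokens = text.lower().split()
    nltk_name = ISO_TO_NLTK.get(lang)
    if nltk_name and nltk_name in stopwords.fileids():
        stops = set(stopwords.words(nltk_name))
        tokens = [t for t in tokens if t not in stops]
    return tokens
```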

wilkohardenberg commented 11 years ago

Here is some text that hugely confuses the guess-language function:

Later, pressure increased to focus less on animal conservation and more on the welfare of urban-dwellers and tourism promotion. From 1930 hunting permits were sold, and in 1932 the journal of the Italian Alpine Club published an article proposing to transform the Gran Paradiso into a sort of huge open-air zoological garden, with all the features of an urban park. In the same years the Aostan autonomist politician Emile Chanoux lamented that until then the park had stressed its scientific aims too much, forgetting to respond to what he called its “social function”: "Ma il Parco non deve essere fine a se stesso; deve avere oltre che una funzione scientifica, anche una funzione sociale, deve essere un richiamo per le folle per una vita sana e naturale, deve essere una sorgente di vita per le popolazioni delle montagne sui cui è costituito, deve essere anche (e perché no?) la grande riserva di caccia della Nazionale, poiché anche questo sport della caccia ha motivo di sussistere per le sue utilità sociali." [English: "But the Park must not be an end in itself; beyond a scientific function it must also have a social function, it must be a draw for the crowds towards a healthy and natural life, it must be a source of life for the mountain populations on whose slopes it is established, it must also be (and why not?) the great hunting reserve of the Nation, since even the sport of hunting has reason to exist for its social benefits."]

mialondon commented 11 years ago

After discussing it with my friendly local multilingual historian and thinking over Wilko's issue, I wonder if there are two parts to the problem: the first is dealing with stop words in the appropriate languages, the second is NER (named entity recognition) in other languages. Does DBpedia automatically query Wikipedia content from all languages or just English? If not, can we use the current language detection to query the appropriate instances as well as applying different sets of stop words? Thoughts, @moltude?

Also thanks @wilkohardenberg for your input and earlier comments!
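If DBpedia Spotlight does turn out to have per-language endpoints, selecting one could look roughly like this (the endpoint URL pattern is an assumption for illustration, not confirmed API documentation):

```python
# Hypothetical sketch: pick a DBpedia Spotlight endpoint by detected
# language; the URL pattern is assumed, not taken from verified docs.
import requests

SPOTLIGHT_ENDPOINTS = {
    'en': 'https://api.dbpedia-spotlight.org/en/annotate',
    'fr': 'https://api.dbpedia-spotlight.org/fr/annotate',
    'it': 'https://api.dbpedia-spotlight.org/it/annotate',
}

def annotate(text, lang='en'):
    # fall back to the English endpoint for unmapped languages
    url = SPOTLIGHT_ENDPOINTS.get(lang, SPOTLIGHT_ENDPOINTS['en'])
    resp = requests.get(url, params={'text': text},
                        headers={'Accept': 'application/json'})
    resp.raise_for_status()
    return resp.json()
```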

moltude commented 11 years ago

I'm still thinking about this but I have a couple of thoughts so far:

I'm still chewing on this so any additional thoughts would be appreciated.

mialondon commented 11 years ago

Useful points, thanks! We could possibly assume that any non-English text is more pertinent and prioritise those queries - but do we actually need to run separate queries against the search APIs or do we just add non-English terms into the mix?

briancroxall commented 11 years ago

Perhaps in the meantime we can make it clear that Serendip-o-matic 1.0 only supports English-language text?

mbwolff commented 11 years ago

Hi everyone. I sent the pull request for FR stop words and was referred to this discussion (thanks Mia!). One way to solve this problem might be to break a text up into chunks and run guess-language on each chunk, aggregating the results to build a list of search terms. Chunks could be separated by punctuation and line breaks. This should work for Wilko's text above. For single words and short phrases from one language inserted into a text written mainly in another language, it may be too much trouble to determine the different languages.
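A rough sketch of that chunking approach, again assuming the guessLanguage function from the Python guess-language package (chunk boundaries here are simplified to sentence-ending punctuation and blank lines):

```python
# Split text into chunks, guess each chunk's language, and group the
# chunks by language for downstream processing.
import re
from collections import defaultdict
from guess_language import guessLanguage

def detect_by_chunk(text):
    chunks = re.split(r'(?<=[.!?])\s+|\n{2,}', text)
    by_language = defaultdict(list)
    for chunk in chunks:
        if chunk.strip():
            by_language[guessLanguage(chunk)].append(chunk)
    return dict(by_language)   # e.g. {'en': [...], 'it': [...]}
```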

mialondon commented 11 years ago

I was thinking paragraphs, as detected by the various forms of line break (assuming they still differ slightly between OSs). How does that sound?
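For illustration, a paragraph splitter that tolerates the different OS line endings might look like this (a sketch, not project code):

```python
# Handle Windows (\r\n), old Mac (\r) and Unix (\n) line endings, then
# treat blank lines as paragraph boundaries.
import re

def split_paragraphs(text):
    normalised = re.sub(r'\r\n?', '\n', text)   # normalise line endings
    return [p for p in re.split(r'\n\s*\n', normalised) if p.strip()]
```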

wilkohardenberg commented 11 years ago

If feasible it sounds good to me. Single words or short sentences should not be too much of a problem in most cases. I wonder however how this should work on a Zotero library: separate language guessing for each entry?

mbwolff commented 11 years ago

Paragraphs are natural chunks, so that works for me.

mialondon commented 11 years ago

Can https://github.com/chnm/serendipomatic/issues/78 be resolved at the same time?

We'll also have XML markup in various forms if people try copying other reference library formats - @moltude and @amrys came up with a good example of that

rlskoeser commented 11 years ago

Working by paragraph sounds like a feasible solution, although I worry about how that will scale to larger texts (although I suppose there are probably lots of parts of the code where larger text may cause issues). I also wonder if I could adapt the guess-language code to give multiple languages back if there are multiple languages with very high scores - it looks like it might be possible from glancing at the code, but I would need to experiment some. Is there likely to be a problem with combining stop words from all the languages detected? Although that doesn't help as much for knowing which DBpedia Spotlight endpoint to use, I guess.
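For what it's worth, combining the lists would just be a set union over NLTK's per-language stopwords (a sketch; the language names are NLTK corpus ids):

```python
# Union of NLTK stopword lists for every detected language; as the next
# comment points out, this can over-filter across languages.
from nltk.corpus import stopwords

def combined_stopwords(nltk_langs):
    stops = set()
    for lang in nltk_langs:                # e.g. ['english', 'german']
        stops |= set(stopwords.words(lang))
    return stops
```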

As for #78 - we probably need some simple input type detection first - plain text, html/xml, csv, etc - and then do some pre-processing based on the input format before generating search terms.
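Something as rough as this might do for a first pass at input-type detection (the heuristics are illustrative placeholders, not a spec):

```python
# Very rough input-type sniffing: markup starts with '<', csv is whatever
# the csv module's Sniffer recognises, everything else is plain text.
import csv

def detect_input_type(text):
    if text.lstrip().startswith('<'):
        return 'xml/html'
    try:
        csv.Sniffer().sniff(text[:1024], delimiters=',\t;')
        return 'csv'
    except csv.Error:
        return 'plain'
```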

mbwolff commented 11 years ago

Hi everyone. Combining stop words from different languages will create problems, e.g. "den" is an article in German and a noun in English.


mialondon commented 11 years ago

We don't need to keep the paragraph structure, just pass things into a bucket for the appropriate language, then push each bucket through the appropriate tokenisation, stop word and entity recognition steps... Though we might want to adjust the mix of query terms according to the proportional amount of each language - too fussy?
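A hedged sketch of the bucket-and-mix idea; select_terms here is a hypothetical stand-in for the real tokenisation/stop word/NER steps, and the proportional weighting is just one possible scheme:

```python
# Group chunks by detected language, then draw query terms roughly in
# proportion to each language's share of the text.
import random

def select_terms(chunks, lang):
    # placeholder for the real per-language tokenise / stop word / NER pipeline
    return list({w for chunk in chunks for w in chunk.split()})

def mix_terms(buckets, total_terms=10):
    # buckets maps a language code to the list of chunks detected in it
    total_chunks = sum(len(chunks) for chunks in buckets.values())
    if not total_chunks:
        return []
    mixed = []
    for lang, chunks in buckets.items():
        share = max(1, int(round(total_terms * len(chunks) / float(total_chunks))))
        terms = select_terms(chunks, lang)
        mixed.extend(random.sample(terms, min(share, len(terms))))
    return mixed
```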

(At some future point we may want to use the languages detected to query for objects from particular cultures or in particular languages, but that'd need to be considered carefully in relation to 'serendipity' and any future 'hint' function)

mialondon commented 11 years ago

Just a note that it might be easiest to work out and document design decisions on the wiki then return here to finish integrating them https://github.com/chnm/serendipomatic/wiki/Serendipomatic-architecture

mialondon commented 10 years ago

Do we need a chat to decide on the best solution? If so, who's interested?

mbwolff commented 10 years ago

I'm interested.

moltude commented 10 years ago

I'm also ready to dig back in on this.


rlskoeser commented 10 years ago

I'm interested too.

mialondon commented 10 years ago

Cool! Is there an asynchronous way we can talk through the options or should we try for a chat? (I'm complicating things slightly by being in a completely different timezone).

moltude commented 10 years ago

I can make time 9-5, Monday to Friday, for a chat if that makes the timezone problem easier (Mia, are you GMT?). Other than a chat, I think the best way is to post to the GitHub issue tracker. Other ideas?

Thursday or Friday would be the best day for me this week if we wanted to set up a chat.


mbwolff commented 10 years ago

This Friday afternoon (10/18), US East Coast time, would work for me. Could we videoconference?


mialondon commented 10 years ago

I'm GMT+11, the other East Coast Time (I'm in Australia). I could just about do 7am here, though I'd make more sense at 8am! http://www.timeanddate.com/worldclock/meetingtime.html?iso=20131018&p1=240&p2=179

mbwolff commented 10 years ago

I can meet Friday 8:00 AM Mia's time (Thursday 5:00 PM my time).


mialondon commented 10 years ago

Skype? I don't have a camera on the dinosaur laptop I'm travelling with so it's voice-only for me at the best of times.

moltude commented 10 years ago

Thursday 5:00 EST on Skype would work for me.


rlskoeser commented 10 years ago

I'm available on Thursday at 5pm EST too. Is Skype audio conference calling free? How do we exchange Skype account names (prefer not to post them publicly, obviously)? When the OWOT team did a video/audio chat last week it was kind of laggy and a bit difficult to communicate at times, which makes me wonder if a text chat might be more useful - but I guess Skype has a chat tool built in that we can use if the audio is too laggy, right? Alternatively we could try a Google+ hangout if we want to do video for those who have cameras.

mialondon commented 10 years ago

The document for collecting sample text for testing is 'Help us collect multilingual text for testing Serendip-o-matic' https://docs.google.com/document/d/100UygYyACS7tgU70FYpc4d00NTwoXaDzDmSUCu3naJE/edit#

mialondon commented 10 years ago

Here's a record of the decisions reached during our chat:

a) set up analytics to keep track of word count and languages
b) the hint function is still useful future functionality; add language as an option
c) start at sentence level; the most common language determines which is used (sketched below)
d) collect multilingual text samples for testing (incl. poetry, TEI, whatever)
e) check whether DBpedia is multilingual (I think the answer was yes?)
f) these changes drive the need for parallelisation
g) help text on formatting text input (e.g. how to prepare BibTeX, TEI etc formatted text for inclusion)
h) html/xml/whatever detection and graceful management
i) check language options in source APIs
j) refactor so Zotero input arrives at the detection process looking like any other text

Of those, a, f and g will be new issues; b adds weight to #11; h is related to #78; and c, d, e, i and j relate to the original issue.
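A minimal sketch of decision (c), assuming the guessLanguage function from the Python guess-language package and naive sentence splitting:

```python
# Guess the language per sentence and let the most common result stand
# for the whole text; 'UNKNOWN' votes are discarded.
import re
from collections import Counter
from guess_language import guessLanguage

def dominant_language(text):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    votes = Counter(guessLanguage(s) for s in sentences if s.strip())
    votes.pop('UNKNOWN', None)
    return votes.most_common(1)[0][0] if votes else 'UNKNOWN'
```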

mialondon commented 10 years ago

Slightly off-topic, but this article on NER might be worth a look: 'Exploring Entity Recognition and Disambiguation for Cultural Heritage Collections' http://freeyourmetadata.org/publications/named-entity-recognition.pdf