clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License

How to extend to other languages? #8

Open napo opened 13 years ago

napo commented 13 years ago

Is there documentation that explains how to extend Pattern to another language (in my case, Italian)?

tom-de-smedt commented 12 years ago

Including other languages is non-trivial. You'd need a lexicon of Italian words for the Brill tagger, and grammar rules for nouns and verbs. I'm interested in support for German, Spanish, French and Italian but it might be a while (months) before I start looking into it. There's also TreeTagger which has Italian support, I believe.

gka commented 12 years ago

+1 on this. I'd really like to start using Pattern on German texts. What is the current state of this issue? How can I help?

tom-de-smedt commented 12 years ago

Some developer documentation has been added to the website on how to extend to other languages: http://www.clips.ua.ac.be/pages/pattern-dev#language

In short, what we'd need is a Brill lexicon in German, or a German reference corpus (like Brown or British National Corpus for English). Schneider and Volk have trained the Brill-tagger for German with decent results (95-96% accuracy): https://files.ifi.uzh.ch/cl/PAPERS/SchneiderVolk98.pdf

I don't know how responsive they would be to a request to merge their work into Pattern (I haven't asked), but this would be the logical place to start. It may help if we can show them that there is a real need for this.
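For readers unfamiliar with the term: a Brill lexicon is essentially a table that maps each known word to its most likely part-of-speech tag. Given a tagged reference corpus, a first version of such a table could be derived roughly as sketched below. This is a minimal illustration, not Pattern's own code; the function name `build_lexicon` and the tiny sample corpus are made up for the example.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_sentences):
    """Map each word to the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # Keep only the single most frequent tag per word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


# Hand-made sample, NOT real corpus data (tags are illustrative).
corpus = [
    [("die", "DT"), ("Katze", "NN"), ("liegt", "VB")],
    [("die", "DT"), ("Matte", "NN")],
    [("die", "PRP"), ("liegt", "VB")],
]

lexicon = build_lexicon(corpus)
```

Here "die" is seen twice as a determiner and once as a pronoun, so the lexicon stores the determiner tag. A real reference corpus would need to be far larger for these counts to be reliable.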

tom-de-smedt commented 12 years ago

Support for German is under way. I've asked Schneider & Volk and am currently integrating their work. Once it's finished and has their approval, we'll have a new pattern.de module. Once it's there, any help fine-tuning it will be appreciated, since I am not a native German speaker.

gka commented 12 years ago

Yeah, that's great! Will try to help w/ the fine-tuning. Looking forward to this! Thanks.

tom-de-smedt commented 12 years ago

Pattern 2.4 includes a German tagger-chunker.

If you download the new version, you can try it out with:

from pattern.de import parse
print(parse(u'Die Katze liegt auf der Matte.'))

To elaborate on this: the language model was contributed by Schneider & Volk (1998). They report the accuracy of the tagger at around 95% for text with 15% unknown words (i.e., 15% of the words are not in the Brill word lexicon). The tagger will report any verb as "VB" (so no "VBG" or "VBZ"), as it is not that easy to determine verb tense in German, as I understand it.

The tagger-chunker can lemmatize plural nouns and conjugated verbs, using a probabilistic approach. For noun singularization, the accuracy is around 84%; for verb conjugation it is 87%. This is not very impressive but it is a start.

You can examine the lemmatization algorithms in pattern/text/de/inflect/__init__.py. They may benefit from your suggestions since, as I mentioned, I am not a native German speaker. I've included a few source code comments about German grammar to the best of my ability. The Brill lexicon and lexical rules can be examined in pattern/text/de/parser/. If you have any remarks I will be happy to hear them. T
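By default, parse() returns a tagged string in which each token is written as word/part-of-speech/chunk/PNP. A small helper like the one below could turn such a string into dictionaries that are easier to inspect. This is only a sketch: `read_tagged` is a hypothetical helper, not part of Pattern, and the sample string is typed by hand for illustration rather than copied from real tagger output.

```python
def read_tagged(tagged, fields=("word", "pos", "chunk", "pnp")):
    """Split a slash-formatted tagged string into one list of dicts per sentence."""
    sentences = []
    for line in tagged.splitlines():
        # Tokens are separated by spaces, token fields by slashes.
        tokens = [dict(zip(fields, token.split("/"))) for token in line.split()]
        if tokens:
            sentences.append(tokens)
    return sentences


# Illustrative string in the word/POS/chunk/PNP layout (an assumption,
# not actual output of pattern.de).
sample = "Die/DT/B-NP/O Katze/NN/I-NP/O liegt/VB/B-VP/O"

sentences = read_tagged(sample)
nouns = [t["word"] for t in sentences[0] if t["pos"] == "NN"]
```

With a real parse() result you would pass the returned string instead of `sample`.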

adrianva commented 11 years ago

Hi! I'd like to help you with the Spanish language, even though I am only just starting to learn about NLP.

Greets!

tom-de-smedt commented 11 years ago

Hi Adrián,

Support for Spanish would be great. To accomplish this, we either need a free Spanish Brill tagger, or a dataset of manually tagged Spanish text to train Brill's algorithm on. Some more information is here: http://en.wikipedia.org/wiki/Brill_tagger
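As a rough picture of what Brill's algorithm learns from tagged text: every word first gets its most likely tag, and then an ordered list of contextual rules corrects mistakes ("change tag A to B if the previous tag is C"). The sketch below only applies such rules; learning them from data is the hard part and is not shown. The rule format, the function `apply_rules`, and the Spanish example are simplified assumptions for illustration, not taken from any existing Spanish tagger.

```python
def apply_rules(tagged, rules):
    """Apply ordered (from_tag, to_tag, previous_tag) rules to (word, tag) pairs."""
    tagged = list(tagged)
    for from_tag, to_tag, previous_tag in rules:
        for i in range(1, len(tagged)):
            word, tag = tagged[i]
            if tag == from_tag and tagged[i - 1][1] == previous_tag:
                tagged[i] = (word, to_tag)
    return tagged


# Hypothetical initial tagging: "casa" is wrongly tagged as a verb.
initial = [("la", "DT"), ("casa", "VB"), ("es", "VB"), ("grande", "JJ")]

# Hypothetical rule: a "verb" directly after a determiner is really a noun.
rules = [("VB", "NN", "DT")]

corrected = apply_rules(initial, rules)
```

The rule fixes "casa" but leaves "es" untouched, because "es" does not follow a determiner.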

After doing a quick search on Google I found the following:

There is a Spanish Brill tagger here but it is unclear what software license they use: http://www.findthatzip.com/search-17889607-hZIP/winrar-winzip-download-Spanish-Brill.zip.htm

Here is a paper by Aone & Hausman that explains how they trained a Spanish Brill tagger: http://acl.ldc.upenn.edu/C/C96/C96-1011.pdf It might be interesting to contact them.

Work on a manually tagged dataset is here: http://www.lllf.uam.es/~sandoval/UAMTreebank.html But their license is for non-commercial use only, so we'd need to contact them to hear if we could use it for Pattern (which has a free license).

Best, Tom

adrianva commented 11 years ago

Thanks for the answer. I am reading the links in order to get used to the process.

Thanks again for the information!

Greets, Adrián.

tom-de-smedt commented 11 years ago

Support for Spanish is coming. Right now I have something that is about 88-92% accurate. I will give it a few more days to try and improve the accuracy, and then push the code to GitHub.

There will be some changes to the API, because in its current state Pattern is designed for Germanic languages (such as English); and Romance languages (such as Spanish) are somewhat different / more complex.

adrianva commented 11 years ago

Wow! I'm looking into it.

I'm trying to develop that feature but it's harder than I thought; so that update would be perfect.

Thanks for your work!

mmaker commented 11 years ago

I would like to work on the Italian support. Do you think http://dslo.unibo.it/coris_eng.html is fine, Tom?

Esiravegna commented 11 years ago

Hey, first and foremost, thanks a lot for your work. A quick question: is a sentiment module for Spanish planned? If so, any idea how it would work? Would you be able to share the code?

Thanks a lot!

tom-de-smedt commented 11 years ago

I'd love to, but my Spanish may not be good enough to assess polarity (good / bad) of Spanish words. The basic idea is that you have a list of frequently used Spanish words (e.g., adjectives like "fantástico" and "aburrido"), each with a score between -1.0 and +1.0. You could find the most frequent words by counting words in a large collection of texts, for example 10,000 Amazon.es product reviews. You could also manually compose a list of adjectives from the web, for example from Wiktionary, or simply by Googling. Then, you'd need some reviewers to assign a score to each word (1 reviewer is better than 0, but 2-3 give more reliable averaged scores).
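The two steps described above (find frequent words, then average the reviewers' scores) could look roughly like this. It is a minimal sketch: the function names, the sample reviews and the scores are invented for illustration, and no filtering by part of speech is done, so the frequency list will also contain non-adjectives.

```python
import re
from collections import Counter


def frequent_words(texts, n=10):
    """Return the n most frequent lowercase words in a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(n)]


def average_scores(reviews):
    """Average per-word scores (-1.0 to +1.0) over all reviewers."""
    collected = {}
    for scores in reviews.values():
        for word, score in scores.items():
            collected.setdefault(word, []).append(score)
    return {word: sum(s) / len(s) for word, s in collected.items()}


# Invented sample data, standing in for thousands of product reviews.
texts = ["Un libro fantástico", "Fantástico, nada aburrido", "Muy aburrido"]
candidates = frequent_words(texts, n=2)

# Invented scores from two hypothetical reviewers.
reviews = {
    "reviewer_a": {"fantástico": 1.0, "aburrido": -0.5},
    "reviewer_b": {"fantástico": 0.8, "aburrido": -0.7},
}
lexicon = average_scores(reviews)
```

Averaging is the simplest way to combine reviewers; words on which the reviewers disagree strongly might be better left out of the list.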

If you can provide me with such a list I can do the rest of the work. Also, let me know if I can help out by writing code to mine for frequent Spanish adjectives.

French sentiment analysis was done during a workshop at Fabelier hacklab (Paris University). The workshop notes and scripts are still online, with miners for Amazon.fr and other useful code. You may find this a useful resource: https://github.com/fabelier/tomdesmedt

Esiravegna commented 11 years ago

Thanks a lot, Tom! I'll keep you posted on the polarity list (let me know if this is the right channel). Now, I have a question. How is Pattern supposed to read something like 'Esto es una porqueria' (this sucks), given that porqueria is actually a noun?

Thanks a lot again!

tom-de-smedt commented 11 years ago

You can of course add strong words like "porqueria" to the list and assign a score to them. It doesn't have to be limited to adjectives. However, nouns and verbs are often harder to assign a score to, since they are usually associated with a feeling instead of expressing a feeling. "Explosion" might sound negative, but in "the party ended with an explosion of fireworks", it is not negative. In "I hate parties that end with an explosion of fireworks" it is not negative either, although the sentence as a whole is, since it contains "hate". But: "many dogs hate cats" is neither positive nor negative, etc.
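To make the limitation concrete, here is a naive scorer that simply averages the scores of the listed words found in a sentence. It is a sketch for illustration only, not how Pattern computes sentiment; the function `polarity` and the scores in the list are assumptions.

```python
import re


def polarity(sentence, lexicon):
    """Average the scores of all lexicon words that occur in the sentence."""
    words = re.findall(r"\w+", sentence.lower())
    scores = [lexicon[word] for word in words if word in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical scores; "explosion" is deliberately left out of the list.
lexicon = {"hate": -0.8, "porqueria": -0.7}

noun_score = polarity("Esto es una porqueria", lexicon)
fireworks_score = polarity("The party ended with an explosion of fireworks", lexicon)
dogs_score = polarity("Many dogs hate cats", lexicon)
```

The noun "porqueria" is picked up once it is in the list, and the fireworks sentence stays neutral. However, "Many dogs hate cats" also comes out negative, although the sentence expresses no opinion, which is exactly the kind of case a plain word list cannot handle.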

tom-de-smedt commented 11 years ago

Support for Italian is (finally!) coming. The steps of what we did are outlined here: http://www.clips.ua.ac.be/pages/using-wiktionary-to-build-an-italian-part-of-speech-tagger

The current revision is experimental. Fabio Marfia (Politecnico di Milano) is doing some work to improve it. Overall, the accuracy is currently 92% measured on the WaCKy corpus (Baroni, Ferraresi et al., 2009).
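An accuracy figure like this is typically the share of tokens whose predicted tag matches the tag in a manually checked (gold) corpus. A minimal sketch of that measurement, with a hypothetical function name and made-up tags rather than actual WaCKy data:

```python
def tagging_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Made-up tags for a five-token sentence; one tag is wrong.
gold = ["DT", "NN", "VB", "IN", "NN"]
predicted = ["DT", "NN", "VB", "IN", "JJ"]

accuracy = tagging_accuracy(predicted, gold)
```

On this toy example four of five tags match, so the accuracy is 0.8.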

kinow commented 9 years ago

Hi Tom!

I've used pattern for experimenting with new data quickly, but usually I utilize OpenNLP. I will try to follow what was done for Italian, but for Brazilian Portuguese. What do you think? Any pointers?

I found one page with data for a Brill tagger (I suppose), but I'm not familiar with Brill and can't say if its quality is acceptable: http://www.nilc.icmc.usp.br/nilc/tools/nilctaggers.html Thanks!!

imarban commented 8 years ago

Hi!

How is the work on sentiment analysis for Spanish going? Is anybody working on this?