Create a worker to do language detection

NAMD / pypln.backend

Pipeline for distributed Natural Language Processing, made in Python

http://pypln.org

GNU General Public License v3.0


Create a worker to do language detection #32

Closed by fccoelho 12 years ago

fccoelho commented 12 years ago

Implement language detection for PyPLN. Suggestion: http://code.google.com/p/chromium-compact-language-detector/

turicas commented 12 years ago

To install the package:

pip install chromium_compact_language_detector

To use it:

import cld
text = '...'  # must be a UTF-8 encoded string
result = cld.detect(text)
language, language_code = result[0].lower(), result[1]

fccoelho commented 12 years ago

It compiled just fine on my box. Maybe we can start using it. We should also use the details the detector offers, such as the percent confidence and the normalized score when there are multiple candidate languages: http://code.google.com/p/chromium-compact-language-detector/source/browse/bindings/python/README
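A minimal sketch of how a worker might keep those details. It assumes, based on my reading of the binding's README and not verified here, that cld.detect returns a 5-tuple (name, code, is_reliable, text_bytes_found, details) where details is a list of (name, code, percent, score) tuples. The function name summarize_detection and the dictionary keys are hypothetical, and the sample tuple is hand-written so the snippet runs without cld installed.

```python
def summarize_detection(result):
    """Turn a cld.detect()-style result tuple into a dictionary.

    Assumed shape (an assumption, see the binding's README):
    (name, code, is_reliable, text_bytes_found, details), where details
    is a list of (name, code, percent, score) tuples.
    """
    name, code, is_reliable, text_bytes_found, details = result
    # Keep every candidate language with its percent and normalized score.
    candidates = [
        {
            "language": cand_name.lower(),
            "code": cand_code,
            "percent": percent,
            "score": score,
        }
        for cand_name, cand_code, percent, score in details
    ]
    return {
        "language": name.lower(),
        "code": code,
        "reliable": bool(is_reliable),
        "bytes_found": text_bytes_found,
        "candidates": candidates,
    }


# Hand-written sample in the assumed shape; in the worker the tuple
# would come from cld.detect(text) instead.
sample = (
    "PORTUGUESE", "pt", True, 120,
    [("PORTUGUESE", "pt", 90, 55.0), ("SPANISH", "es", 10, 12.5)],
)
summary = summarize_detection(sample)
```

Storing the whole candidate list, not only the top language, would let later workers decide for themselves how much confidence they require.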

We should also make a note to distribute a precompiled binary, so that we don't need a full compilation toolchain on every node.

fccoelho commented 12 years ago

BTW, in the Dengue literature analysis project we already have multiple languages to handle, so it will be a great test case. We also need to adapt the POS-tagging worker to know which languages we have taggers for.
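One possible shape for that lookup, as a rough sketch only: a registry keyed by the language code the detector reports, with None returned when no tagger exists so the worker can skip tagging. The names TAGGERS and select_tagger and the registry entries are hypothetical placeholders, not existing PyPLN code.

```python
# Hypothetical registry: language code -> tagger identifier.
# The entries are placeholders, not a list of taggers PyPLN ships.
TAGGERS = {
    "en": "english_tagger",
    "pt": "portuguese_tagger",
}


def select_tagger(language_code, taggers=TAGGERS):
    """Return the tagger registered for a language code, or None."""
    if language_code is None:
        return None
    return taggers.get(language_code.lower())


tagger_for_pt = select_tagger("pt")
tagger_for_unknown = select_tagger("xx")  # no tagger registered
```

Returning None instead of raising keeps documents in unsupported languages flowing through the rest of the pipeline.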

turicas commented 12 years ago

I think pip always compiles everything. If the packager provides a binary/compiled form of the package, it can be installed using easy_install (I really don't know why pip doesn't have this feature, as it is intended to be a replacement for easy_install).

turicas commented 12 years ago

Fixed in 1f6f1fb93c5ddfbbdc2e0ac4d7533386433681f4 (pull request #49).