NaturalNode / natural

general natural language facilities for node
MIT License
10.63k stars · 860 forks

PoS Tagger for Brazilian Portuguese #350

Closed · diegodorgam closed this issue 7 years ago

diegodorgam commented 7 years ago

Hi there, I'm using this awesome lib in a chatbot project called hubot-natural, and I'm having trouble using the PoS Tagger feature to recognize Brazilian Portuguese. Is there any chance of having the Brill PoS tagger translated? Or is that something more complex than translating? If it helps with funding, I should say I'm willing to pay for its translation.

Thanks in advance.

Hugo-ter-Doest commented 7 years ago

It is not a matter of translating the tagger. You need to train the tagger in order to derive transformation rules from data. At the moment a training procedure is not included with the tagger here. NLTK does have a trainer. I think natural should have a trainer as well...

Hugo

diegodorgam commented 7 years ago

Right! I came to the same understanding after some research. The question now is: can we use some other training tools for this tagger? Here is what I found on the web, in case it's of any help:

http://lxcenter.di.fc.ul.pt/tools/en/LXTaggerEN.html
http://www.linguateca.pt/ferramentas_info.html
https://opennlp.apache.org/docs/1.7.2/manual/opennlp.html
http://streamhacker.com/ (this guy seems to use ngrams for training)
http://www.nilc.icmc.usp.br/nilc/tools/nilctaggers.html
https://sourceforge.net/projects/aelius/?source=typ_redirect

Thanks for any contribution!

Hugo-ter-Doest commented 7 years ago

For a Brill tagger you need a trainer that learns transformation rules of the form

OLD_CAT NEW_CAT PREDICATE PARAMETER

For instance:

NN CD CURRENT-WORD-IS-NUMBER YES
VBD NN PREV-TAG DT
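
As a minimal illustration of how such a rule behaves at tagging time (a sketch of the idea only, not the library's actual implementation; the function name and the example sentence are made up):

// Sketch: apply the rule "VBD NN PREV-TAG DT" to a tagged sentence.
// If a word is tagged VBD and the previous tag is DT, retag it as NN.
function applyPrevTagRule(taggedSentence, oldCat, newCat, prevTag) {
  for (var i = 1; i < taggedSentence.length; i++) {
    if (taggedSentence[i][1] === oldCat && taggedSentence[i - 1][1] === prevTag) {
      taggedSentence[i][1] = newCat;
    }
  }
  return taggedSentence;
}

// [["the","DT"],["walk","VBD"]] becomes [["the","DT"],["walk","NN"]]
console.log(applyPrevTagRule([["the", "DT"], ["walk", "VBD"]], "VBD", "NN", "DT"));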

NLTK includes a trainer for this (in Python): http://www.nltk.org/_modules/nltk/tag/brill_trainer.html

And, of course, there is the original code by Eric Brill (in C): http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z

Hugo

diegodorgam commented 7 years ago

Thanks @Hugo-ter-Doest, I'll give it a try...

Hugo-ter-Doest commented 7 years ago

I'm considering the following transformation-based learning algorithm for extending the Brill tagger: http://acl-arc.comp.nus.edu.sg/archives/acl-arc-090501d4/data/pdf/anthology-PDF/W/W94/W94-0111.pdf. It is fast at the cost of memory usage, because it stores the relevant rules at corpus locations (sites). Since this is quite an old paper, from 1994, I think we can live with some memory consumption.
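
For reference, the core idea of transformation-based learning fits in a few lines (a naive sketch; the function names and the rules-as-functions representation are mine, and the paper's contribution is precisely avoiding the full rescoring this loop performs on every pass):

// Naive TBL: start from an initial tagging, then repeatedly learn the
// candidate rule that removes the most errors against the gold tags.
// A candidate rule here is a hypothetical pure function
// (words, tags) -> new tags, used only for this sketch.
function trainTBL(words, goldTags, initialTags, candidateRules, minScore) {
  var tags = initialTags.slice();
  var learned = [];

  function errors(t) {
    var n = 0;
    for (var i = 0; i < t.length; i++) {
      if (t[i] !== goldTags[i]) n++;
    }
    return n;
  }

  while (true) {
    var best = null;
    candidateRules.forEach(function (rule) {
      // score = number of errors the rule removes (negative if it adds some)
      var score = errors(tags) - errors(rule(words, tags));
      if (!best || score > best.score) best = { rule: rule, score: score };
    });
    // minScore should be at least 1 so the loop is guaranteed to terminate
    if (!best || best.score < minScore) break;
    tags = best.rule(words, tags);
    learned.push(best.rule);
  }
  return learned;
}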

Hugo

Hugo-ter-Doest commented 7 years ago

I implemented the algorithm from the paper by Ramshaw and Marcus. See pull request #353

Hugo

kkoch986 commented 7 years ago

Looks awesome! Just merged. I wanted to look at one other thing tonight, and then I'll get a version out on npm.

diegodorgam commented 7 years ago

Hi guys, @Hugo-ter-Doest I found a Brill corpus for Portuguese at this link: http://www.nilc.icmc.usp.br/nilc/download/corpusjournalistic.txt and tried to follow the readme.md to create the rules and lexicon files, using the templates

var templateNames = ["NEXT-TAG",
"PREV-WORD-IS-CAP",
"PREV-1-OR-2-OR-3-TAG",
"PREV-1-OR-2-TAG",
"PREV-TAG",
"NEXT-WORD-IS-CAP",
"CURRENT-WORD-IS-CAP",
"CURRENT-WORD-IS-NUMBER",
"CURRENT-WORD-IS-URL",
"CURRENT-WORD-ENDS-WITH",
"PREV-WORD-IS",
"NEXT-WORD-IS"];

but when I get to the part var Tester = require('natural.BrillPOSTrainer'); I keep getting an error:

Error: Cannot find module 'natural.BrillPOSTrainer'
    at Function.Module._resolveFilename (module.js:469:15)
    at Function.Module._load (module.js:417:25)
    at Module.require (module.js:497:17)
    at require (internal/module.js:20:19)
    at repl:1:14
    at sigintHandlersWrap (vm.js:22:35)
    at sigintHandlersWrap (vm.js:96:12)
    at ContextifyScript.Script.runInThisContext (vm.js:21:12)
    at REPLServer.defaultEval (repl.js:313:29)
    at bound (domain.js:280:14)

I have followed the training examples step by step and have loaded the natural libs, but I can't load this last section and can't execute var trainer = new Trainer();.

The corpus is being generated correctly, but I cannot get the templates and the ruleset to be generated. Any ideas?

Thanks in advance.

diegodorgam commented 7 years ago

BTW: I'm trying to use this POSTagger function in another project, a chatbot framework called HubotNatural, https://github.com/RocketChat/hubot-natural, for the rocket.chat platform; maybe you guys would like to take a look... The aim here is to use the BrillPOSTagger to identify entities and their values in a message sent to the chatbot... Hope you like it.

Thanks again.

Hugo-ter-Doest commented 7 years ago

Please post your code so I can try to reproduce what you are doing.

Also, have a look at the test file in the spec folder: spec/brill_pos_trainer_spec.js
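
One guess in the meantime, based on the error message: require('natural.BrillPOSTrainer') asks Node to resolve a package literally named natural.BrillPOSTrainer, which does not exist. Assuming the trainer is exported from the package's top level like the other classes, something along these lines should load it:

var natural = require('natural');      // load the package itself...

var Trainer = natural.BrillPOSTrainer; // ...then take the class off of it
var trainer = new Trainer();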

Hugo

diegodorgam commented 6 years ago

Great @Hugo-ter-Doest! Following the spec file, I was able to build the lexicon and ruleset files! Now I'm facing another problem: the ruleset file has special characters like ç é ê ã, and when I try to load it, it throws a SyntaxError, like this:

> var rules = new natural.RuleSet(rulesFilename);
{ [SyntaxError: Expected " ", "*", "//", "\n", "\r", "\r\n", "\t", [a-zA-Z_0-9_\-.,()] or end of input but "\xE9" found.]
  message: 'Expected " ", "*", "//", "\\n", "\\r", "\\r\\n", "\\t", [a-zA-Z_0-9_\\-.,()] or end of input but "\\xE9" found.',
  expected:
   [ { type: 'literal', value: ' ', description: '" "' },
     { type: 'literal', value: '*', description: '"*"' },
     { type: 'literal', value: '//', description: '"//"' },
     { type: 'literal', value: '\n', description: '"\\n"' },
     { type: 'literal', value: '\r', description: '"\\r"' },
     { type: 'literal', value: '\r\n', description: '"\\r\\n"' },
     { type: 'literal', value: '\t', description: '"\\t"' },
     { type: 'class',
       value: '[a-zA-Z_0-9_\\-.,()]',
       description: '[a-zA-Z_0-9_\\-.,()]' },
     { type: 'end', description: 'end of input' } ],
  found: 'é',
  offset: 126,
  line: 4,
  column: 30,
  name: 'SyntaxError' }

Any thoughts on this? Should the RuleSet method accept those special characters, or should I remove them from the corpus?

I've published a gist with all the training code, in case it helps: https://gist.github.com/diegodorgam/3315a9071f8a5e336c89d44c879f8ae1

Thanks again!

Hugo-ter-Doest commented 6 years ago

I understand that you were able to generate a rule set from the corpus and now want to read it back in. Then you get the error above.

I checked the parser for transformation rules in TF_Parser.pegjs, and I see that it cannot handle special characters:

parameter = identifier
identifier =
  characters: [a-zA-Z_0-9_\-\.,()]+ S_no_eol
  {
   var s = "";
   for (var i = 0; i < characters.length; i++) {
     s += characters[i];
   }
   return(s);
  }
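
For what it's worth, here is the idea in plain JavaScript (a sketch only; the real fix has to go into the grammar's character class, and the exact range may differ): accented Latin letters such as é and ç live in the Latin-1 Supplement block, U+00C0 to U+00FF:

// Sketch: widen the identifier character class with the Latin-1 Supplement
// range so accented letters no longer break parsing.
var identifierChar = /^[a-zA-Z0-9_\-.,()\u00C0-\u00FF]$/;

console.log(identifierChar.test('é')); // true
console.log(identifierChar.test('ç')); // true
console.log(identifierChar.test('%')); // false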

I will try to fix this over the weekend. Maybe you can provide some example rules with special characters so I can test it?

Hugo

diegodorgam commented 6 years ago

Great @Hugo-ter-Doest, thanks for your effort =) Yeah, I was able to generate the lexicon and ruleset files; they're available at https://github.com/diegodorgam/postagger in the pt_br-ruleset.txt file.

I've also found this RegExp reference for accented characters; maybe it can help: https://stackoverflow.com/questions/17480177/regex-for-matching-accent-characters

thanks again!

Hugo-ter-Doest commented 6 years ago

Created a pull request to solve this issue: #381. It now supports diacritics from the Latin-1 Supplement as defined in List_of_Unicode_characters.

And I found that predicates without parameters could not be parsed; I solved that issue as well.

Hugo

diegodorgam commented 6 years ago

Great @Hugo-ter-Doest, thanks for this.

I've applied the diff to my local repository of natural and changed the scripts to require it locally, just for testing, and now I'm able to load the ruleset.

But it does not seem to be working, though: every tagged word in the sentence comes back with the defaultCategory.

I've tried simplifying the ruleset by decreasing the number of templates, but still got the same results. If I don't specify the default category, my tags come back as null. I've updated the code in https://github.com/diegodorgam/postagger.

Here is what I'm doing after training: I've created a file called test.js to tag a sentence, which contains the following code:

var natural = require("../natural"); // local checkout with the diacritics fix applied
var path = require("path");

var rulesFilename = "./pt_br-ruleset.txt";
var lexiconFilename = "./pt_br-lexicon.json";
var defaultCategory = 'NN';

var lexicon = new natural.Lexicon(lexiconFilename, defaultCategory);
var rules = new natural.RuleSet(rulesFilename);
var tagger = new natural.BrillPOSTagger(lexicon, rules);
var sentence = "Antes de iniciarmos o estudo de origem da vida".split(' ');
console.log(JSON.stringify(tagger.tag(sentence)));

node test.js
[["Antes","NN"],["de","NN"],["iniciarmos","NN"],["o","NN"],["estudo","NN"],["de","NN"],["origem","NN"],["da","NN"],["vida","NN"]]

Weird, huh?

When I inspect the code, it seems that 'lexicon' and 'rules' are loaded correctly, but at the end of the code, all words come tagged as 'NN' (the defaultCategory).

The phrase that I am using is present in the original corpus.txt, so I'm sure it was part of the training data.

Any ideas? Did I miss something in the way here?

Hugo-ter-Doest commented 6 years ago

It took me a while to see what was going on. Then I discovered that the lexicon had only 3 entries...

Your lexicon has the following format:

{
  "lexicon": {
    "1": [
      "NC"
    ],
    ...
  },
  "defaultCategory": "NN",
  "defaultCategoryCapitalised": "NP"
}

When reading in the lexicon, it expects only a mapping from words to arrays of categories:

{
  "1": [
    "NC"
  ],
  ...
}

So when you save the lexicon to a file, only save lexicon.lexicon.
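
In other words, something like this when writing the file (a sketch, assuming lexicon is the trained Lexicon object from your training script):

var fs = require('fs');

// Serialize only the inner word -> categories map, not the whole object
fs.writeFileSync('pt_br-lexicon.json', JSON.stringify(lexicon.lexicon, null, 2));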

After that the tagger works.

Hugo

diegodorgam commented 6 years ago

Awesome @Hugo-ter-Doest, it works perfectly! I can't tell you how much I appreciate your effort on this! Thank you very much for everything!

kkoch986 commented 6 years ago

FYI: just merged the PR. Will bump the version once I finish getting caught up on everything.