UniversalDependencies / UD_Portuguese-PUD

Parallel Universal Dependencies.

tokenization of money question #11

Closed vcvpaiva closed 3 years ago

vcvpaiva commented 3 years ago

newdoc id = n01003
sent_id = n01003007
text = $5,000 por pessoa, o máximo permitido.
text_en = $5,000 per person, the maximum allowed.
1 $5,000 NUM CD 0 root OrigForm=$5000

This should be tokenized like the English version:

newdoc id = n01003
sent_id = n01003007
text = $5,000 per person, the maximum allowed.
1 $ $ SYM $ 0 root 0:root SpaceAfter=No
2 5,000 5,000 NUM CD NumType=Card 1 nummod 1:nummod

This doesn't seem to be a real grammatical difference between the two languages, just a notational one?

vcvpaiva commented 3 years ago

Other sentences with the same problem:

  1. newdoc id = n01005 sent_id = n01005023 text = Por comparação, custou $103.7 milhões construir o interior da estação de metrô do NoMa, que abriu em 2004. text_en = By comparison, it cost $103.7 million to build the NoMa infill Metro station, which opened in 2004.

  2. sent_id = n01036033 text = Atualmente, a multa máxima que a RECO pode cobrar de um agente é $25,000. text_en = Currently, the maximum fine RECO can levy against an agent is $25,000.

  3. newdoc id = n01043 sent_id = n01043005 text = Os executivos também receberam a chamada "remuneração por desempenho" por corresponderem ou superarem as expectativas, dividindo uma quantia de $1,5 milhões entre eles, ou em média, $15.000 para cada um deles. text_en = The executives also received so-called "performance pay" for succeeding or surpassing expectations, sharing a pot of $1.5 million among them, or an estimated $15,000 each on average.

  4. sent_id = n01043014 text = O orçamento anual é maior que $1.4 bilhões e emprega mais de 6.000 pessoas. text_en = Its annual budget is more than $1.4 billion, and it employs more than 6,000 people.
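A quick way to list every remaining case is to scan the CoNLL-U file for tokens whose FORM fuses "$" with a number. The following is only a minimal sketch: the function name `find_fused_money` and the pattern "a `$` immediately followed by a digit" are assumptions made for illustration, not part of any UD tool.

```python
import re

# Assumption: a fused money token is a FORM starting with "$" followed by a digit.
FUSED_MONEY = re.compile(r"^\$\d")
SENT_ID = re.compile(r"#\s*sent_id\s*=\s*(\S+)")


def find_fused_money(conllu_text):
    """Return (sent_id, token_id, form) for each token whose FORM fuses "$" and a number."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            sent_id = None
            continue
        if line.startswith("#"):
            match = SENT_ID.match(line)
            if match:
                sent_id = match.group(1)
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Not a well-formed token line; skip it rather than guess.
            continue
        if FUSED_MONEY.match(cols[1]):
            hits.append((sent_id, cols[0], cols[1]))
    return hits
```

Calling it on the contents of pt_pud-ud-test.conllu should return one tuple per fused token, which can then be compared against the sentences listed above.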

dan-zeman commented 3 years ago

Agreed, it should be separate. The other two Portuguese treebanks also separate "$" from the amount. (But those treebanks disagree on whether "US$" is one token or two: "US" + "$".)
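For illustration, the split could be sketched as below, following the English PUD pattern quoted earlier in the thread: "$" becomes a SYM token that keeps the original head and relation, and the amount attaches to it as nummod. This is a sketch under stated assumptions, not the change actually made in the treebank: the function name `split_fused_money` is hypothetical, the sentence is assumed to have plain integer IDs (no multiword-token ranges or empty nodes), and DEPS is assumed to be "_" so that no enhanced dependencies need renumbering.

```python
import re

# Assumption: a fused money token is a FORM starting with "$" followed by a digit.
FUSED_MONEY = re.compile(r"^\$\d")


def split_fused_money(rows):
    """Split "$<amount>" tokens in one sentence into "$" + "<amount>".

    rows: list of 10-column token rows (lists of str) with plain integer IDs.
    Returns a new list of rows with IDs and HEADs renumbered.
    """
    # First pass: map every old ID to its new ID. A split token keeps its
    # slot for "$"; everything after it shifts right by one.
    new_id = {0: 0}
    offset = 0
    for row in rows:
        old = int(row[0])
        new_id[old] = old + offset
        if FUSED_MONEY.match(row[1]):
            offset += 1

    # Second pass: emit the rows with remapped IDs and heads.
    out = []
    for row in rows:
        tid = new_id[int(row[0])]
        head = new_id[int(row[6])]
        if FUSED_MONEY.match(row[1]):
            amount = row[1][1:]
            # "$" inherits the original HEAD and DEPREL.
            out.append([str(tid), "$", "$", "SYM", "$", "_",
                        str(head), row[7], "_", "SpaceAfter=No"])
            # The amount keeps the original XPOS, FEATS and MISC (simplification).
            out.append([str(tid + 1), amount, amount, "NUM", row[4], row[5],
                        str(tid), "nummod", "_", row[9]])
        else:
            out.append([str(tid), *row[1:6], str(head), *row[7:]])
    return out
```

Dependents of the original token end up attached to "$", since that token takes over the original ID. The MISC column of the original token is copied to the amount unchanged, so attributes such as OrigForm would still need manual review.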

arademaker commented 3 years ago

Commit 41a1268 resolved this issue. I was not able to run the check_files.pl script, but I checked the file with the Python validator:

% python3 ~/work/ud-tools/validate.py --lang pt pt_pud-ud-test.conllu
*** PASSED ***

I wonder what the difference is between the Perl and Python validation scripts. BTW, it looks like @dan-zeman fixed all the remaining validation errors?

Commit 6955e9c updated stats.xml.

dan-zeman commented 3 years ago

The Perl script checks the contents of the repository, file naming conventions, any unexpected extra files and the contents of the README file. If it did not report any errors before and you only modified the contents of the CoNLL-U file, check_files.pl should not report any new errors.

And yes, I fixed the other errors :-) (That is, those identified by the script. Not those identified by @vcvpaiva.)