SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

[Request] Add Firefox pt-BR dictionary #3419

Closed ticao2 closed 4 years ago

ticao2 commented 5 years ago

If possible, please add the Firefox pt-BR dictionary to the installer. I know there is a tool for adding new dictionaries, but the pt-BR dictionary available there appears to be the same one LibreOffice uses: an immense dictionary of 311630 words. It seems to me that a dictionary containing only everyday words would suit SubtitleEdit better.

Source: http://kb.mozillazine.org/Dictionaries
I found 4 pt-BR versions of the Firefox dictionary.

3 versions here: https://addons.mozilla.org/pt-BR/firefox/language-tools/
Firefox Spelling Old: https://addons.mozilla.org/en/firefox/addon/ortografia-br/
Firefox Large, with 311630 words: https://addons.mozilla.org/pt-BR/firefox/addon/verificador-ortogr%C3%A1fico-para-p/
Firefox Medium, with 41308 words: https://addons.mozilla.org/pt-BR/firefox/addon/corretor/

1 version here: http://dictionaries.mozdev.org/installation.html
Firefox Small Original, with 25165 words. This is the best for us: http://downloads.mozdev.org/dictionaries/spell-pt-BR.xpi

I have already sent the files to your email: pt-BR.aff + pt-BR.dic + README_pt_BR.txt. If you need anything else, please ask. Thank you for your attention.

gabriellluz commented 5 years ago

I don't think this is a good choice, since Mozilla's dictionary has not been updated since 2003 and we have had some huge changes in recent years, afaik.

gabriellluz commented 5 years ago

I'm Brazilian btw.

OmrSi commented 5 years ago

https://github.com/SubtitleEdit/subtitleedit/issues/3323

ticao2 commented 5 years ago

I just checked, and it seems that the smallest dictionary (25165 words) does not follow the rules of the Brazil-Portugal Orthographic Agreement.
But the medium one (41308 words) apparently does.
Its Firefox add-on page states that it follows the rules of the Agreement:
https://addons.mozilla.org/pt-BR/firefox/addon/corretor/
If you could offer it as an installation option, that would be fine.

gabriellluz commented 5 years ago

If you're really sure about what you're saying, I second that.

ticao2 commented 5 years ago

I use Notepad++ regularly, with Vero (311630 words) set as its spell-check dictionary. I opened the .dic file of the add-on dictionary, now with 41824 words, in Notepad++. The entries highlighted as "errors" were:
Some legal jargon: absurdum, variations of the verb adequar, aditivar
Some personal names: Abercrombie, Abraham, Abrams, Abranches, Abravanel, Adams, Ademilson, Adilson, Adoniran
Some other proper names: Abba, Abbey, Adamantium
Some acronyms: Abert, abs, admin
Some slang: achômetro
Some foreign words: about, academy, accountability, adieu, adiós
Some brands: Absolut, Access, Acer, Acrobat, Activia, Adblock
Some rare verbs: acariocar

Many proper nouns exist in both dictionaries: names of people, of American and Brazilian states, and of countries. Half of the rules, that is, half of the .aff file, deal with próclise, ênclise and mesóclise (pronoun placement before, after and inside the verb). Michel Temer, alive, thanks you. And Jânio Quadros, deceased, does too.

I looked for some common words in the old spelling: vôo, enjôo, idéia, pára-quedas, para-quedas, conseqüência, agüentar, seqüestro, feiúra. I did not find any of them; I found only the new spellings.
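The spot check described above can also be scripted. Below is a minimal sketch in Python, assuming the .dic file follows the usual Hunspell layout (an entry count on the first line, then one `word/FLAGS` entry per line) and is UTF-8 encoded. The function names and the `encoding` parameter are illustrative only; the real encoding is whatever the `SET` line of the matching .aff file declares. The sketch looks only at the stems listed in the file, so forms generated by affix rules are not covered.

```python
from pathlib import Path

# Pre-reform spellings quoted in the comment above.
OLD_SPELLINGS = [
    "vôo", "enjôo", "idéia", "pára-quedas", "para-quedas",
    "conseqüência", "agüentar", "seqüestro", "feiúra",
]


def read_stems(dic_path, encoding="utf-8"):
    """Return the set of stems listed in a Hunspell .dic file.

    Assumption: the first line is the entry count and is skipped.
    Affix flags (after "/") and tab-separated extra fields are dropped.
    """
    lines = Path(dic_path).read_text(encoding=encoding).splitlines()
    stems = set()
    for line in lines[1:]:
        stem = line.split("/", 1)[0].split("\t", 1)[0].strip()
        if stem:
            stems.add(stem)
    return stems


def find_old_spellings(dic_path, candidates=OLD_SPELLINGS, encoding="utf-8"):
    """Return the candidate spellings that appear as stems in the file."""
    stems = read_stems(dic_path, encoding)
    return [word for word in candidates if word in stems]
```

Calling `find_old_spellings("pt-BR.dic")` on the file under test and getting an empty list back would match the manual result reported here, within the limits noted above.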

Conclusion: I believe that the Medium dictionary, with 41824 words, is in accordance with the Orthographic Reform.

niksedk commented 5 years ago

Do you have a link for the medium dictionary (preferably zip or oxt)? What should it be called in SE? "Portuguese (Brazilian new spelling)"?

Also, I'll update SE Portuguese (Brazilian) to latest: https://extensions.libreoffice.org/extensions/vero-verificador-ortografico-e-hifenizador-em-portugues-do-brasil

ticao2 commented 5 years ago

1. Which name to use
No, do not use "new spelling" as the name. Both dictionaries follow the rules of the Orthographic Agreement, so that is not what sets them apart. The difference is the number of words.

By convention we call it a dictionary, but in fact it is not one: the words come without any explanation of their meaning. What we have is a large list of words. And the larger this list is, the greater the chance that a typing error is accepted, because the mistyped form may itself be a valid word. For example, three Portuguese words sound the same but differ in spelling and meaning: cessão, sessão and seção.
cessão de direitos = assignment of rights
sessão de cinema = movie session
seção administrativa = administrative section

This is a famous example, but there are many other pitfalls:
mau and mal = bad and evil
senso and censo = sense and census
cela and sela = cell and saddle
mandado and mandato = warrant and mandate
concerto and conserto = concert and repair

A typo and we may have an element from the periodic table, or a rock type, a legal term, a disease or a bacterium. The word will be accepted, or rather will not be highlighted or underlined, but it will be wrong.
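The argument above can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: the two word lists are tiny made-up samples, not the real dictionaries, and a "typo" is modelled as a single-character edit (deletion, transposition, substitution or insertion), which is a simplification. The function names are hypothetical.

```python
def single_edit_variants(word, alphabet):
    """Return every string one edit away from word (word itself excluded)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | swaps | replaces | inserts) - {word}


def silently_accepted(intended, word_list):
    """List the one-edit typos of intended that the word list would accept."""
    words = set(word_list)
    # The alphabet is taken from the list itself to avoid hard-coding one.
    alphabet = {ch for w in words for ch in w}
    return sorted(single_edit_variants(intended, alphabet) & words)


# Toy lists: the large one adds the look-alike words from the examples above.
small_list = {"mal", "sela", "sessão"}
large_list = small_list | {"mau", "cela", "cessão"}

for intended in sorted(small_list):
    print(intended,
          silently_accepted(intended, small_list),
          silently_accepted(intended, large_list))
```

With the small list, no single-edit typo of these three words is a valid entry; with the large list, each of them has one look-alike that a spell checker would accept without highlighting it.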

2. Link
I'll look for a link to the dictionary.

ticao2 commented 5 years ago

I believe I have found the original dictionary that ships with Firefox installations:
https://hg.mozilla.org/releases/l10n/mozilla-aurora/pt-BR/file/81d9d556fa78/extensions/spellcheck/hunspell
That page has a menu with a ZIP item:
https://hg.mozilla.org/releases/l10n/mozilla-aurora/pt-BR/archive/81d9d556fa78.zip/extensions/spellcheck/hunspell/
There are other items, but I confess I do not know whether they are better.

I need to run a test in Notepad++ with the .dic file from the package I found. It has 41308 words. In the Readme.txt file I found a link to the maintainers: http://natura.di.uminho.pt/ I'll take a look.

As for the name, I believe we can make a reference to Firefox and/or the size.

<EnglishName>Portuguese-BR (Firefox)</EnglishName>
<NativeName>Português-BR (Firefox)</NativeName>

<Description>Dicionário Português-BR - Firefox (pequeno  41308 palavras)</Description>
<Description>Dictionary Portuguese-BR - Firefox (small  41308 words)</Description>

niksedk commented 5 years ago

The readme says it's from Portugal?

ticao2 commented 5 years ago

Yes, the Readme says it's from Portugal, although it is in the pt-BR folder. :-)
So, in Notepad++, I checked the .dic file (41308 words) against the Medium dictionary (41824 words) that I got from the add-on. I went through the whole letter A and found these "mistakes":

abrido/fp~ [I think it's a verb]
aclamativo
Afif [name of a person]
poró (alho-poró) [name of a vegetable]
Al-Qaeda [proper name]
Asha [name of a person]
ateve (ateve-se)
au-au [children's slang for dogs]
austro (austro-húngaro/fp)
Baidu [proper name]

These are just differences between versions of the word list.
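A comparison like this can also be made directly between the two .dic files, without going through a spell checker. The following is a rough sketch in Python under the same assumptions about the file format (entry count on the first line, `word/FLAGS` entries, UTF-8 unless the .aff file says otherwise). The function names and the `prefix` parameter are made up for illustration. It compares stems only, so a word that one list stores explicitly and the other derives through an affix rule will show up as a difference.

```python
from pathlib import Path


def read_stems(dic_path, encoding="utf-8"):
    """Return the set of stems in a Hunspell .dic file (first line skipped)."""
    lines = Path(dic_path).read_text(encoding=encoding).splitlines()
    stems = set()
    for line in lines[1:]:
        # Drop affix flags after "/" and any tab-separated extra fields.
        stem = line.split("/", 1)[0].split("\t", 1)[0].strip()
        if stem:
            stems.add(stem)
    return stems


def only_in_first(dic_a, dic_b, prefix="", encoding="utf-8"):
    """Return the sorted stems of dic_a that are missing from dic_b.

    prefix restricts the result to stems starting with it (case-insensitive),
    e.g. prefix="a" to review only the letter A.
    """
    missing = read_stems(dic_a, encoding) - read_stems(dic_b, encoding)
    wanted = prefix.lower()
    return sorted(w for w in missing if w.lower().startswith(wanted))
```

For example, `only_in_first("firefox.dic", "addon.dic", prefix="a")` would list the letter-A stems that exist only in the first file; both file names here are placeholders.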

The Medium dictionary (the add-on) also has a readme, which says that the dictionary was based on the pt-PT version. So I guess it's just a slip by whoever provided that readme: it is in the pt-BR folder but states that it is pt-PT.

Comment: all these dictionaries include in their word lists names of people, US and Brazilian state names, acronyms, brands (Acer, Adobe), words from other languages (adieu, adios)... It's a big mess. All of them are heavily influenced by legal professionals, so all the legal terms are there. I could not find a list of the words most used in everyday life by ordinary people. Without the names and brands, it would probably be half the size.

ticao2 commented 5 years ago

Hello @niksedk
I do not want to bother.
Is there any chance that this dictionary will be added to the options?

gabriellluz commented 5 years ago

> Hello @niksedk I do not want to bother. Is there any chance that this dictionary will be added to the options?

Did you check whether that new dictionary complies with the new orthographic agreement?

If you did, then I really want that dictionary to be included.

ticao2 commented 5 years ago

Yes, I checked.
As I explained earlier, I used Notepad++ with the Vero spell checker, the large one with 311630 words. I opened the .dic file as a text file in Notepad++ and checked the "mistakes". What Vero flagged were not errors, only words such as proper names, acronyms, slang, etc.
Nothing at odds with the Orthographic Agreement.
I checked all the words that begin with the letter A.
In addition, I searched for some spellings that are no longer valid under the Orthographic Agreement:
vôo, enjôo, idéia, pára-quedas, para-quedas, conseqüência, agüentar, seqüestro, feiúra.
I did not find any of them, only the new spellings.

Therefore, I believe that I can say that this dictionary follows the rules of the New Orthographic Agreement.

ticao2 commented 4 years ago

Any chance of adding this dictionary as an option to download and install?
Then we would have both options: the current dictionary (LibreOffice, 311630 words)
and the new one (Firefox, 41308 words).

<EnglishName>Portuguese-BR (Firefox)</EnglishName>  
<NativeName>Português-BR (Firefox)</NativeName>  

Link https://hg.mozilla.org/releases/l10n/mozilla-aurora/pt-BR/file/81d9d556fa78/extensions/spellcheck/hunspell

Download https://hg.mozilla.org/releases/l10n/mozilla-aurora/pt-BR/archive/81d9d556fa78.zip/extensions/spellcheck/hunspell/

niksedk commented 4 years ago

SE includes the dictionary linked from LibreOffice: https://extensions.libreoffice.org/extensions/vero-verificador-ortografico-e-hifenizador-em-portugues-do-brasil

After googling a bit, I found a newer one here: https://github.com/elastic/hunspell/tree/master/dicts/pt_BR

Word count is not the most important thing about dictionaries - so please test some subtitles with different dictionaries and compare the results!

xylographe commented 4 years ago

The only difference between these two is the version number: "Versão 2.1.4" versus "Versão 3.2". The actual content of aff/dic is identical.

ticao2 commented 4 years ago

> Word count is not the most important thing about dictionaries

By convention we call it a dictionary, but in fact it is not one: the words come without any explanation of their meaning. What we have is a large list of words. And the larger this list is, the greater the chance that a typing error is accepted, because the mistyped form may itself be a valid word. A typo and we may have an element from the periodic table, or a rock type, a legal term, a disease or a bacterium. The word will be accepted, or rather will not be highlighted or underlined, but it will be wrong.

It seems to me that the dictionary needs to have only the most used words in everyday life. There is no need to have rare and obscure words of restricted use in different areas of knowledge.

> so please test some subtitles with different dictionaries and compare the results!

I have tested and used it several times. That is why I made this request. But it is not a serious problem: it just means that every time I install a new version, I have to copy the 2 files from the Firefox dictionary into the appropriate folder.
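That manual step could be automated with a small script. This is a hypothetical sketch, not something SubtitleEdit provides: the function name is made up, the location of the dictionaries folder is passed in by the caller because it is not stated in this thread, and the `pt-BR` base name is taken from the file names mentioned earlier (the name the program expects may differ).

```python
import shutil
from pathlib import Path


def install_dictionary(source_dir, dictionaries_dir, base_name="pt-BR"):
    """Copy <base_name>.aff and <base_name>.dic into dictionaries_dir.

    Both files are checked first so that a missing file does not leave
    a half-installed dictionary behind.
    """
    source_dir = Path(source_dir)
    dictionaries_dir = Path(dictionaries_dir)
    sources = [source_dir / f"{base_name}{suffix}" for suffix in (".aff", ".dic")]
    for src in sources:
        if not src.is_file():
            raise FileNotFoundError(src)
    dictionaries_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves file timestamps where the platform allows it.
    return [Path(shutil.copy2(src, dictionaries_dir / src.name)) for src in sources]
```

It would be run once after each upgrade, with the folder holding the two saved files as `source_dir` and the program's dictionary folder as `dictionaries_dir`.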

xylographe commented 4 years ago

Just a thought: perhaps a community plug-in could offer dictionaries from different sources and show a little more information about each one: its maintainers, its pros and cons, and, of course, a link to a recent version.