Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Arabic (Gulf) (afb) #1084

Closed RyckRichards closed 8 years ago

RyckRichards commented 8 years ago

[for dev.tatoeba.org] CALL add_new_language('afb', 4235);

[for tatoeba.org] CALL add_new_language('afb', 5877);

http://dev.tatoeba.org/eng/sentences_lists/show/4235

damascene commented 8 years ago

To which dialect in the Gulf does this new language belong? Is the next step to create British, Australian, South African and Canadian English to cover all the dialects?

RyckRichards commented 8 years ago

Since Gulf Arabic has an ISO 639-3 code, and some members were posting sentences in this language and requested it, it was added. British/Canadian English don't have one.

damascene commented 8 years ago

Well, it's slang. Arabic (ara) is the official language of all the Arab countries and the language used in the Arab world. There is nothing called Gulf Arabic there. I've lived in the Gulf.

And the highest Arab institution, the Arab League, where leaders from all Arab countries gather, has Arabic (ara) as its official language: https://en.wikipedia.org/wiki/Arab_League

The sentences you added, as far as I can see, are slang, and there are different slang words in every Arab country. Could you please provide a more solid reason to have this language? Could you invite the users who requested this language to comment here?

trang commented 8 years ago

@damascene, as of today, we don't take responsibility for language classification. We rely on the ISO 639-3 classification. As long as a language is present in the ISO 639-3 classification, we consider that it is fine to add it as a language in Tatoeba. Since Gulf Arabic is recognized as a language in this classification (represented by the code afb), by default we won't oppose adding it to Tatoeba.

If you think that Gulf Arabic should not be considered as a language, we would need some academic reference.

If you think that the sentences in the list are not representative of the Gulf Arabic language, that's another problem.

SafaAlfulaij commented 8 years ago

After looking at the current statistics for Arabic "dialects" in Tatoeba, I can say that adding new dialects is useless, as these "languages" contain only a few sentences. If you consider adding an "Arabic Gulf" "language", then how would that differ from a Bahraini dialect or an Omani dialect? Or, going further, from a sentence that is spoken in a specific town in Bahrain and another town in Saudi Arabia? If we add any language in ISO 639-3 whenever someone requests it, we'll end up with thousands of languages, of which just 5% are important (main/standard) and the rest are dialects that contain only a few sentences.

The user who requested this addition is also not Arabian at all. Dialects should be requested by native speakers, and the sentences should be written by them (if adding dialects were the right thing to do). One more thing: just two sentences (among all of the ones in the list) are really "Gulf"; the others are from other dialects. What would you think of me if I created 10 accounts, requested some Arabic dialects, and filled them with one or two sentences each?

Very little effort is being put into the Arabic language, and we are adding more and more dialects that split that effort into scattered dialect sentences that are useless and look bad. What I propose is adding a language named "Arabic dialects" that contains all the dialects, separated by tags. When a dialect reaches (let's say) 5000 sentences, we can add it as a separate "dialect", because it would be messy to keep it there.

trang commented 8 years ago

The fact that some languages have barely any sentences is not a justification for rejecting the language. There will be a day when Tatoeba supports every language in the ISO 639-3 classification (or another classification if there is a better one). Ideally, we should have imported the full list of languages, but we can't do it because we would first need to re-adapt the language dropdown, so that users can search for and find their languages more easily. We would also need to fix some parts of the code to handle the case of languages that have no sentences. But there's no reason why we shouldn't support all the languages by default.

To give you a comparison, if you go to your profile, you can mention which country you live in or are from. You can select this country from a list. Except in some rare cases, you won't have to make a request for us to add a country. Some countries currently have no users at all and maybe never will, but it doesn't matter. We won't remove these countries from the list just because of that. And if today you notice that a country is missing from this list, even if you don't live in that country, you can let us know and we will add it.

For some reason, cueyayotl would like Gulf Arabic to be added. The fact that he's not a native speaker of the language, and the fact that out of the 10 sentences he added only 2 are in Gulf Arabic, are definitely not desirable. But this is not a good reason for us to oppose the addition of this language.

If you tell me however that having all these "Arabic dialects" makes you reluctant to contribute to Tatoeba, or causes bad user experience, and you can rationally explain why, I can try to review our rules regarding language requests. Right now I can't see why. Adding more Arabic languages is not going to split the effort. Arabic contributors can simply ignore the other Arabic languages, and focus on contributing purely in standard Arabic.

damascene commented 8 years ago

Apparently more than 30 languages are identified as Arabic languages in ISO 639-3: http://www-01.sil.org/iso639-3/documentation.asp?id=ara I have really never heard of any of those being considered a language among Arabs, though we sometimes have a hard time understanding some dialects in North Africa, because they suffered from an aggressive effort to erase their language and religion, so many people there mix many French words into their Arabic. There is also Egyptian Arabic: you may find some users from Egypt who may be interested in creating a language of their own, as they did in Wikipedia.

Do you think that it is right to accept a language request without consulting its community? I find the difference between Egyptian Arabic and other Arabic dialects is like the difference between British, Scottish and Irish English. It's not fair for someone outside our community to decide what our language should be named or called, or how many variations it should have.

I hope you really understand our shock at having those dialects considered languages while English, with all its dialects, is considered one solid language. Surprise is too small a word for this. We do not know how ISO 639-3 was created, but here we are talking to a community project. Hopefully it's not odt vs oxml.

trang commented 8 years ago

@damascene, you know, language classification is not easy. You can read the following Wall threads to get an idea:

There will always be disagreements about what should be a language and what shouldn't be. When people disagree, what should we do? In Tatoeba's case, we tried to find a standard we could rely on, because Tatoeba is not in a stage where we can decide on our own how to draw the borders between languages. The ISO 639-3 is an attempt to define these borders. It is not perfect, but it's the best standard we've found (not like there are a lot of standards out there anyway).

We assume that if they defined Gulf Arabic as a language, there were reasons for this. They consulted Arab linguists and looked into Arab literature and evaluated that it can be considered as a language. They hopefully didn't just decide it on a flip of a coin.

That being said, there is no rush for adding Gulf Arabic as a supported language in Tatoeba. We can put his request on hold for now. I also proposed an alternative in https://github.com/Tatoeba/tatoeba2/issues/1079#issuecomment-204112351 which should be considered before there is further discussion about adding Gulf Arabic.

cueyayotl commented 8 years ago

I think TRANG did an excellent job explaining. As the "Non-Arabian" who requested the language, let me answer a few more of these points. First, we are all aware that Modern Standard Arabic arb is the ONLY Arabic language considered official in ANY country. I explain my reason for requesting Gulf Arabic to be added in the following point:

After looking at the current statistics for Arabic "dialects" in Tatoeba, I can say that adding new dialects is useless, as these "languages" contain only a few sentences... What would you think of me if I created 10 accounts, requested some Arabic dialects, and filled them with one or two sentences each?

Since we follow ISO 639-3 classification standards, we consider them Arabic languages (with dialects and subdialects, as in Kuwaiti Gulf and UAE Gulf for Gulf Arabic). Sure, having only a few sentences in a language does not look very good, but all it takes is ONE dedicated user to add sentences in their language in order for it to thrive (Berber and Macedonian are our biggest examples of this). We have had many users who natively speak Gulf Arabic come and go, who weren't able to contribute in Gulf Arabic because we did not have it available; I believe that having only "Arabic" tends to imply "Modern Standard Arabic" and that is what most of our users have contributed in. If I were to get a job in the UAE and go live there, I would want to learn the UAE dialects of the Gulf Arabic Language rather than Modern Standard Arabic (though I would like to know SOME MSA). There are thousands of books in arb, but very few in afb (or abv or acx [see below]), why WOULDN'T we add them?

If you consider adding an "Arabic Gulf" "language", then how would that differ from a Bahraini dialect or an Omani dialect? Or, going further, from a sentence that is spoken in a specific town in Bahrain and another town in Saudi Arabia?

*The Bahraini Arabic language abv or the Omani Arabic language acx. We will gladly add these as well. And, if a variety is from a town in Saudi Arabia it would be in afb, acw or ars. If we get a chance, there is NO REASON why not to document their language. We have tags available here on Tatoeba to identify dialects, subdialects and geographical regions. Tatoeba also serves to document endangered or minority languages.

One more thing: just two sentences (among all of the ones in the list) are really "Gulf"; the others are from other dialects.

These all came from someone in the UAE. And yes, some of them are valid as Iraqi Arabic or Levantine Arabic as well; I will let native speakers add them in their language, now that we have them available on Tatoeba. Since we need sentences in order to add a language, I had to add just a few; I have no plans to add any more and will leave it to native speakers of Gulf Arabic Language to continue adding.

Very little effort is being put into the Arabic language, and we are adding more and more dialects that split that effort into scattered dialect sentences that are useless and look bad.

Why not have both arb and ALL other Arabic languages? Why exclude any of them? There are loads of material on arb online, and thousands of books. If someone is ONLY interested in learning arb, they have so much material to work with. But, what if I wanted to learn Spoken Omani Arabic acx before going to Oman? Where will I buy my books? Tatoeba has huge potential in supplying linguistic information in underrepresented languages... as I said before, all it takes is ONE dedicated user to create a decent corpus for others to master their languages.

Apparently more than 30 languages are identified as Arabic languages in ISO 639-3: http://www-01.sil.org/iso639-3/documentation.asp?id=ara

I hope we can add them ALL someday.

Do you think that it is right to accept a language request without consulting its community?

Of course, if the language has a valid ISO 639-3 code. Otherwise, no. We get dozens of new users every day, some of whom speak languages we haven't added yet. Most of THOSE will not know how to contribute in "unknown language" or what the procedure is to have their language added. Having contacted many of those users, I have guided them on the site and was able to have thousands of sentences added in dozens of languages. Gulf Arabic is one that constantly slips by us, and I could not let it go on any longer.

I hope you really understand our shock at having those dialects considered languages while English, with all its dialects, is considered one solid language.

TRANG answered this very well, but I will leave you with a link in case you wish to contact the authorities on the ISO 639-3 codes: sil.org/iso639-3 Having Arabic considered as one language is linguistically similar to having not just English, but ALL Germanic languages from German to Swedish to Icelandic to Faroese considered as a SINGLE language, or considering "Chinese" (including Mandarin, Wu, Yue, Hakka, Ping, etc.) as a single language. Some people could, but others would argue adamantly against this; language classification is not easy.

loolmeh commented 8 years ago

There is also Egyptian Arabic: you may find some users from Egypt who may be interested in creating a language of their own, as they did in Wikipedia.

Tatoeba added it years ago. I've briefly contributed to that effort.

I find the difference between Egyptian Arabic and other Arabic dialects is like the difference between British, Scottish and Irish English.

Feeling they are that close and studying how close they are, are two separate endeavors. Tatoeba imho shouldn't dabble with catering for the feelings of a sociopolitical project or another. It's an issue of evidence and scientific study that Tatoeba will just have to delegate to qualified institutions.

+1 for adding afb

damascene commented 8 years ago

@cueyayotl First of all, we do not agree that Modern Standard Arabic is the Arabic. Arabic is just Arabic, which has the language code AR and the 639-3 code ara, not arb.

Apparently you do not understand that those are not called languages by Arabs. They are slang: no grammar, no official recognition, very few books if any. Every person in any Arab country can write in Arabic, as it's the official language in schools and universities.

What will you do if you find a job in Ireland or South Africa? I'm quite sure you will not find yourself speaking like the locals if you learn from sentences in Tatoeba.

As I suggested in https://github.com/Tatoeba/tatoeba2/issues/1079#issuecomment-204771184 you can just create something called Arabic slangs and put all the words you like there with their country tag if you like.

RyckRichards commented 8 years ago

According to SIL, the organization that "regulates" the ISO codes and whose classification we've used to decide whether to add languages, Arabic has the following individual languages:

The individual languages within this macrolanguage are:

Algerian Arabic [arq]
Algerian Saharan Arabic [aao]
Babalia Creole Arabic [bbz]
Baharna Arabic [abv]
Chadian Arabic [shu]
Cypriot Arabic [acy]
Dhofari Arabic [adf]
Eastern Egyptian Bedawi Arabic [avl]
Egyptian Arabic [arz]
Gulf Arabic [afb]
Hadrami Arabic [ayh]
Hijazi Arabic [acw]
Libyan Arabic [ayl]
Mesopotamian Arabic [acm]
Moroccan Arabic [ary]
Najdi Arabic [ars]
North Levantine Arabic [apc]
North Mesopotamian Arabic [ayp]
Omani Arabic [acx]
Saidi Arabic [aec]
Sanaani Arabic [ayn]
Shihhi Arabic [ssh]
South Levantine Arabic [ajp]
Standard Arabic [arb]
Sudanese Arabic [apd]
Sudanese Creole Arabic [pga]
Ta'izzi-Adeni Arabic [acq]
Tajiki Arabic [abh]
Tunisian Arabic [aeb]
Uzbeki Arabic [auz]

Hopefully, we'll add them all.

(Source: http://www-01.sil.org/iso639-3/documentation.asp?id=ara / http://www-01.sil.org/iso639-3/macrolanguages.asp )

Unlike English - http://www-01.sil.org/iso639-3/documentation.asp?id=eng
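As a rough illustration of the macrolanguage relationship described above, here is a minimal Python sketch. The table is copied from the list quoted in this comment and has not been re-checked against SIL's current data; the names `ARABIC_INDIVIDUAL_LANGUAGES` and `macrolanguage_of` are hypothetical and are not part of Tatoeba's code.

```python
from typing import Optional

# Individual languages under the ISO 639-3 macrolanguage "ara",
# as listed in the comment above (assumption: the list is complete
# and current; it is not verified against SIL here).
ARABIC_INDIVIDUAL_LANGUAGES = {
    "arq": "Algerian Arabic",
    "aao": "Algerian Saharan Arabic",
    "bbz": "Babalia Creole Arabic",
    "abv": "Baharna Arabic",
    "shu": "Chadian Arabic",
    "acy": "Cypriot Arabic",
    "adf": "Dhofari Arabic",
    "avl": "Eastern Egyptian Bedawi Arabic",
    "arz": "Egyptian Arabic",
    "afb": "Gulf Arabic",
    "ayh": "Hadrami Arabic",
    "acw": "Hijazi Arabic",
    "ayl": "Libyan Arabic",
    "acm": "Mesopotamian Arabic",
    "ary": "Moroccan Arabic",
    "ars": "Najdi Arabic",
    "apc": "North Levantine Arabic",
    "ayp": "North Mesopotamian Arabic",
    "acx": "Omani Arabic",
    "aec": "Saidi Arabic",
    "ayn": "Sanaani Arabic",
    "ssh": "Shihhi Arabic",
    "ajp": "South Levantine Arabic",
    "arb": "Standard Arabic",
    "apd": "Sudanese Arabic",
    "pga": "Sudanese Creole Arabic",
    "acq": "Ta'izzi-Adeni Arabic",
    "abh": "Tajiki Arabic",
    "aeb": "Tunisian Arabic",
    "auz": "Uzbeki Arabic",
}


def macrolanguage_of(code: str) -> Optional[str]:
    """Return "ara" if `code` is one of the listed individual languages.

    Hypothetical helper: it only knows about the Arabic table above,
    so any other code (for example "eng") yields None.
    """
    return "ara" if code.lower() in ARABIC_INDIVIDUAL_LANGUAGES else None


if __name__ == "__main__":
    for code in ("afb", "arb", "eng"):
        print(code, "->", macrolanguage_of(code))
```

In this sketch `afb` and `arb` both resolve to the macrolanguage `ara`, while `eng` resolves to nothing, which mirrors the contrast drawn above between Arabic and English.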

Ricardo Vernaut Jr

trang commented 8 years ago

I consider this topic closed.

Damascene stated in https://github.com/Tatoeba/tatoeba2/issues/1079#issuecomment-204771184 that the sentences in what he considers to be Arabic slangs cannot be mixed with Arabic. I've rejected the suggestion to add a "language" called "Arabic slang", therefore we'll be adding Gulf Arabic as a language.

Anyone who doesn't consider these "slang languages" to be actual languages can either simply ignore that they exist in Tatoeba and just contribute in Arabic (Modern Standard), or contact the SIL to urge them to review the way they categorized the Arabic language varieties. If the SIL decides to change their classification, we will follow the changes.

loolmeh commented 8 years ago

Apparently you do not understand that those are not called languages by Arabs. They are slang: no grammar, no official recognition, very few books if any. Every person in any Arab country can write in Arabic, as it's the official language in schools and universities.

I'm sorry, you won't just speak for everyone like that and get away with it. I'm not sure why you feel the need to smother the linguistic heritage of anyone who doesn't sound purely Arab enough to your liking. Here's a modest proposal. You're welcome to spend the next 40 yrs becoming a linguist and trying to convince the rest of the world's linguists that there's a language that hasn't changed one tiny bit in hundreds of years. We'll be here waiting once you succeed and won't break a sweat changing the codes.

damascene commented 8 years ago

@loolmeh are you an Arab? Do you have any scientific source identifying those as languages except for the SIL awkward classification?

loolmeh commented 8 years ago

are you an Arab?

Awfully relevant question. Continue to prove my point.

Do you have any scientific source identifying those as languages except for the SIL awkward classification?

Doing your due diligence on this front as well. I'll give you the benefit of the doubt. Start with Campbell's or Trask's book on historical linguistics. Then get Nader Jallad's book 'The Arabic language across the ages'. Then maybe you can dig through any decent search on google scholar. I'll lend you a helping hand: https://scholar.google.com/scholar?q=%22arabic%22+linguistics+change+origin+contact&btnG=&hl=en&as_sdt=1%2C5

One of the researchers in this field Lameen Souag btw has a blog here: http://lughat.blogspot.fr Literally hundreds of man years worth of careful work has gone into this. Don't expect a few minutes of handwaving on a github ticket to have any weight on this issue.

damascene commented 8 years ago

As you seem to know some authors who wrote on this subject, can you please refer me to a specific book about it instead of authors' names? The only book you mentioned has the title "The Arabic language across the ages", so that seems to confirm what I say: there are no other Arabic languages, just dialects.

I know that there is old Arabic and Modern Arabic, and that new words have entered the vocabulary, but there are no new grammars, and we still understand most of the text written in that language, because we have the Quran, which was written in Arabic 1400 years ago and which we still understand today, thanks to Allah.

loolmeh commented 8 years ago

That's ok I'm done debating this. If a few thousand comparative and historical linguistics papers aren't enough to even pique your interest or change your mind nothing can.

superlinux commented 8 years ago

Listen! my only comment is this: Following such false ISO standards is false. IT DOES NOT REPRESENT US ARABS WHATSOEVER. All this is just part of the plan to divide us as in the NEW MIDDLE EAST conspiracy which we all now know about. Please I ask you to stop using them. What I tell you next is not religion, it's just a mere pure scientific fact that we have ONE AND ONLY ONE LANGUAGE and it is ARABIC , which is the ARABIC OF THE QURAN. What we use formally in writing is Arabic. The spoken language is Arabic too, but borrows words from other surrounding environments, much like (as an example though) when an Englishman uses some French words to express some class. There is no Gulf Arabic nor Martian Arabic.. it's just Arabic.

Hopefully it's clear now!