bible-technology / scripture-burrito

Scripture Burrito Schema & Docs 🌯
http://docs.burrito.bible/
MIT License

Specify what countries means #94

Closed rdb closed 4 years ago

rdb commented 4 years ago

What does "countries" mean? The countries in which the production of the burrito content is taking place? Or the countries that the translation work is intended to be targeting? The documentation currently only specifies that it "contains one or more country elements", which is not particularly helpful.

In some cases this is very obvious, but not so for projects in languages that aren't highly regional. I remember one extreme case in the Registry wherein a user added every country in existence to the metadata of an Esperanto project, resulting in a large unwieldy list. In the Registry, we "solved" this by introducing a ZZ country code indicating that the work was not targeting a particular country.

It's not entirely clear whether the recommended solution in SB is that it should simply be omitted, because while it states that the field is "optional", it also states that there have to be "one or more" elements.

A hypothetical consideration could be that we rename it to "regions" and permit the entry to contain UN M.49 region codes, similar to how the same is possible in BCP47 tags, which would allow targeting eg. "Latin America" or "World".

mvahowe commented 4 years ago

I have no idea how, if at all, this field is actually used. I have always seen it as specifying the target country. As you say, this does get messy for projects that may be used in many countries (anything in Spanish for example).

In general, the rule in the schema is (or should be) that fields may be optional but should never be empty. In other words, there doesn't have to be a countries element, but if there is, it must contain something.
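The "optional but never empty" rule can be sketched as a small check. This is only an illustration: the `check_countries` helper, the plain-dict metadata shape, and the `"countries"` key are assumptions made for the example, not the actual Scripture Burrito schema.

```python
def check_countries(metadata: dict) -> list:
    """Return a list of problems with the 'countries' entry (empty list means OK).

    Hypothetical helper: the dict layout and key name are assumed for illustration.
    """
    if "countries" not in metadata:
        # The field is optional, so leaving it out entirely is fine.
        return []
    countries = metadata["countries"]
    if not isinstance(countries, list) or len(countries) == 0:
        # Present but empty: the rule says to omit the field instead.
        return ["'countries' is present but empty; omit the field instead"]
    return []
```

Under this reading, a project with no particular target country would leave the field out rather than supply an empty list.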

mvahowe commented 4 years ago

I wonder if we need this field at all. Specifying target country is messy (especially if the answer is "Latin America" or "French-speaking Africa") and that concept seems increasingly slippery anyway with digital publishing and online stores.

I wonder if this field was originally intended by someone to distinguish, eg, African French vs Quebec French vs French French. If so, I think this is precisely the sort of problem that BCP47 is supposed to solve for us.

@klassenjm Any ideas about the history of this field?

klassenjm commented 4 years ago

@mvahowe It is messy sometimes. The intention was to identify the country of the primary target audience / language speakers. Of course there are diaspora for many languages, but a Bible translation produced for German speakers by the German Bible Society would indicate Germany here.

With that said - I agree that this is awkward sometimes. The result has been to specify something like "Americas Area" for some entries.

mvahowe commented 4 years ago

@klassenjm Ok, so maybe we should add in some codes for continents and regions?

klassenjm commented 4 years ago

Yes, that could help.

mvahowe commented 4 years ago

It looks like UNM49 would be the way to do regions and continents. (Apparently BCP47 uses these codes for regions/continents and ISO 3166 for countries.)

https://en.wikipedia.org/wiki/UN_M49

jag3773 commented 4 years ago

Can we make it optional? Would that help?

mvahowe commented 4 years ago

UNM49 regions plus ISO-3166 countries is exactly what BCP47 does, so I think we should remove this field and put the information, optionally, into the language BCP47 field.

jag3773 commented 4 years ago

I'm in favor of following BCP47.

rdb commented 4 years ago

BCP47 only allows specifying a single region or country that the language is spoken in, and only when it's needed to disambiguate a variant of a language. So we'd be throwing a large amount of data away by just going with BCP47. Data that's in our case regularly used to determine where translation projects are taking place. (Though admittedly, one of the biggest uses we have for it is detecting when people have specified an incorrect language code).

We could probably make up for some of the loss of information by making use of the Ethnologue API to determine which countries a language is spoken in, but besides the fact that this isn't a freely available API, this doesn't always perfectly reflect the audience of a translation project. A translation may be targeting a particular geographical area even if the language is one that is spoken more widely, and there may be aspects other than language that may influence choices made in the translation (such as cultural differences).

In any case I would think we should do due diligence to make sure that all the data we're throwing away isn't important to some people.
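To make the single-region limitation concrete, here is a rough sketch that pulls the region subtag out of a language tag. The regular expression is a deliberate simplification of the BCP47 grammar (it ignores extended language subtags, grandfathered tags, and private-use tags), and `region_of` is a hypothetical helper name.

```python
import re
from typing import Optional

# Simplified shape: language[-Script][-REGION][-anything else].
# The region is either two letters (ISO 3166) or three digits (UN M49).
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?:-.*)?$"
)


def region_of(tag: str) -> Optional[str]:
    """Return the single region subtag of a tag, or None if it has none."""
    match = _TAG.match(tag)
    if match is None:
        return None
    region = match.group("region")
    return region.upper() if region else None
```

A tag such as `es-419` yields one region and `eo` yields none, so a list like "Ecuador, Peru, Colombia" has nowhere to go in the tag itself.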

mvahowe commented 4 years ago

@rdb Ok, in that case how about allowing the region codes as well as country codes in the countries section? That way we have BCP47-compatible codes in that section.

rdb commented 4 years ago

I don't know what "BCP47-compatible" means in this context, but would be fine with taking a page out of the BCP47 book and allowing UN M49 region codes here.

Should we rename it to "regions", then?
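As a sketch of what accepting both kinds of code in one list might look like, the snippet below classifies entries by shape only. The `classify` helper is hypothetical, and it does not look anything up in the ISO 3166 or UN M49 registries; UN M49 also assigns three-digit codes to individual countries, so a shape check cannot tell a region code from a numeric country code.

```python
import re

COUNTRY_SHAPE = re.compile(r"[A-Z]{2}")  # shape of an ISO 3166-1 alpha-2 code, e.g. "DE"
REGION_SHAPE = re.compile(r"[0-9]{3}")   # shape of a UN M49 numeric code, e.g. "419"


def classify(code: str) -> str:
    """Classify an entry as 'country', 'region', or 'invalid' by shape alone."""
    if COUNTRY_SHAPE.fullmatch(code):
        return "country"
    if REGION_SHAPE.fullmatch(code):
        return "region"
    return "invalid"


# "419" is the UN M49 code for Latin America and the Caribbean; "001" is World.
entries = ["DE", "419", "001", "Americas Area"]
kinds = [classify(entry) for entry in entries]
```

With this shape, a free-text value such as "Americas Area" is rejected, while a code such as `001` could cover the case that the Registry handled with `ZZ`.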

jag3773 commented 4 years ago

Rereading the original post... This issue is to specify what is indicated by this field, which I think @klassenjm did for us from his perspective. This is target country/region information. However, @rdb noted that the data is used for determining "where translation projects are taking place."

It doesn't seem like one field should be used to identify both of those things. Allowing for regions doesn't seem to solve the OP.

mvahowe commented 4 years ago

@jag3773 I was taking @klassenjm 's answer to the "what is indicated" question. I don't have a clear distinction in my head between 'field' and 'where translation projects are taking place', and I'm sure we'll find that there are as many perspectives on that distinction as there are organisations. 'Where translation projects take place' feels to me like an ecosystem sort of question, like discussions about the evolving project. I believe that @rdb was pointing out that this field has been used as an ad hoc integrity check.

The regions thing is another, related issue that ends up making this field more obscure. So, eg, someone produces a Spanish translation in Ecuador. For Jeff's use of this field, ISTM that they should really list every country in Latin America that speaks Spanish. They don't do this because it's tedious, and mad, and maybe because they didn't fund this translation for use in Paraguay, even though people in Paraguay will be using it as soon as it hits YouVersion. If everyone completing this field goes through this kind of process, the field becomes useless. Having the option to specify regions can only help, IMO.

jag3773 commented 4 years ago

I agree that regions are a reasonable set to add. However, I think that the specification needs to be clear as to what the countries field means, like the original post asked. Otherwise we will have even more divergence in terms of how it is being used than previously. This is probably a documentation issue, simply noting something like "expected geographical target of the translation."

rdb commented 4 years ago

The reason why I opened this issue was indeed to get us to decide what it means. Right now, nobody is saying what it means, so people are filling it in according to however they interpret it, which could range all the way from "where the translation project is based" and "location of target audience" to "countries where the language is spoken". In most real-world cases these definitions are the same or overlap, but not in all.

Saying that one of those ways it's used overlaps with BCP47 and then removing the field would mean that we are implicitly defining this field to something that is already redundant. I'm saying that we shouldn't do that, but instead should decide on some reasonable definition that makes sense (and preferably doesn't throw all our data away) and that we can tell people to try to stick to. :)

Regions are nice to have, but off-topic for the original issue.

mvahowe commented 4 years ago

@rdb My suggestion above is to go with Jeff's definition of "the country of the primary target audience / language speakers".

If that's the definition, I'd say that regions are pretty much essential to avoid people adding random subsets of about 30 countries in the case of major languages.

mvahowe commented 4 years ago

I suggest we talk about this. We agree that the current field is ambiguous. The obvious solution is to add more fields. But I'm not convinced that's useful, since it creates one more migration headache, and since it makes the user education task harder, not easier. (We could end up with two or three fields, each containing untrustworthy data, instead of one.)

I suggest we go with Jeff's definition (and add regions) but, as I say, let's talk about it.

jag3773 commented 4 years ago