UNMigration / HTCDS

Human Trafficking Case Data Standard

HTCDS 1.0 just published with still-confusing licensing (conflicting with "open standard") + serious data exchange flaws for (at least) common person names from Latin America and Asia #22

Closed fititnt closed 2 years ago

fititnt commented 2 years ago

I'm glad that HTCDS 1.0 was just published this Sunday.

While I, writing here as a lexicographer, have no problem with the creators of HTCDS or with the United Nations International Organization for Migration employees trying to mediate on what to do, the lawyers responsible for giving copyright advice are the real target of this complaint about the confusing license. Note: we are still trying to work out which license applies to us, and this is not clear.

Note that the lawyers, who still have not replied to any request for clarification after months, are likely trying to copy the failed model of the ISO organization, while the HTCDS obviously requires much more help, because systems worldwide are incompatible. The problem with using ISO as a role model is that ISO actively sends DMCA takedowns against any serious translation initiative, even for COVID-19 response (their standards are "freely available in read-only format" only in English/French). This makes that model unfit for humanitarian usage, where a wrong translation kills people, and it leaves no reference resource for the average English-using developer to avoid creating tooling that will fail when exchanging personal data.

1. Serious data exchange flaws for (at least) common person names from Latin America and Asia

It seems that one requirement, being "compatible with off-the-shelf existing systems" (namely software from a US-based, marketing-focused company called Salesforce), actually introduces serious flaws into data exchange. I will repeat what has already been said here: https://github.com/UNMigration/HTCDS/issues/7#issuecomment-893968812.

A trafficked person with names common in Latin America, if shared using the current standard proposed to UN IOM, will end up with an incomplete name. For names originally not written in Latin script, in addition to the name order likely being swapped, not only is there no way for an organization to share the name in the original script, but there is also more than one transliteration strategy. This makes data exchange about people from Asia much more likely to go wrong, because the IOM reference itself is doing it wrong.

Let me repeat: trafficked Latin Americans and Asians, when their data is exchanged using the current HTCDS 1.0 terminology, are known to be especially likely to get lost. This reason alone is sufficient to care about, even if HTCDS remains English-only.
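To make the failure mode concrete, here is a minimal sketch in Python (the FirstName/LastName field names are illustrative of a generic two-field CRM-style import, not quoted from HTCDS 1.0) of how a common Latin American name degrades under a naive two-field split:

```python
def naive_split(full_name: str) -> dict:
    """Split a name the way many off-the-shelf CRM imports do:
    first token -> FirstName, last token -> LastName."""
    tokens = full_name.split()
    return {"FirstName": tokens[0], "LastName": tokens[-1]}

def roundtrip(record: dict) -> str:
    """Reassemble the name from the two stored fields."""
    return f"{record['FirstName']} {record['LastName']}"

# Two given names plus two family names, a very common pattern in
# Spanish- and Portuguese-speaking countries:
full_name = "María José García Fernández"
record = naive_split(full_name)
print(record)              # {'FirstName': 'María', 'LastName': 'Fernández'}
print(roundtrip(record))   # 'María Fernández' -- 'José García' silently lost
```

The data loss is silent: nothing in the exchanged record indicates that half the legal name was dropped.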

1.2. Why I am already raising this issue about the English version

I'm doing this in public not to shame the current work on HTCDS (this is a very important project; in fact, terminology like this is a generic need, and by creating the Portuguese version we also open ourselves to criticism), but because, if the lawyers keep this license conflict, the work needed to "translate" (in particular the generic Salesforce fields) actually requires a complete rework.

The closest existing work related to person data is https://github.com/SEMICeu/Core-Person-Vocabulary, and most of the corner cases that SEMICeu/Core-Person-Vocabulary covers are very common in humanitarian data. Even if HTCDS 1.0 renamed fields (which would already make it a 2.0), it would at least need to consider people who have more than one official name. (Yes, one person actually can have more than one name; then add the people who have birth names in non-Latin scripts while having their names transliterated, and remember that there is more than one way to transliterate a name.)
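As a minimal sketch of why a single mandatory name pair is not enough (the structure and keys below are illustrative assumptions loosely inspired by the multi-name idea in SEMICeu/Core-Person-Vocabulary, not its actual schema), consider one person whose name legitimately exists in three forms:

```python
# One person, several equally valid name representations: the original
# script, plus two different Latin transliteration schemes.
person = {
    "names": [
        {
            "full_name": "张伟",
            "script": "Hani",            # ISO 15924 script code
            "language": "zh",
            "name_order": "family-first",
            "type": "official",
        },
        {
            "full_name": "Zhang Wei",
            "script": "Latn",
            "type": "transliteration",
            "transliteration_scheme": "Hanyu Pinyin",
        },
        {
            "full_name": "Chang Wei",
            "script": "Latn",
            "type": "transliteration",
            "transliteration_scheme": "Wade-Giles",
        },
    ]
}

# A schema with one mandatory FirstName/LastName pair cannot say which
# of these is "the" name: all are valid, and two organizations that
# transliterate independently will fail to match the same person.
```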

Even if we could comply with such a confusing license, the people from HXL-CPLP who could help us would need to attempt implementations outside the HTCDS repository, in particular for how to deal with names written in non-Latin scripts.

2. Our approach to this license conflict

We from HXL-CPLP will release the concepts (extra descriptions, translations, examples, etc.) and the templates to build glossaries and data schemas under public domain. In fact, this has been the case since HTCDS v0.2.

Cease-and-desist letters (or requests for help from other implementers who receive DMCA takedowns) are welcome at rocha@ieee.org.

The HTCDS standard can keep whatever license it wants. But we will not wait while people bury their heads in the sand, and we will not stop just because of a thing known to get our names wrong, merely because it was easier to require that the standard be compatible with software used for marketing. I will explain why we will release our terminology into the public domain while still making it reusable for humanitarian usage.

2.1. Why we, the initial team from HXL-CPLP, refuse to let any organization hold copyright over all concepts

The way lexicography is done is close to words in a dictionary. We even developed software to convert not only to translator formats like XLIFF, but also to TBX. This means our average spreadsheet is a frontend, like the European IATE https://iate.europa.eu/.
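As a minimal sketch of that idea (this is illustrative, not the actual HXL-CPLP tooling; the row layout and helper name are assumptions), one spreadsheet row per concept can be exported as a TBX-style term entry:

```python
from xml.sax.saxutils import escape

# One spreadsheet row: a concept ID plus one term per language column.
row = {
    "concept_id": "C001",
    "eng-Latn": "family name",
    "por-Latn": "sobrenome",
    "spa-Latn": "apellido",
}

def row_to_tbx_entry(row: dict) -> str:
    """Render one concept row as a TBX-style <termEntry> fragment."""
    lang_sets = []
    for column, term in row.items():
        if column == "concept_id":
            continue
        lang = column.split("-")[0]  # crude language part of a BCP-47-like tag
        lang_sets.append(
            f'  <langSet xml:lang="{lang}">\n'
            f'    <tig><term>{escape(term)}</term></tig>\n'
            f'  </langSet>'
        )
    header = f'<termEntry id="{escape(row["concept_id"])}">'
    return header + "\n" + "\n".join(lang_sets) + "\n</termEntry>"

print(row_to_tbx_entry(row))
```

The same row can feed XLIFF, a glossary PDF, or a data schema; the spreadsheet is just the human-friendly view of the concept database.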

So, while maybe the name and the description of HTCDS could be copyrightable, for concepts like how to break down a person's name or a birthdate, any implicit implication of the current license that tries to deny reuse for other cases is clearly unreasonable. It's like trying to copyright a word in English. This is absurd.

2.1.1 Some quick context (for technical people, not the lawyers, to get an idea of the building blocks)

What for HTCDS is a standard (a composition of words and definitions) is, in our case, work broken down into concepts (as in concept-based translation, instead of term-based translation), which should carry added explanations to help translators differentiate ambiguous terms. The final result can then be exported both to create a glossary (like a PDF) and to templated files from which terms can be extracted back to generate something like a data schema, or even scripts to convert data from one format to another. From the more "end user collaborator" perspective, what started with HTCDS 0.2 was this:

Every script is public domain and optimized to go from new terms to actionable scripts ready to be published at the speed needed for emergency response. This means, for example, that in the same way new scripts/data schemas/documentation can be templated for new versions of something related to HTCDS, previous terminologies can be reused for new implementations also related to humanitarian response.
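A minimal sketch of that templating step (illustrative only, not the actual HXL-CPLP scripts): one concept table, two generated artifacts, a glossary line and a data-schema field stub.

```python
# The concept table: definitions plus per-language terms.
concepts = {
    "person_full_name": {
        "definition": "Complete name of a person as officially documented.",
        "terms": {"eng-Latn": "full name", "por-Latn": "nome completo"},
    },
}

def to_glossary(lang: str) -> str:
    """Render the concept table as glossary lines in one language."""
    return "\n".join(
        f"{c['terms'][lang]}: {c['definition']}" for c in concepts.values()
    )

def to_schema() -> dict:
    """Render the same concept table as a JSON-Schema-like field stub."""
    return {
        concept_id: {"type": "string", "description": c["definition"]}
        for concept_id, c in concepts.items()
    }

print(to_glossary("por-Latn"))  # human-readable glossary view, Portuguese
print(to_schema())              # machine-readable schema view
```

The point is that the concepts are the single source of truth; glossaries and schemas are disposable outputs that can be regenerated for each new version.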

There was one problem we empirically realized while doing technical translations around HTCDS 0.2 (this is actually the reason why it is harder to scale up translations; but since there is no one to do this, it is unlikely to ever happen beyond English):

1. This type of "translation" is actually a type of multilingual controlled vocabulary, which makes it orders of magnitude harder to bootstrap "translations" if the initial concepts are not well planned (and, by the way, most of the fields based on Salesforce are beyond repair).
2. Even if well planned, some terms are so new that they must be introduced into the target languages (which makes a "provisional terms" step necessary for years, while actively publishing explanations on sites like Wiktionary and incentivizing publication in traditional dictionaries). And by "target language" this can even mean English (as in the case of deciding how to break down a person's name).

Our point here is: serious terminological translation would also need (as we did with our software) to allow glossary exportation for key terms, and it is inherently reusable. The HTCDS is actually just one of the items in our current Request For Feedback (https://docs.google.com/spreadsheets/d/1ysjiu_noghR1i9u0FWa3jSpk8kBq9Ezp95mqPLD9obE/edit#gid=846730778). We are, for example, also aiming at something like a lexicography of common terms used in COVID-19 data exchange (focused on public data), an area which also has other complaints, like https://www.sciencediplomacy.org/article/2020/we-can-do-better-lessons-learned-data-sharing-in-covid-19-pandemic-can-inform-future.

2.2 Copyright over generic concepts goes against the promises that need to be made to volunteer translators/terminologists

The average person willing to help is not only acting in good faith, but is likely to actually either be a victim of human trafficking or to know how bad the real situation is. Especially after the Afghanistan crisis, with interpreters left behind, I'm hearing such a high level of distrust that the bare minimum we can do for these people is to make sure their work remains usable even assuming the current copyright holder will not be interested in it (or may be forced to DMCA it down).

Also, as I said earlier (about the known issues with nomenclature that is flawed for common names in Latin America), translators are scared to the point that we would have to rewrite the thing, because it is beyond repair. Again: I'm not complaining about the people currently editing the HTCDS, because I know this is much more complicated. The point is that this would need much more help from outside, and the licensing is not clear enough about whether this will already become an "orphan work" in the next few years.

While we were already concerned with translation from HTCDS 0.2 onward, we would like to point out that even if IOM follows the path of that fantastic reference, the UN M49 standard, and obtains accredited translations into the UN working languages, Portuguese would not be one of them. That's one reason why I'm trying to make some of the workflow easier here.

This means that even if the current copyright holder of HTCDS tries its best, we realistically would not have accredited translations into, say, Hindi or Portuguese. That makes UN IOM announcing a standard while potentially denying translations (even volunteer-based ones, made in the public interest) into, for example, Portuguese (as ISO organizations do) a threat to implementation in countries like Angola, Brazil and Portugal. Remember: we're already stressed by ISO's way of "protecting" their standards, and saying out loud that the current copyright holders don't have the infrastructure to provide a Portuguese version is simply realistic. I'm not complaining; I'm just saying that we here would need a higher threshold. But since this is in everyone's interest, there is no need to make it harder. Remember: the average people willing to help with these subjects really do care.

2.3 Copyright over generic concepts in multilingual controlled vocabularies goes against other UN agencies, the Red Cross, Amnesty, etc.

Except for concepts too specific to human trafficking, many concepts from one project (as in forms to exchange data) are usable by other organizations. Salesforce fields are not a good reference. Having multilingual controlled vocabularies, very well reviewed with implementers, for generic terms like a person's name is a serious need for humanitarian organizations. The IOM lawyers clearly don't know how badly other humanitarian organizations need this, in particular for private data. Or, again, they are taking inspiration from the failed ISO approach.

Organizations like Amnesty have an interest in the existence of standards for sharing police cases, as a way to enable strategies like identifying police torture.

Implementers (like the ones who give aid) even resort to biometrics because something as simple as a person's name is not standardized, and storing that private data puts a lot of pressure on a few developers. This means software written in English gets wrong data even when it is entered by information managers reading people's official identification cards, which leaves no alternative but to collect biometrics, and biometrics are known to come under pressure to be shared even with governments that may later target the person.

I could cite so many examples here where the lack of authoritative terminology is a problem beyond human trafficking data exchange. But, again, the point from the view of a lexicographer is to maximize usage/reuse of vocabulary, and even if the best approach would be to donate translations to the organizations that can actually endorse them, I'm not even sure work donated to HTCDS would be allowed to be reused by other humanitarian organizations.

End comments

Since even after HTCDS 1.0 there is still no clear license (and no clarification on what to do, beyond no response), we will keep the drafts under public domain. This keeps them reusable in the middle of such confusion.

The people behind this data are not numbers. If the English-speaking community is so accustomed to this that it does not at least make things easier for other languages, then seriously: just do the paperwork. Neither age nor experience makes this type of thinking a good role model for people from other regions.

VerenaSattler commented 2 years ago

Dear Emerson,

Many thanks for your comments and thorough feedback, which are much appreciated. As mentioned previously, IOM is working on updating the license and the current license has been a placeholder until we resolve this. We are also currently working on translating the standards and guidance into all UN languages, with a first set of translated materials uploaded in October. Please let me know if you would be interested in connecting bilaterally to further discuss some of the points raised above.

Best, Verena and team

fititnt commented 2 years ago

Update:

I will post a different GitHub issue about translations, but here are some examples regarding the person-name fields.

Full personal name concept

In addition to https://github.com/SEMICeu/Core-Person-Vocabulary (which has more detail), there is another reference that may be relevant for translators, in particular if only two fields need to be kept.

From my research into the references best used in practice, the strongest seems to be https://www.interpol.int/How-we-work/Forensics/Disaster-Victim-Identification-DVI (topic "Identification Form"). The "(s)" in the English, Spanish and French versions seems to help with people who have multiple given names. I'm not 100% sure how Chinese (a macrolanguage) would work, but having only two divisions of a person's name (unlike Salesforce, which can use up to three, of which only two are "core fields" in HTCDS 1.0) may actually be fine.

(Screenshot of the Interpol DVI "Identification Form" name fields, 2021-09-30.)
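A minimal sketch of that two-field "name(s)" model (the labels mirror the Interpol DVI form idea; the data structure itself is an illustrative assumption, not a quoted schema), showing how it avoids the data loss from the earlier naive split:

```python
# Each of the two fields may hold several tokens, so nothing is lost.
record = {
    "given_names": ["María", "José"],          # "Given name(s)"
    "family_names": ["García", "Fernández"],   # "Family name(s)"
}

def display(record: dict, family_first: bool = False) -> str:
    """Render the record; family_first covers name orders such as
    Chinese or Hungarian, where the family name is written first."""
    given = " ".join(record["given_names"])
    family = " ".join(record["family_names"])
    return f"{family} {given}" if family_first else f"{given} {family}"

print(display(record))                     # María José García Fernández
print(display(record, family_first=True))  # García Fernández María José
```

Because name order is a rendering decision rather than a storage decision, the same record round-trips safely between systems with different conventions.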

Partial personal name concept (example: if an implementer of HTCDS intentionally does not want the full name)

Just to say upfront: if for some reason enforcing the full personal name is not certain, and an alternative for partial names is still needed, what could be done before starting translations is to intentionally define two sets of alternatives for how to express a personal name. Or a third one that explicitly allows the implementer (for example, via interface hints) to use either full or partial naming.

One point here is that ambiguity in the name concept is not workable when translating.

Extra: what if HTCDS is still not sure exactly which approach to use for fields

My suggestion is to think of fields as building blocks.

The standard itself may endorse one default approach (to the point of being what today is the generated XLSX, etc.). But translators (and actually even the English version) can have a separate file with the options, and these options can have authoritative versions in all natural languages.

This approach of separating what is the standard from what are the terms helps, for example, to already request translations of concepts that MAY be needed; they are saved and shared, but not endorsed. This could also help in particular when translators suggest new terms, but the addition of such terms would need re-evaluation for privacy or something similar.
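A minimal sketch of such an options file (the keys, labels, and Portuguese translations below are illustrative assumptions): the endorsed default is kept separate from non-endorsed candidates, so translators can work on options before the standard commits to them.

```python
# The standard points at one endorsed option; candidates are translated
# and shared, but carry a status that makes the lack of endorsement explicit.
name_field_options = {
    "endorsed": "full_name_single_field",
    "options": {
        "full_name_single_field": {
            "status": "endorsed",
            "labels": {"eng-Latn": "Full name",
                       "por-Latn": "Nome completo"},
        },
        "given_plus_family": {
            "status": "candidate",   # translated and saved, not endorsed
            "labels": {"eng-Latn": "Given name(s) / Family name(s)",
                       "por-Latn": "Nome(s) próprio(s) / Sobrenome(s)"},
        },
    },
}

default = name_field_options["options"][name_field_options["endorsed"]]
print(default["labels"]["por-Latn"])   # Nome completo
```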

fititnt commented 2 years ago

FANTASTIC! 🎉

The main point of this issue was solved last Friday (commit references: https://github.com/UNMigration/HTCDS/commit/57d3cad9b03bc9fca4aefe7b07503c8d169043e2 and https://github.com/UNMigration/HTCDS/commit/c928cd31efad2c0e9e251bf6020b207cf3cad3e0) with Attribution 4.0 International (CC BY 4.0)!!

Three points:

1. On the humanitarian usage

The comments on the new license now fix the explicit issue from "2.3 Copyright over generic concepts in multilingual controlled vocabularies goes against other UN agencies, the Red Cross, Amnesty, etc." for urgent humanitarian usage on public repositories, even by UN agencies, without needing to wait for the IOM lawyers. Fan-tas-tic!

For external translation initiatives or republications (something necessary to have file formats or APIs ready for emergency use upfront), issues such as the United States case Holder v. Humanitarian Law Project, as pointed out by the ICRC, can at worst lead to the removal of the English version. I'm not saying HXL-CPLP alone, but from time to time initiatives such as TICO-19 prove that viable translations into dozens of languages are possible under public domain. They even admit, both in their paper and on a podcast, the lack of a friendly license on the source content, and also that non-English teams can deliver better results than using the initial material for every language variant.

I'm not sure if we from HXL-CPLP (as planned initially) will scale up translations directly, or will take time and over some months improve the general tooling and strategies so that other major players or interested organizations can do such tasks.

1.1 On scaling up translations

The people who bootstrapped TICO-19 (which anyway would likely do a public domain dedication; TICO-19 had so many groups involved that anything other than public domain would be complicated) could maybe do the same with HTCDS. I did not contact them explicitly, but I will comment here. Doing it "as well as possible" (without rushing, as TICO-19 was done in 2 weeks), if Google/Facebook (or any other big company) decides to take on the main task of compiling the work, we're talking about 80-100+ languages, most of them not European languages at all, so there is plenty of room for human error between translators/reviewers before the final deliverable for public use. If we from HXL-CPLP started doing the same, eventually we could hit some limits; then perhaps extra time could be spent just on the issues that appear when the number of translations is already very high. Another reference with fewer translations (the European Union case), SEMICeu/Translations-review-CoreVocs, seems quite manual, and this makes it harder for them to automate putting the translations into the published RDFs even after already having them.

Obviously, when contacting others we would make clear that @HXL-CPLP / @EticaAI are not affiliated with @UNMigration.

This potential idea started when we discovered their initiative just recently (we had missed it since 2020) and began converting their final result to another format at https://github.com/EticaAI/tico-19-hxltm. I will give some context on the challenges faced by TICO-19. Note, however, that TICO-19 was bootstrapped under urgency; something like HTCDS doesn't need this, but some comments are still relevant, since they're about how hard it is to manage the back and forth.

First, and really important: some organizers at the companies involved complained about becoming so exhausted that they would hesitate to ever attempt it again. The closest thing to what would be terms on a spreadsheet were their "terminologies" (over 80 languages from Facebook and over 100 from Google, both done by paid professional translators), and the source terms had already been collected without much context. While I do complain about one point or another of HTCDS, their terms were rushed, so it was already harder (nothing was done to make translation easier). However, the way the data inputs would be reused (and such companies would like to reuse them, as there is no decent reference on things like how to label the parts of a person's name) could make them (the humans who could manage this again) more cautious.

Translation reviewers quite often criticized the initial professional translator, so the organizers had to decide what to do while being unable to evaluate the work themselves. So, in the worst case, a main concern was that the generated translations not be harmful. Add to this that some terms were so new that they had to be introduced into some languages (not just looked up in a dictionary), and this is very likely to happen with some terms in HTCDS. Maybe even "human trafficking" doesn't have a well-defined term in several languages. And if "UN Migration" appears in any part of the text, then it very likely doesn't have an official translation in dozens of languages.

One main reason for translators making mistakes is not understanding the source terms, which quite often generates "literal translations". One thing we found while converting their files is that literal translations of abbreviations are unlikely to be detected by whoever does the lexicography (e.g. compiling others' work) for writing systems the organizer doesn't understand. In languages written in non-Latin scripts where the transliteration of "Coronavirus" (English) would be "Koronavirus", the professional translators quite often rendered "CV" (English) as a literal, letter-per-letter "CV" instead of "KV", and this went undetected past at least one translator and one reviewer. Then, from the point of view of whoever compiles/organizes the results, another issue was a few translations with wrong language codes (examples: ar-AR, "Arabic as in Argentina", and es-LA, "Spanish as in the Lao People's Democratic Republic"; at least these common mistakes are possible to automate away), and one case where the reviewer was actually a native speaker of a very different dialect than the translator (they discuss this on the podcast). Note that both Facebook and Google organized far, far more translations in a short period of time (more than the European Union or the UN would do using their own resources), and the mere fact of having professional translators is not sufficient for terminology. Sentences and paragraphs are easier; important short terms are very hard.
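A minimal sketch of automating those language-code checks (the rules are keyed to the two mistakes mentioned above; the helper name and heuristic are illustrative assumptions, not the actual TICO-19 or HXLTM tooling):

```python
# Known-bad language-region codes seen when compiling translations: the
# region tag accidentally abbreviates the language name instead of
# naming the real target region.
KNOWN_BAD = {
    "ar-AR": "'AR' is the region code for Argentina; plain 'ar' was likely meant",
    "es-LA": "'LA' is the Lao People's Democratic Republic, not 'Latin America'",
}

def check_language_code(code: str) -> str | None:
    """Return a warning for codes matching known compilation mistakes,
    or a soft 'review' note when the region merely repeats the language
    subtag (valid for pt-PT or tr-TR, but a frequent copy-paste error)."""
    if code in KNOWN_BAD:
        return f"ERROR {code}: {KNOWN_BAD[code]}"
    lang, _, region = code.partition("-")
    if region and region.upper() == lang.upper():
        return f"REVIEW {code}: region repeats the language subtag"
    return None

for code in ["ar-AR", "es-LA", "pt-BR", "pt-PT", "hi-IN"]:
    print(code, "->", check_language_code(code) or "ok")
```

Checks like this cannot catch wrong translations, but they catch mislabeled ones, which is exactly the class of error an organizer who doesn't read the script can still fix.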

My idea is both to spend some time on @SEMICeu, in particular Core-Person-Vocabulary, and to work a bit more on https://github.com/EticaAI/tico-19-hxltm (more on the aspect of file conversions and tooling to detect human errors when compiling translations together). So, at a bare minimum, there would be planning ahead (both with source content already in formats perfect for going back and forth, and with content easier to scale up), aiming to mitigate part of their stress, so that not only is the quality likely to improve, but they would actually be willing to do it again.


To finish this point: I think focusing on the best under the circumstances (like the "CODs are 'best available' datasets" approach) is the ideal strategy, rather than "perfect", combined with means for each party to annotate issues (maybe even codes for these issues, so people don't always have to write free text). It is better not to be scared by the fact that even the best professionals working on such tasks may not be sufficient. For example, even @SEMICeu, which to my knowledge is likely the best reference in its area, is reviewing the quality of older concept definitions, and some of the more generic subjects are very hard.

Then, since HTCDS is not urgent, one way to mitigate the results of asking collaborators to do only the best under the circumstances is to define limits (like the cases where it is better to leave translations empty) and/or to allow terminology translators to mark a translation as provisional together with a reason. But such limits need to be agreed upfront. There are other points, but note that the way data is passed to translators by these companies and imported back may not store extra metadata, so they could need at least a few weeks, even if the dates are fixed. Note also that this level of multilingual terminology intended for use in data exchange is very rare, and the closest file format for it, TBX, is de facto used by no one collecting such translations; so it makes sense for us (even if they do not use some standard format) to create an extension to import the results back into HXLTM, and then we generate everything else from there.

That's it!

2. On the side comment about English terminology for person names

About the "(at least) common person names from Latin America and Asia" in the title: this topic is more complicated than HTCDS, but at the same time it is worth eventually having strictly curated translations to be reused.

We from HXL-CPLP (also after feedback from colleagues to first try to cooperate with SEMICeu rather than start from scratch) are beginning to propose a way which would allow reuse of terminology language variants planned from the start to be equally authoritative. A relevant issue there is this one about codes, https://github.com/SEMICeu/Core-Person-Vocabulary/issues/39, which, beyond other advantages, extends the approach of using concept-based translation (instead of term-based) to more than translations. This doesn't mean each group creating data standards would need to have such codes, but they're useful when preparing the work for translation.

3. Thanks to everyone here

I'm closing this issue by thanking everyone who did their best to solve this licensing issue as fast as possible 🥲