cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0
4 stars 3 forks

[Transcription System] X-SAMPA and/or Kirshenbaum #84

Closed — tresoldi closed this issue 6 years ago

tresoldi commented 6 years ago

Should these systems be considered? They would be good candidates for inclusion if we plan to develop a method for recognizing an unknown transcription system. They might also serve as an alternative, or a fallback, for situations in which Unicode is still not acceptable.

The inclusion would be straightforward: a preliminary mapping could be derived simply from the respective Wikipedia articles.

LinguList commented 6 years ago

Yes, I was thinking about that. But given that there's a JS-sampa application which I regularly use (and which people could use to insert data), I was asking myself whether it is needed in the end, since SAMPA is thought of more as a system that renders a certain subset of Unicode symbols using ASCII characters, right? One could, however, also just test how far we can go with this. For rendering, though, I recommend checking both the lingpy-sampa symbols and BXS.vim, which I have further extended over the last years:

bxs.vim.txt (https://github.com/cldf/clts/files/1606915/bxs.vim.txt)

In fact, one could probably use this to replicate BIPA in Sampa-form, maybe a good idea?

tresoldi commented 6 years ago

I was thinking about adding them so that clts would not need to rely on such tools (which are good and useful, of course), not even in the web interface. The benefits of a single programming interface need not be mentioned -- I am actually being a bit selfish here, as I am starting to rely on clts while developing my model.

A full clts integration would also bring some advantages, such as being able to translate a feature description into an X-SAMPA "grapheme". It would also provide some additional statistics, which is always good.

I could take care of this once I finish UPA and Ruhlen, if you want.


LinguList commented 6 years ago

Yes, excellent! I agree that having this would also facilitate handling for those who use the Python interface.

LinguList commented 6 years ago

The more I think about it, the more I think that SAMPA is not a transcription system but a transliteration of IPA. What we could consider instead is to take the sampa2uni function from lingpy and plant it into util or another part of the package, to make clear that SAMPA conversion is a task for CLTS, but without treating SAMPA as a transcription system. Since SAMPA can be further extended to cover more than the usual symbols, we could even add a sampa keyword to all transcription-system methods, and would then be able to query strings using SAMPA without forcing it to be used as a full-fledged way of transcribing things. LingPy's sampa2uni function in fact handles most cases we would expect, so it would be straightforward to take it from there and later kick it out of lingpy.
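(For the record, the idea behind a sampa2uni-style conversion can be sketched as a plain lookup table plus greedy longest-match replacement. The mapping below is a tiny illustrative subset I made up for the example, not lingpy's actual table or API:)

```python
# Minimal sketch of a SAMPA -> Unicode-IPA converter in the style of a
# sampa2uni function: a lookup table applied by greedy longest match.
# The mapping is an illustrative toy subset, not the real lingpy data.
SAMPA_MAP = {
    "t_h": "tʰ",   # aspirated t: the ASCII sequence '_h' acts as a diacritic
    "t": "t",
    "S": "ʃ",
    "N": "ŋ",
    "@": "ə",
    "a": "a",
    "E": "ɛ",
}

def sampa2uni(text: str) -> str:
    """Convert a SAMPA string to Unicode IPA by longest-match lookup."""
    out = []
    i = 0
    # try longer keys first so that 't_h' wins over bare 't'
    keys = sorted(SAMPA_MAP, key=len, reverse=True)
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(SAMPA_MAP[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # pass unknown characters through
            i += 1
    return "".join(out)
```

With this toy table, `sampa2uni("t_haN")` yields "tʰaŋ".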

xrotwang commented 6 years ago

That's a bit philosophical, isn't it? What would be the difference between "transliteration of IPA" and "transcription system"?

When you say

LingPy's sampa2uni-function in fact handles most cases we would expect

then that is exactly the problem CLTS should solve, i.e. turning implicit, possibly complex code into declarative, transparent data. So resorting to the old code after all, when the aim was to describe what it actually is that "we expect", seems like a failure.

LinguList commented 6 years ago

But the essence of SAMPA is completely different from the essence of transcription systems. The parsing algorithm needs to be different, since SAMPA does not make the distinction between diacritics and base characters, but instead uses sub-characters to turn a base character into a diacritic. As a result, you cannot parse the grapheme t_h in SAMPA using our current transcription-system code (which works well for other systems), simply because SAMPA was created to translate from ASCII glyphs (as opposed to graphemes) to Unicode glyphs. For SAMPA, all you need is a lookup table, or an orthography profile (maybe even better!), to turn it into Unicode IPA. Our pre-defined transcription-system code, however, won't work on it unless we spell out all characters.
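(To illustrate the structural point: a SAMPA-aware grapheme splitter has to group ASCII modifier sequences like `_h` with the preceding base glyph, rather than inspecting Unicode character properties as an IPA parser would. A toy sketch, not CLTS code:)

```python
# Sketch of why SAMPA parsing differs from IPA parsing: a "diacritic"
# in SAMPA is not a separate Unicode character but an ASCII sequence
# such as '_h' attached to the preceding base glyph.
def split_sampa(text: str) -> list[str]:
    """Split a SAMPA string into graphemes, attaching '_X' modifiers."""
    graphemes = []
    i = 0
    while i < len(text):
        g = text[i]
        i += 1
        # absorb any number of '_X' modifiers into the current glyph
        while i + 1 < len(text) + 1 and text[i:i + 1] == "_" and i + 1 < len(text):
            g += text[i:i + 2]
            i += 2
        graphemes.append(g)
    return graphemes
```

So `split_sampa("t_haN")` gives `["t_h", "a", "N"]`, whereas a naive character split would tear `t_h` into three pieces.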

xrotwang commented 6 years ago

Ah ok, I see. So basically, SAMPA is an orthography and should thus be handled via orthography profiles. That's ok, since this uses a different, but also well-described and transparent, mechanism :)

LinguList commented 6 years ago

Yes, I think this is the best way to go: we make a huge orthography profile (no need to use lingpy's algorithm) with all 6000+ symbols, converting them to SAMPA where possible, provide it as an orthography profile, and allow it to be loaded quickly. I am just wondering: since orthography profiles are in some way important for CLTS, should we consider putting the SAMPA profile into the segments package, or rather into the clts package?
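(In orthography-profile terms, such a table is just a two-column Grapheme/IPA mapping. A toy sketch with a handful of rows standing in for the 6000+ real ones, parsed with the stdlib rather than the segments package, and applied by the same longest-match segmentation an orthography profile implies:)

```python
import csv
import io

# A toy orthography profile mapping SAMPA graphemes to IPA; the real
# profile would cover the full 6000+ graphemes. Columns follow the
# usual orthography-profile convention: Grapheme plus a target column.
PROFILE_TSV = """Grapheme\tIPA
t_h\ttʰ
t\tt
S\tʃ
a\ta
N\tŋ
"""

profile = {
    row["Grapheme"]: row["IPA"]
    for row in csv.DictReader(io.StringIO(PROFILE_TSV), delimiter="\t")
}

def tokenize(text: str) -> list[str]:
    """Segment a SAMPA string via greedy longest match against the profile."""
    segments, i = [], 0
    keys = sorted(profile, key=len, reverse=True)
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                segments.append(profile[key])
                i += len(key)
                break
        else:
            segments.append("\ufffd")  # replacement marker for unmatched input
            i += 1
    return segments
```

Here `tokenize("t_haN")` yields `["tʰ", "a", "ŋ"]`; unlike a plain string replacement, the profile gives back segmented output.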

xrotwang commented 6 years ago

I guess it would make sense in the segments package, considering that SAMPA is one of the most prominent ways to write IPA. This would increase the immediate usefulness of the segments package, beyond "just" proper Unicode tokenization.

LinguList commented 6 years ago

Okay. I suppose we transfer this issue now and make the argument that, with the help of our ~6000 segments, we could easily provide a huge number of possible segmentations even of SAMPA (maybe excluding clusters, as they would mess it up), and that this can be included in the next release of segments. At the same time, we may also use our 6000+ graphemes in our BIPA system to produce an orthography profile that could at some point be used instead of lingpy's segmentation algorithm.