PhonologicalCorpusTools / CorpusTools

Phonological CorpusTools
http://phonologicalcorpustools.github.io/CorpusTools/
GNU General Public License v3.0

Question re: Interlinear files #318

Closed kchall closed 9 years ago

kchall commented 9 years ago

So, Avery and I are working on getting more of the Gitksan files turned into a corpus. As a general principle, it would be good to be able to link orthography to phonetic transcription anyway, but in Gitksan it turns out to be particularly important: the orthography is not only fairly transparent, it is largely allophonic, whereas the transcription is more phonemic! So it would actually be useful to run many of our analyses on the orthographic representations instead.

But here's the question: how do we get PCT to LINK the orthographic and phonemic representations? (I mean, we could create separate corpora for the two, but that's not an ideal solution.) They are currently laid out interlinearly. How easy would it be to have PCT read those in, alternating between reading into the spelling list and into the transcription list? (Obviously it's on us to make sure the word breaks are in the same places, but assuming they are, can we make this another priority?)

@mmcauliffe @jsmackie
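For reference, the alternating read asked about above could look roughly like the sketch below. This is not PCT code: `read_interlinear` is a hypothetical helper, the sample lines are made-up English placeholders rather than Gitksan data, and the period-delimited transcription format is an assumption.

```python
def read_interlinear(lines):
    """Pair each spelling line with the transcription line that follows it.

    Hypothetical helper, not part of PCT. Assumes non-blank lines strictly
    alternate: spelling, transcription, spelling, transcription, ...
    """
    # Drop blank lines so that only content lines alternate.
    content = [ln.strip() for ln in lines if ln.strip()]
    if len(content) % 2 != 0:
        raise ValueError("expected an even number of non-blank lines")

    words = []
    for spelling_line, transcription_line in zip(content[0::2], content[1::2]):
        spellings = spelling_line.split()
        transcriptions = transcription_line.split()
        # The word breaks have to line up between the two tiers.
        if len(spellings) != len(transcriptions):
            raise ValueError(
                f"word count mismatch: {spelling_line!r} vs {transcription_line!r}"
            )
        for spelling, transcription in zip(spellings, transcriptions):
            # Assumption: segments within a word are separated by periods.
            words.append((spelling, transcription.split(".")))
    return words


# Made-up placeholder input, only to show the expected layout.
sample = [
    "the cat",
    "ð.ə k.æ.t",
    "",
    "sat down",
    "s.æ.t d.aʊ.n",
]
pairs = read_interlinear(sample)
```

Each entry in `pairs` keeps a spelling together with its list of segments, so one corpus could carry both tiers instead of needing two separate corpora.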

mmcauliffe commented 9 years ago

I don't think it would be too bad, provided they were plain text files and not docs or something. Could you put some example files on dropbox?

kchall commented 9 years ago

Great, thanks. I put some samples in "Sample_Interlinear_Texts" in the test files folder of Dropbox.

mmcauliffe commented 9 years ago

So it's basically working on the ilg-loading branch (where I collapsed orthography, transcription, and interlinear loading into a single "Load from Running Text" dialog, since they overlap a lot), but some of the more advanced things should probably get worked out at our next meeting. For instance:

What about translation lines? At the moment it ignores every line starting with ", but that's probably not the best solution. Maybe a checkbox indicating whether the file includes translation lines (and then grab the first and second line of every three lines or something)?

Should punctuation be ignored for transcriptions as well as spelling?

What about glosses (might be cool to support)? But to really do it well, we'd probably need a lot better support for morphologically complex things.
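To make the first two questions concrete, here is a rough sketch of the current quote-prefix filter and of punctuation stripping as a per-tier option. The names `content_lines` and `tokenize` are hypothetical and not taken from the ilg-loading branch, and the sample lines are invented.

```python
import string


def content_lines(lines, skip_prefix='"'):
    """Drop blank lines and lines starting with skip_prefix.

    Mirrors the behaviour described above, where any line that starts
    with a double quote is treated as a translation and ignored.
    """
    kept = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith(skip_prefix):
            kept.append(stripped)
    return kept


def tokenize(line, ignore_punctuation=True, punctuation=string.punctuation):
    """Split a line into words, optionally trimming punctuation at word edges.

    Only leading and trailing characters are trimmed, so delimiters inside
    a word (for example periods between segments) are left alone.
    """
    tokens = []
    for raw in line.split():
        token = raw.strip(punctuation) if ignore_punctuation else raw
        if token:
            tokens.append(token)
    return tokens


# Invented sample: spelling line, transcription line, quoted translation.
sample = ["Hello, world!", "h.ɛ.l.oʊ w.ɜ.l.d", '"a greeting"']
kept = content_lines(sample)
spelling_words = tokenize(kept[0])
transcription_words = tokenize(kept[1], ignore_punctuation=False)
```

Keeping `ignore_punctuation` as a separate switch for each tier would leave the transcription question open to the user, which may matter if a transcription system uses punctuation characters as symbols.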

kchall commented 9 years ago

Cool, thanks, Michael! A couple of things:

-when trying to add a digraph, I got the following error:

Traceback (most recent call last):
  File "/Users/kchall/Desktop/CorpusTools/corpustools/gui/widgets.py", line 487, in construct
    possible = sorted(self._parent.characters - minus, key = lambda x: x.lower())
AttributeError: 'CorpusFromTextDialog' object has no attribute 'characters'

-when the corpus actually loads, the transcriptions aren't showing up in the corpus section, just in the running text section:

[Screenshot: interlinear_01]

-we should also add parameters for things like "which line shows orthography?" "transcription?" "translation?" (they're usually going to be in that order, but we shouldn't assume it...)
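One possible shape for those parameters, as a sketch only: a small layout object that says how many lines make up a group and which position holds which tier. `LineLayout` and `split_groups` are hypothetical names, not existing PCT classes or functions, and the sample lines are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineLayout:
    """Hypothetical description of one interlinear group (0-based positions)."""
    lines_per_group: int = 3
    orthography: int = 0
    transcription: int = 1
    translation: Optional[int] = 2  # None if the file has no translation line


def split_groups(lines, layout):
    """Cut the non-blank lines into groups and label each line by its role."""
    content = [ln.strip() for ln in lines if ln.strip()]
    if len(content) % layout.lines_per_group != 0:
        raise ValueError("line count is not a multiple of lines_per_group")

    groups = []
    for start in range(0, len(content), layout.lines_per_group):
        chunk = content[start:start + layout.lines_per_group]
        groups.append({
            "orthography": chunk[layout.orthography],
            "transcription": chunk[layout.transcription],
            "translation": (
                None if layout.translation is None else chunk[layout.translation]
            ),
        })
    return groups


# Invented sample in a non-default order: transcription first, then spelling.
sample = ["k.æ.t", "cat", '"a cat"', "", "d.ɔ.g", "dog", '"a dog"']
layout = LineLayout(orthography=1, transcription=0, translation=2)
groups = split_groups(sample, layout)
```

The defaults follow the usual order (orthography, transcription, translation), while a file in a different order only needs a different `LineLayout`.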

mmcauliffe commented 9 years ago

I'm going to close this issue now since the ILG branch has been merged in and the basic functionality is there, but feel free to open more specific issues related to the behaviour of loading ILG!