PhonologicalCorpusTools / CorpusTools

Phonological CorpusTools
http://phonologicalcorpustools.github.io/CorpusTools/
GNU General Public License v3.0

add surface-only segments to Buckeye2hayes.feature #793

Closed stannam closed 2 years ago

stannam commented 2 years ago

I noticed that several sounds in the Buckeye Corpus appear only in surface_transcription. PCT's Buckeye2hayes.feature does not contain them, so they cannot be categorized, even now that we can add variant segments to the inventory (re: #792).

[image] (NB: not categorizing 'hh' [h] and 'r' [ɹ] is expected and is not a bug.)

Presumably, 'awn,' 'ihn' and 'own' are nasalized vowels. They are not in Buckeye2hayes so far because I only added the symbols listed in the Buckeye documentation. According to that documentation, they belong to the 'Phones added/relabeled during hand labeling', which include nasalized vowels. We need to check all the Buckeye files and add the symbols that are found only in the surface transcriptions.
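A check like this could be scripted roughly as follows. This is only a minimal sketch, not PCT code: the function name `uncategorized_surface_symbols` is hypothetical, and it assumes the surface transcriptions have already been read out of the corpus files as space-delimited strings and that the symbols of the feature file are available as a set.

```python
from collections import Counter


def uncategorized_surface_symbols(surface_transcriptions, known_symbols):
    """Count surface symbols that have no entry in the feature system.

    surface_transcriptions: iterable of strings, segments separated by
        whitespace (an assumption about how the data was extracted).
    known_symbols: set of symbols defined in the feature file.
    """
    counts = Counter()
    for transcription in surface_transcriptions:
        for symbol in transcription.split():
            if symbol not in known_symbols:
                counts[symbol] += 1
    return dict(counts)


# Toy data, not taken from the corpus itself.
known = {"d", "ow", "n"}
missing = uncategorized_surface_symbols(["d ow n", "d own", "ihn"], known)
print(sorted(missing))
```

Reporting counts rather than a bare list makes it easier to tell a systematic label (many tokens) from a one-off transcription slip (a single token).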

stannam commented 2 years ago

... and perhaps CSJ too

image

Although a corpus gets created without errors, this does not look right. But I don't know how CSJ text files are formatted, so I might be importing them wrong.

The CSJ text files and csj2hayes.feature are in the Dropbox folder.

(related: #665)

kchall commented 2 years ago

Re: CSJ -- I don't think this has been read in correctly. You want to make sure that (a) you don't use comma as the default segment delimiter (which PCT often wants to do with these files) and (b) you include the multi-character sequences from "csj_digraphs.txt" (also in the Phonological_CorpusTools_Public/TRANS folder).
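To illustrate why the multi-character sequences matter, here is a minimal sketch of greedy longest-match segmentation. It is not PCT's actual parser: the function name `segment` is hypothetical, and the digraphs in the example are invented placeholders rather than the contents of "csj_digraphs.txt".

```python
def segment(text, multichar_sequences):
    """Split running text into segments, preferring the longest listed sequence.

    Characters not covered by a listed sequence become single-character
    segments; whitespace is skipped.
    """
    # Try longer sequences first so that a digraph wins over its first letter.
    by_length = sorted({s for s in multichar_sequences if s}, key=len, reverse=True)
    segments = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        match = next((s for s in by_length if text.startswith(s, i)), text[i])
        segments.append(match)
        i += len(match)
    return segments


# Placeholder digraphs for illustration only.
print(segment("kyoo sja", ["ky", "sj"]))
```

Without the digraph list, every character would be treated as its own segment, which is one way an import can succeed without errors and still produce an inventory that looks wrong.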

I don't think that there are canonical vs. surface transcriptions in the CSJ -- just one pronunciation tier in the original textgrids, called "Seg," which I think is what is pulled out into the .txt file versions of the corpus. At any rate, the .txt file versions are just running text, not even interlinear glosses, so there shouldn't be issues of pronunciation variants there.

The Buckeye corpus, though, does have this issue.

stannam commented 2 years ago

Todo:

Unexpected issues:

stannam commented 2 years ago

It's not just Ṽs: s0703a contains other surface-only symbols that are not recognized.

It seems like I really need to run through all the data. But I'll be putting this off (possibly until after the release).

image

image

kchall commented 2 years ago

Pages 22 and 23 of the manual list everything that is 'supposed' to be labeled: https://www.dropbox.com/s/57bl7ail6h6nvht/Buckeye_Corpus_manual.pdf?dl=0

Anything else I think we can just leave out. And I think that things like the above (IVER) are errors in the transcription -- it should have been labeled as `<IVER>` and marked as non-speech.

We don't actually distribute the Buckeye corpus with PCT -- just the feature system for the symbols, and so I think it's fine for it to be just the symbols they say are included. People can be in charge of cleaning up their copies of the corpus themselves (or not, and just accepting [n] feature values!).

stannam commented 2 years ago

I checked and can confirm that PCT covers everything on pages 22-23. Both errors I marked above seem to be transcription mistakes.