add surface-only segments to Buckeye2hayes.feature

PhonologicalCorpusTools / CorpusTools

Phonological CorpusTools

http://phonologicalcorpustools.github.io/CorpusTools/

GNU General Public License v3.0

add surface-only segments to Buckeye2hayes.feature #793

Closed stannam closed 2 years ago

stannam commented 2 years ago

I noticed that several sounds in the Buckeye Corpus appear only in surface_transcription. PCT's Buckeye2hayes.feature does not contain them, so they cannot be categorized, even now that we can add variant segments to the inventory (re: #792).

(NB: not categorizing 'hh' [h] and 'r' [ɹ] is expected and not a bug.)

Presumably, 'awn', 'ihn' and 'own' are nasalized vowels. They are not included in Buckeye2hayes so far because I only added symbols found in the Buckeye documentation. The documentation lists them under 'Phones added/relabeled during hand labeling', and nasalized vowels are among them. We need to check all the Buckeye files and add the symbols that occur only in the surface transcriptions.
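A rough sketch of that check, assuming the surface labels have already been read out of the corpus files and the symbols out of the feature file (the reading step is left out, and `surface_only_symbols` is a made-up helper name, not part of PCT). The data below is toy data, and ignoring 'SIL' is only an example of skipping a non-speech label:

```python
def surface_only_symbols(surface_labels, feature_symbols, ignore=frozenset()):
    """Return labels seen in surface transcriptions but absent from the feature file."""
    seen = set(surface_labels)
    return sorted(seen - set(feature_symbols) - set(ignore))


# Toy data, not real corpus contents.
surface = ["aw", "awn", "n", "ihn", "ih", "own", "SIL"]
features = ["aw", "ih", "n", "ow"]  # hypothetical subset of Buckeye2hayes.feature
missing = surface_only_symbols(surface, features, ignore={"SIL"})
print(missing)  # ['awn', 'ihn', 'own']
```

Running this over every file in the corpus would give the full list of symbols to consider adding.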

stannam commented 2 years ago

... and perhaps CSJ too

Although a corpus gets created without errors, this does not look right. But I don't know how CSJ text files are formatted, so I might be importing them wrong.

CSJ text files and csj2hayes.feature are in the dropbox folder.

textfiles are in Phonological_CorpusTools_Public/example_files/CSJ_sample_corrected/CSJ_text_sample
feature file is in Phonological_CorpusTools_Public/TRANS

(related: #665)

kchall commented 2 years ago

Re: CSJ -- I don't think this has been read in correctly. You want to make sure that (a) you don't use comma as the default segment delimiter (which PCT often wants to do with these files) and (b) you include the multi-character sequences from "csj_digraphs.txt" (also in the Phonological_CorpusTools_Public/TRANS folder).
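To illustrate point (b), here is a minimal sketch of longest-match segmentation over running text. It is not PCT's actual loader, and the entries in `digraphs` are hypothetical stand-ins for whatever "csj_digraphs.txt" really lists. No delimiter is used at all, so a comma would simply come out as its own symbol rather than splitting anything:

```python
def segment(text, multichar):
    """Split text into segments, preferring the longest listed multi-character sequence."""
    # Longest sequences first so that a trigraph wins over a digraph it contains.
    by_length = sorted({s for s in multichar if s}, key=len, reverse=True)
    segments, i = [], 0
    while i < len(text):
        for seq in by_length:
            if text.startswith(seq, i):
                segments.append(seq)
                i += len(seq)
                break
        else:
            # No listed sequence starts here: fall back to a single character.
            segments.append(text[i])
            i += 1
    return segments


digraphs = ["ky", "sh", "a:"]  # hypothetical entries
print(segment("kyosha:", digraphs))  # ['ky', 'o', 'sh', 'a:']
```

Without the digraph list, the same string would fall apart into seven single characters, which is the kind of inventory that "does not look right".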

I don't think that there are canonical vs. surface transcriptions in the CSJ -- just one pronunciation tier in the original textgrids, called "Seg," which I think is what is pulled out into the .txt file versions of the corpus. At any rate, the .txt file versions are just running text, not even interlinear glosses, so there shouldn't be issues of pronunciation variants there.

The Buckeye corpus, though, does have this issue.

stannam commented 2 years ago

Todo:

[x] In the Buckeye feature system, attach 'n' to every vowel symbol to form its nasalized counterpart (e.g., 'aw' + 'n' -> 'awn'). The Vn segments should inherit all feature values from the plain vowel except for [nasal], which should be [+nasal].
[x] Do the same for Buckeye2spe.feature
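The rule in the first item can be sketched roughly as below. This assumes the feature system has been loaded into a dict mapping each symbol to its feature values. The function name, the feature name "nasal", and the toy values are all assumptions for illustration, not the contents of the real .feature files:

```python
def add_nasalized_vowels(features, vowels, nasal_feature="nasal", suffix="n"):
    """Return a copy of `features` extended with a nasalized twin for each vowel."""
    extended = {symbol: dict(values) for symbol, values in features.items()}
    for vowel in vowels:
        nasalized = vowel + suffix
        if nasalized in extended:
            continue  # never overwrite a symbol that is already defined
        values = dict(features[vowel])  # inherit every value from the plain vowel
        values[nasal_feature] = "+"     # ...except [nasal], which becomes +
        extended[nasalized] = values
    return extended


# Toy feature table with made-up values.
toy = {
    "aw": {"syllabic": "+", "nasal": "-", "round": "+"},
    "ih": {"syllabic": "+", "nasal": "-", "round": "-"},
    "n": {"syllabic": "-", "nasal": "+", "round": "-"},
}
result = add_nasalized_vowels(toy, vowels=["aw", "ih"])
print(result["awn"])  # {'syllabic': '+', 'nasal': '+', 'round': '+'}
```

The existing-symbol check matters because a vowel plus 'n' could in principle collide with a symbol that already means something else.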

Unexpected issues:

[ɚ̃]: ipa2spe.feature does not contain [ɚ], so we cannot add its nasalized version.
- If I recall correctly, the SPE book does not discuss [ɚ], so we simply omit it from Buckeye2spe.feature. ([ɚ] and [ɚ̃] are in Buckeye2hayes.feature, by the way.)
- This does not raise any errors but I think we can say something in the documentation about [ɚ].

stannam commented 2 years ago

Not just Ṽs: s0703a contains other surface-only symbols that are not recognized.

It seems like I really need to run through all data. But I'll be putting this off (possibly until after the release).

kchall commented 2 years ago

p. 22 and 23 of the manual have all the things that are 'supposed' to be labeled: https://www.dropbox.com/s/57bl7ail6h6nvht/Buckeye_Corpus_manual.pdf?dl=0

Anything else I think we can just leave out. And I think that things like the above (IVER) are errors in the transcription -- it should have been labeled as IVER and marked as non-speech.

We don't actually distribute the Buckeye corpus with PCT -- just the feature system for the symbols, and so I think it's fine for it to be just the symbols they say are included. People can be in charge of cleaning up their copies of the corpus themselves (or not, and just accepting [n] feature values!).

stannam commented 2 years ago

I checked and can confirm that PCT covers everything on pages 22-23. Both of the problems I flagged above seem to come from mistakes in the corpus files.

The IVER in 'which' arose because the IVER entry is missing from s0703a.words where it is needed (between lines 610 and 611).
As for 'being,' 'e' should be omitted from the surface transcription 'eng'...?