Phonological CorpusTools
http://phonologicalcorpustools.github.io/CorpusTools/
GNU General Public License v3.0

inventory of corpus with pronunciation variants #792

Open kchall opened 2 years ago

kchall commented 2 years ago

The inventory of a corpus is always based on the canonical pronunciations, not the full set of sounds in ANY pronunciation variants. So e.g. if your English canonical pronunciations contain /t/ and /d/, but your variants contain [ɾ], you can't search for [ɾ] or include it in your analyses. We need to allow inventories to be built from:

  1. the canonical forms only ('phonological' inventory)
  2. the variants only ('phonetic' inventory)
  3. a merged inventory of both ('total' inventory)

stannam commented 2 years ago

Suggested UI below. (If there are no variants, grey out the 'phonetic' and 'total' tabs.)

[image: suggested UI mock-up]

  1. Internally, there should be 9 (= 3 × 3) tables, one of which becomes visible according to the tab setting.
  2. Question: is 'total' needed? (Or is the 'phonological' / 'phonetic' separation needed at all? If not, the vertical tabs are not necessary.)
  3. We need to revisit every place where the 'inventory check' is done and decide whether the phonetic or the phonological inventory should be used.
  4. Other issues?

Internally...

We already have inventory categorization functionality, so I think we can apply the same functionality to the variant pronunciations to create the 'phonetic' table. (I'm being optimistic.) As for the 'total' table, we could combine 'phonological' and 'phonetic' and remove duplicates.
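To make the three-inventory idea concrete, here is a minimal sketch in plain Python. It does not use PCT's actual classes: the function name `build_inventories` and the `(canonical, variants)` word representation are assumptions made purely for illustration. The 'total' inventory is just the set union of the other two, which removes duplicates automatically.

```python
def build_inventories(words):
    """Collect segment symbols into three inventories.

    `words` is a hypothetical representation: an iterable of
    (canonical, variants) pairs, where `canonical` is a list of segment
    symbols and `variants` is a list of such lists.
    """
    phonological = set()  # symbols found in canonical forms
    phonetic = set()      # symbols found in pronunciation variants
    for canonical, variants in words:
        phonological.update(canonical)
        for variant in variants:
            phonetic.update(variant)
    return {
        "phonological": phonological,
        "phonetic": phonetic,
        # Set union merges both inventories and drops duplicates.
        "total": phonological | phonetic,
    }


# Toy data modelled on the 'writing' / 'riding' example in this issue.
words = [
    (["ɹ", "aɪ", "t", "ɪ", "ŋ"], [["ɹ", "aɪ", "ɾ", "ɪ", "ŋ"]]),  # writing
    (["ɹ", "aɪ", "d", "ɪ", "ŋ"], [["ɹ", "aɪ", "ɾ", "ɪ", "ŋ"]]),  # riding
]
inventories = build_inventories(words)
```

With this toy data, [ɾ] ends up in the 'phonetic' and 'total' inventories but not in the 'phonological' one, which is the distinction the three tabs would expose.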

(And what is this? Seems like alternative inventories were tried in 2016?) https://github.com/PhonologicalCorpusTools/CorpusTools/blob/f1ba66511bdbbdb0659720a0d57143e5d3e76a05/corpustools/corpus/classes/lexicon.py#L2759-L2782

stannam commented 2 years ago

Also see #560, especially https://github.com/PhonologicalCorpusTools/CorpusTools/issues/560#issuecomment-229464376

Inventory charts in the main window will always display the default transcription's inventory, and only the default inventory's table should be open for editing. In an analysis window, users should be able to see (but not edit) the inventory of alternative transcriptions. This can be accomplished by simply feeding the alternative inventory to the default inventory table's sort function. That is, we sort the alternative inventory based on the default inventory's row/column settings. Some alternative segments (possibly all of them) will end up in the uncategorized tab, but there's not much to be done for that right now. I think this can be a relatively quick fix.

So it seems to be intentional that the main inventory chart contains only the default segments. Should the alternative inventory chart be read-only but accessible from the analysis functions?
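A rough sketch of the approach described in the quoted comment (sorting an alternative inventory by the default inventory's row/column settings) might look like the following. This is not PCT's sort function: `sort_by_default_layout` and the `default_layout` dictionary are hypothetical stand-ins, assuming the default chart can be reduced to a mapping from segment to a (row, column) cell.

```python
def sort_by_default_layout(segments, default_layout):
    """Place segments into the cells defined by the default inventory.

    `default_layout` is an assumed mapping from a segment symbol to its
    (row, column) cell in the default chart. Segments without a cell are
    returned separately as uncategorized.
    """
    categorized = {}
    uncategorized = []
    for seg in sorted(segments):
        cell = default_layout.get(seg)
        if cell is None:
            # No row/column setting exists for this segment.
            uncategorized.append(seg)
        else:
            categorized.setdefault(cell, []).append(seg)
    return categorized, uncategorized


default_layout = {"t": ("stop", "alveolar"), "d": ("stop", "alveolar")}
categorized, uncategorized = sort_by_default_layout({"t", "ɾ"}, default_layout)
```

Here [t] lands in the stop/alveolar cell, while [ɾ], which has no cell in the default layout, falls into the uncategorized list, matching the caveat in the quoted comment.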

stannam commented 2 years ago

I've added two .txt files to Dropbox. Find them in Phonological_CorpusTools_Public/example_files/variants/variants in inventory

kchall commented 2 years ago

Our current interim solution does show all symbols (phonetic or phonological) in a 'master' inventory table. Searches based on these symbols return 0, but there is a note to that effect in the search dialogue box.

However, there are two further problems: (1) All analyses have the same issue as searches (e.g., if you try to calculate functional load based on minimal pairs for [t] / [ɾ] in the above corpora ('writing' vs. 'riding'), the result is 0). This is likely to be true of ALL analyses. (2) If you pull up the 'corpus summary' inventory and click on a symbol that happens to occur only in phonetic variants (e.g. [ɾ] or [kʰ] in the above corpora), PCT crashes outright with no error message (instead of giving either the actual type / token count or 0).
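For problem (2), the desired fallback (return a count of 0 rather than crash) could be sketched as below. This is illustrative only and not PCT's summary code: `count_symbol` and the `(transcription, frequency)` word representation are assumptions, and "type count" is taken here to mean the number of words whose transcription contains the symbol, with "token count" as the sum of those words' frequencies. The frequencies are made up.

```python
def count_symbol(words, symbol):
    """Return (type_count, token_count) for `symbol`.

    `words` is a hypothetical list of (transcription, frequency) pairs,
    where `transcription` is a list of segment symbols. A symbol that
    never occurs yields (0, 0) instead of raising an error.
    """
    type_count = 0
    token_count = 0
    for transcription, frequency in words:
        if symbol in transcription:
            type_count += 1
            token_count += frequency
    return type_count, token_count


# Canonical forms only, with made-up frequencies.
canonical_words = [
    (["ɹ", "aɪ", "t", "ɪ", "ŋ"], 10),  # writing
    (["ɹ", "aɪ", "d", "ɪ", "ŋ"], 5),   # riding
]
t_counts = count_symbol(canonical_words, "t")
flap_counts = count_symbol(canonical_words, "ɾ")
```

Because the function only iterates and tests membership, a symbol that occurs only in variants simply produces zero counts against the canonical forms.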

Given these issues, I actually think we should 'roll back' the commit that added symbols that appear only in pronunciation variants to the total inventory, and simply clarify in the documentation that currently, only canonical pronunciations are used to populate the inventory and hence can be used in searches and analyses.

@stannam maybe we could add instead a note on the 'corpus summary' dialogue box that says "Note that this inventory is based on only the symbols that occur in canonical pronunciations. PCT does not include symbols from pronunciation variants in the inventory, and such symbols cannot currently be directly searched for or used in analyses."

Thanks, and sorry for the hassle! :(

kchall commented 2 years ago

Hmm. I see that the corpus summary window is updated with the suggested note. But, it looks like PCT is still pulling in the inventory from pronunciation variants, so it's not quite rolled back. E.g.:

Load 'variant_inventory_ilg' corpus. The symbol [ɾ] appears only in the pronunciation variants of 'riding' and 'writing,' not in their canonical forms. Go to Corpus > Summary. [ɾ] appears in the inventory (and it shouldn't). Clicking on it causes PCT to crash.

If instead you go to Corpus > Phonological search, again, [ɾ] occurs in the inventory; searching for it returns a count of 0.

(And basically the same thing happens in variant_inventory_csv, though of course there the [ɾ] is in the phonetic transcription column, not stored as a pronunciation variant.)

stannam commented 2 years ago

That is strange. On my end, the chart only contains canonical segments in the summary window and other places including analysis functions and Features > Manage inventory chart.

Can you try to load the .txt files again? I used the two files in example_files/variants/variants in inventory.

kchall commented 2 years ago

Ah! Yes...again, silly on my part. I was reloading the existing corpora instead of creating them from scratch. Looks good, thank you.