PhonologicalCorpusTools / CorpusTools

Phonological CorpusTools
http://phonologicalcorpustools.github.io/CorpusTools/
GNU General Public License v3.0

[BUG] cannot load textgrid #798

Closed stannam closed 2 years ago

stannam commented 2 years ago

Describe the bug

Loading a folder:

Traceback (most recent call last):
  File "corpustools\gui\iogui.py", line 95, in run
  File "corpustools\corpus\io\pct_textgrid.py", line 418, in load_directory_textgrid
  File "corpustools\corpus\io\pct_textgrid.py", line 335, in load_discourse_textgrid
  File "corpustools\corpus\io\pct_textgrid.py", line 206, in textgrid_to_data
TypeError: 'NoneType' object is not iterable

Loading a single textgrid file:

Traceback (most recent call last):
  File "corpustools\gui\iogui.py", line 97, in run
  File "corpustools\corpus\io\pct_textgrid.py", line 335, in load_discourse_textgrid
  File "corpustools\corpus\io\pct_textgrid.py", line 206, in textgrid_to_data
TypeError: 'NoneType' object is not iterable

Sample corpus file

s33m44m6.txt, or a folder of such TextGrid files

To Reproduce

  1. Go to 'Import corpus'
  2. Click on 'Textgrid' tab -> choose directory -> select a directory of textgrid files OR
  3. Click on 'Textgrid' tab -> choose file
  4. Click Ok
  5. See error

Expected behavior

A corpus should be created.

Operating system and PCT version

Additional context

stannam commented 2 years ago

I think I figured out the issue.

The Python 'textgrid' package we use requires a tier label that exists in the original .TextGrid file. The error arises when PCT passes a name that is not one of the original tier names.

If the user changes the 'name' of the transcription tier, the function cannot find the right tier in the original file. For example, if the transcription tier in the original TextGrid file is labeled 'pronunciation', the user has to rename it to 'transcription'; since the original file has no tier with the 'transcription' label, PCT raises an error. For an independent reason, we have made it mandatory for a transcription tier to be named 'Transcription'.
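
The failure mode described here can be sketched in a few lines. This is a minimal illustration, not PCT's actual code: `find_tier` and `find_tier_or_raise` are hypothetical helpers, and tiers are modelled as plain dicts. It assumes the tier lookup returns None when no tier has the requested name, which is consistent with the "'NoneType' object is not iterable" error in the traceback.

```python
def find_tier(tiers, name):
    """Return the intervals of the first tier with this name, or None."""
    for tier in tiers:
        if tier["name"] == name:
            return tier["intervals"]
    return None


def find_tier_or_raise(tiers, name):
    """Same lookup, but fail early with a message listing the real tier names."""
    intervals = find_tier(tiers, name)
    if intervals is None:
        available = ", ".join(t["name"] for t in tiers)
        raise KeyError(f"No tier named {name!r}; available tiers: {available}")
    return intervals


# Hypothetical TextGrid whose transcription tier is labelled 'pronunciation'.
tiers = [
    {"name": "word", "intervals": ["sarah"]},
    {"name": "pronunciation", "intervals": ["s", "E", "r", "@"]},
]

# Asking for the renamed tier yields None, and iterating over None
# raises the same class of error as in the report.
try:
    for interval in find_tier(tiers, "transcription"):
        pass
except TypeError as exc:
    error_message = str(exc)
```

Checking the lookup result before iterating, as `find_tier_or_raise` does, would turn the opaque TypeError into a message that names the tiers actually present in the file.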

kchall commented 2 years ago

Hmm, I'm still getting an error when trying to import the WebMaus sample TextGrids. I first tried by keeping the transcription tier labelled 'KAN' and got the usual error saying that it needs to be called 'Transcription.' Once I changed it, I got a different error:

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/iogui.py", line 97, in run
    corpus = load_discourse_textgrid(**self.kwargs)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/corpus/io/pct_textgrid.py", line 355, in load_discourse_textgrid
    discourse.lexicon.specifier = modernize.modernize_specifier(discourse.lexicon.specifier)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/modernize.py", line 110, in modernize_specifier
    features = sorted(list(specifier.matrix[seg].keys()))
AttributeError: 'Segment' object has no attribute 'keys'

stannam commented 2 years ago

That is strange.

I successfully imported the WebMAUS files in the public dropbox folder (example_files/TextGrid_sample). I tested on 'WebMAUS_English_story_123.TextGrid' and 'WebMAUS_English_story_123_renamedtiers.TextGrid'. As for the settings in the dialog window, I just followed the guide in readme.xlsx (also in the same folder).

kchall commented 2 years ago

I'm still having issues with this:

  1. I'm on the current master branch, trying to import 'WebMAUS_English_story_123.TextGrid'. I started without changing any names, using settings that I think should allow pronunciation variants:
image

Unsurprisingly, I get an error about the transcription tier name:

image
  2. So then I tried everything the same, but just changed KAN to Transcription:
image

Not surprisingly, I get an error about unknown symbols:

image

But, the good news is that the corpus will be created and then load! But the pronunciation variants aren’t properly accounted for: e.g. 'Sarah' has 11 tokens, and at least one of them is not canonical (missing the first [s]), but this is not shown:

image
  3. If I use the recommended parsing settings in the readme.xlsx file (not sure where these came from) -- the key difference is in the treatment of the second transcription tier as being an ignored "notes" tier:
image

Then I can't get the corpus to load at all, and I get the error message I posted last time:

Traceback (most recent call last):
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/iogui.py", line 97, in run
    corpus = load_discourse_textgrid(**self.kwargs)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/corpus/io/pct_textgrid.py", line 355, in load_discourse_textgrid
    discourse.lexicon.specifier = modernize.modernize_specifier(discourse.lexicon.specifier)
  File "/Users/KCH/Desktop/CorpusTools/corpustools/gui/modernize.py", line 110, in modernize_specifier
    features = sorted(list(specifier.matrix[seg].keys()))
AttributeError: 'Segment' object has no attribute 'keys'
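
The last frame of this traceback calls `.keys()` on whatever `specifier.matrix[seg]` returns, so the error means that lookup produced a `Segment` object rather than a dict. Below is a minimal sketch of that mismatch, not PCT's actual classes: the `Segment` stand-in, its `features` attribute, and the `feature_names` helper are all assumptions made for illustration.

```python
class Segment:
    """Hypothetical stand-in for a segment object that keeps features in a dict."""

    def __init__(self, symbol, features):
        self.symbol = symbol
        self.features = features


def feature_names(entry):
    """Return sorted feature names for a dict entry or a Segment-like entry."""
    if hasattr(entry, "keys"):
        return sorted(entry.keys())
    # Assumption: object-style entries expose their features as a dict attribute.
    return sorted(entry.features.keys())


# A matrix that mixes both styles of entry (invented data).
matrix = {
    "s": {"voice": "-", "cont": "+"},
    "a": Segment("a", {"voice": "+", "low": "+"}),
}

# Calling .keys() directly on the object-style entry reproduces the error class.
try:
    sorted(list(matrix["a"].keys()))
except AttributeError as exc:
    error_message = str(exc)
```

Whether a feature matrix ends up holding dicts or objects presumably depends on how and when the feature file was created, which may be why the same steps behave differently on two machines.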


So, I think there are at least two separate issues:

  1. Pronunciation variants aren't being read correctly if transcriptions are allowed to vary within a lexical item. @stannam can you see if this is a problem on your end as well?
  2. Surprisingly, I am unable to use the recommended settings in the readme.xlsx file, which would just ignore the second transcription tier entirely. That isn't what we want to do to get the right transcriptions in this case, but it's not good that I can't do it if I want to. This seems to be a difference between Mac and PC?

stannam commented 2 years ago

I repeated your steps, and strangely (?) I cannot replicate either of these issues on Windows or on my virtual Mac machine (Monterey, version 12.2).

I get two variant forms for 'Sarah'. image image

And the corpus is created successfully with the MAU tier ignored. image

One difference I noticed is that the symbols warning message from my end does not contain 3, a, e, and u. image

I am running the most recent code, i.e., the version dated 2022-02-21 in the 'master' branch. Could you double check if you are synced to the latest version? image

kchall commented 2 years ago

This seems to have been an issue with the feature file -- it works if I use the sampa2hayes feature file, and now that I've updated my arpabet2hayes file, it also "works" with that insofar as I get the expected message about missing characters, but then the corpus still gets created and loads with pronunciation variants.
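
One way to see which characters would trigger the "missing characters" message is to compare the symbols used in the transcriptions against those defined in the feature file. The sketch below assumes transcriptions are already split into lists of segments and that the feature inventory is available as a set of symbols; `missing_symbols` is a hypothetical helper, not a PCT function, and the sample data are invented.

```python
def missing_symbols(transcriptions, feature_inventory):
    """Return the sorted symbols that occur in transcriptions but have no features."""
    used = set()
    for segments in transcriptions:
        used.update(segments)
    return sorted(used - set(feature_inventory))


# Invented example: two pronunciation variants and a small inventory.
transcriptions = [["s", "E", "r", "@"], ["E", "r", "@"]]
inventory = {"s", "r", "@"}

missing = missing_symbols(transcriptions, inventory)
```

Running a check like this against two feature files side by side would show whether one of them covers symbols the other lacks.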