This PR updates the eaf_from_chat function so that it:
opens the .cha file with the appropriate codec (instead of decoding each line as a string);
handles missing annotation timestamps (these occur in some .cha files in PhonBank);
handles continuation lines, e.g. *CHI transcript lines that are broken across multiple lines.
Other changes needed to pass the tests:
The extract function probably should have been using the midpoint of an annotation's start and end times.
The test for the extract function did not treat the annotation start time as inclusive (I believe the function is intended to be inclusive).
GitHub's actions/setup-python@v2 no longer supports Python 3.5 or 3.6.
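To illustrate the midpoint remark about the extract function, here is a minimal sketch of one possible reading: an annotation is kept when the midpoint of its start and end times falls inside the requested window, with both window bounds treated as inclusive. The function name select_by_midpoint and the (start, end, value) tuple layout are hypothetical and are not taken from the library.

```python
def select_by_midpoint(annotations, start, end):
    """Keep annotations whose midpoint lies in [start, end], bounds inclusive.

    `annotations` is assumed to be an iterable of (start_ms, end_ms, value)
    tuples; this layout is illustrative only.
    """
    selected = []
    for ann_start, ann_end, value in annotations:
        midpoint = (ann_start + ann_end) / 2
        # Inclusive comparison on both sides of the window.
        if start <= midpoint <= end:
            selected.append((ann_start, ann_end, value))
    return selected
```

With a midpoint rule, an annotation that only partly overlaps the window is either kept whole or dropped whole, so it never needs to be truncated.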
How it works
If no codec is supplied, we check the file for a @UTF8 directive to determine the codec. We then open the file with whichever codec has been resolved.
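The codec resolution described above might look roughly like the following sketch. The helper name resolve_codec, the choice to inspect only the first line, and the latin-1 fallback are all assumptions made for illustration; the PR text does not say what happens when no directive is found.

```python
def resolve_codec(path, codec=None, fallback="latin-1"):
    """Return the codec to open a .cha file with (hypothetical helper)."""
    # An explicitly supplied codec always wins.
    if codec is not None:
        return codec
    # Read the first line as raw bytes so nothing has to be decoded yet.
    with open(path, "rb") as handle:
        first_line = handle.readline()
    # Drop an optional UTF-8 byte order mark before looking for the directive.
    if first_line.startswith(b"\xef\xbb\xbf"):
        first_line = first_line[3:]
    if first_line.strip() == b"@UTF8":
        return "utf-8"
    # Assumed fallback; not specified in the PR description.
    return fallback
```

The file can then be opened once with `open(path, encoding=resolve_codec(path, codec))` rather than decoding each line separately.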
We read all the lines into memory and join each continuation line to the line it continues.
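A minimal sketch of the joining step, assuming the CHAT convention that a continuation line starts with a tab character. The function name join_continuation_lines is hypothetical, and joining with a single space is an illustrative choice.

```python
def join_continuation_lines(lines):
    """Merge continuation lines (those starting with a tab) into the line above."""
    merged = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip lines that are empty or contain only whitespace.
        if not line.strip():
            continue
        if line.startswith("\t") and merged:
            # Continuation: append its text to the previous logical line.
            merged[-1] = merged[-1] + " " + line.strip()
        else:
            merged.append(line)
    return merged
```

After this step, each *CHI or dependent-tier entry occupies exactly one list element, so later parsing can work line by line.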
If a main transcript line does not include a timestamp, we reuse the previous annotation's timestamp (because it is proximal). If there is no previous annotation, we use the times (0, 1).
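The fallback rule above can be sketched as follows. The sketch assumes timestamps appear as start_end millisecond pairs delimited by the \x15 control character; the regular expression and the name resolve_times are illustrative and may differ from what the function actually uses.

```python
import re

# Assumed bullet format: \x15<start_ms>_<end_ms>\x15
BULLET = re.compile(r"\x15(\d+)_(\d+)\x15")


def resolve_times(line, previous=None):
    """Return (start, end) for a main transcript line (hypothetical helper)."""
    match = BULLET.search(line)
    if match:
        return int(match.group(1)), int(match.group(2))
    # No timestamp on this line: reuse the previous annotation's times.
    if previous is not None:
        return previous
    # No previous annotation either: fall back to (0, 1).
    return (0, 1)
```

A caller would carry the last result forward, for example `previous = resolve_times(line, previous)` inside the loop over main transcript lines.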
Testing
I tested this new script with all .cha files from the Manchester & Providence corpora.
Some tests appear to fail, but the failures do not seem to be caused by the changes made in this commit.