Jt/py3 chat parsing

Purpose of this PR

This PR updates the eaf_from_chat function so that it:

opens the .cha file with the appropriate codec (instead of decoding each line as a string)
handles missing annotation timestamps (these occur in some .cha files in PhonBank)
handles continuation lines, e.g., *CHI transcripts that are broken across multiple lines

Other changes to pass tests

The extract function probably should have been using the midpoint of an annotation's start and end times.
The test for the extract function was not inclusive of the annotation start time (I believe the function is intended to be inclusive).
GitHub's actions/setup-python@v2 no longer supports Python 3.5 or 3.6.
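
As a rough illustration of the first two points, the sketch below keeps annotations whose midpoint falls inside a window with both bounds included. It is one possible reading of the intended behaviour, not pympi's actual extract implementation: the function name annotations_in_window and the (start, end, value) tuple layout are hypothetical and exist only for this example.

```python
def annotations_in_window(annotations, start, end):
    """Keep annotations whose midpoint falls within [start, end], bounds included.

    `annotations` is assumed to be an iterable of (start_ms, end_ms, value)
    tuples; this layout is hypothetical and chosen for the example only.
    """
    selected = []
    for ann_start, ann_end, value in annotations:
        midpoint = (ann_start + ann_end) / 2
        # Inclusive comparison on both sides of the window
        if start <= midpoint <= end:
            selected.append((ann_start, ann_end, value))
    return selected
```

Using the midpoint means an annotation that straddles a window boundary is assigned to exactly one side, instead of being picked up or dropped depending on which endpoint is tested.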

How it works

If no codec is supplied, we check whether the file contains a @UTF8 directive. We then open the file with whatever codec the function has resolved.
We read all the lines into memory and concatenate continuation lines.
If a main transcript line does not include a timestamp, we use the previous annotation's timestamp (because it is proximal). If there is no previous annotation, we use the times (0, 1).
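
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the helper names (resolve_codec, join_continuations, utterance_times) are hypothetical, the latin-1 fallback is an arbitrary choice for the example, and the sketch assumes that continuation lines start with a tab and that timestamps appear as start_end between \x15 control characters.

```python
import re

# Assumption: a timestamp looks like \x15<start>_<end>\x15, in milliseconds.
BULLET = re.compile(r"\x15(\d+)_(\d+)\x15")


def resolve_codec(path, codec=None, fallback="latin-1"):
    """Return the caller's codec, else utf-8 if an @UTF8 header is present.

    The fallback codec is an assumption made for this sketch.
    """
    if codec is not None:
        return codec
    with open(path, "rb") as f:
        head = f.read(64)
    return "utf-8" if b"@UTF8" in head else fallback


def join_continuations(lines):
    """Append tab-indented continuation lines to the line they continue."""
    joined = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("\t") and joined:
            joined[-1] += " " + line.lstrip("\t")
        elif line:
            joined.append(line)
    return joined


def utterance_times(line, previous=None):
    """Return (start, end) for a main line, reusing the previous times if absent."""
    match = BULLET.search(line)
    if match:
        return int(match.group(1)), int(match.group(2))
    # No timestamp: fall back to the previous annotation, or (0, 1) at file start
    return previous if previous is not None else (0, 1)
```

In this sketch a caller would walk the joined lines in order and pass the last returned times as `previous`, so an untimed utterance inherits the times of the nearest preceding one.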

Testing

I tested this new script with all .cha files from the Manchester & Providence corpora.

# 36906ceeffe486c502d22f9258ef12efd00c5d99
Handle @New Episode
parsed_files: 379
parse_errors: 0
parse_timeouts: 0
n_annotations: 1605747

# 8b7b4d17ab14a2e7a098863398bbd756d50afa52
file continuation handling
parsed_files: 377
parse_errors: 2
parse_timeouts: 0
n_annotations: 1597993

# 77c4d936d21fdbb38780074ae255f12a3175d48b
handle missing timestamps
parsed_files: 270
parse_errors: 109
parse_timeouts: 0
n_annotations: 1086819

# 720aec6b8328cd9e2e721ee53c70542522dcf3c6
file codec w/python3 strings
parsed_files: 67
parse_errors: 71
parse_timeouts: 241
n_annotations: 219883

# 399581baa5828a441eb682e1f8c96827606416b9
previous HEAD
parsed_files: 0
parse_errors: 379
parse_timeouts: 0
n_annotations: 0

Test script:

import glob
import multiprocessing

import pympi

corpus_root = "./chat"

def main():
    parsed_files = 0
    parse_errors = 0
    parse_timeouts = 0
    n_annotations = 0

    q = multiprocessing.Queue()

    for file_path in glob.glob('{}/*.cha'.format(corpus_root)):
        print(file_path)
        p = multiprocessing.Process(target=parse, args=(q, file_path))
        p.start()
        p.join(3)
        if p.is_alive():
            # Stop the stuck worker so it cannot push a late result onto the queue
            p.terminate()
            p.join()
            parse_timeouts += 1
            continue
        parsed, annotations = q.get()
        if not parsed:
            parse_errors += 1
            continue
        parsed_files += 1
        n_annotations += annotations
    print(
        "parsed_files:", parsed_files, "parse_errors:", parse_errors,
        "parse_timeouts:", parse_timeouts, "n_annotations:", n_annotations
    )

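def parse(q, file_path):
    # Worker: report (parsed_ok, annotation_count) back to the parent process
    try:
        eafob = pympi.Elan.eaf_from_chat(file_path)
        q.put((1, len(eafob.annotations)))
    except Exception:
        # Any parser failure counts as a parse error
        q.put((0, 0))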

if __name__ == "__main__":
    main()

There appear to be some tests which do not pass, but this does not seem to be the result of changes made in this commit.

dopefishh / pympi

Jt/py3 chat parsing #49
