dopefishh / pympi

A python module for processing ELAN and Praat annotation files
MIT License

Jt/py3 chat parsing #49

Closed (jackft closed this 1 year ago)

jackft commented 1 year ago

Purpose of this PR

This PR updates the eaf_from_chat function so that it

Other changes to pass tests

How it works

Testing

I tested this new script with all .cha files from the Manchester & Providence corpora.

# 36906ceeffe486c502d22f9258ef12efd00c5d99
Handle @New Episode
parsed_files: 379
parse_errors: 0
parse_timeouts: 0
n_annotations: 1605747

# 8b7b4d17ab14a2e7a098863398bbd756d50afa52
file continuation handling
parsed_files: 377
parse_errors: 2
parse_timeouts: 0
n_annotations: 1597993

# 77c4d936d21fdbb38780074ae255f12a3175d48b
handle missing timestamps
parsed_files: 270
parse_errors: 109
parse_timeouts: 0
n_annotations: 1086819

# 720aec6b8328cd9e2e721ee53c70542522dcf3c6
file codec w/python3 strings
parsed_files: 67
parse_errors: 71
parse_timeouts: 241
n_annotations: 219883

# 399581baa5828a441eb682e1f8c96827606416b9
previous HEAD
parsed_files: 0
parse_errors: 379
parse_timeouts: 0
n_annotations: 0
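
To make the progression across commits easier to read, the counts above can be turned into per-commit success rates. This is only an illustrative sketch: the `results` list and the `parse_rate` helper are hypothetical names introduced here, not part of pympi or of the test script, and the numbers are copied by hand from the listing above (hashes shortened to seven characters).

```python
# Counts copied from the results listed above, newest commit first:
# (short hash, description, parsed_files, parse_errors, parse_timeouts)
results = [
    ("36906ce", "Handle @New Episode", 379, 0, 0),
    ("8b7b4d1", "file continuation handling", 377, 2, 0),
    ("77c4d93", "handle missing timestamps", 270, 109, 0),
    ("720aec6", "file codec w/python3 strings", 67, 71, 241),
    ("399581b", "previous HEAD", 0, 379, 0),
]


def parse_rate(parsed, errors, timeouts):
    """Return the fraction of files that parsed successfully (hypothetical helper)."""
    total = parsed + errors + timeouts
    return parsed / total if total else 0.0


for commit, label, parsed, errors, timeouts in results:
    total = parsed + errors + timeouts
    rate = parse_rate(parsed, errors, timeouts)
    print(f"{commit}  {label:<30} {parsed:>3}/{total}  {rate:6.1%}")
```

Every row sums to the same 379 files, so the rates are directly comparable between commits.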

Test script:

import glob
import multiprocessing

import pympi

corpus_root = "./chat"

def main():
    parsed_files = 0
    parse_errors = 0
    parse_timeouts = 0
    n_annotations = 0

    q = multiprocessing.Queue()

    for file_path in glob.glob('{}/*.cha'.format(corpus_root)):
        print(file_path)
        p = multiprocessing.Process(target=parse, args=(q, file_path))
        p.start()
        p.join(3)
        if p.is_alive():
            # Stop the hung worker so it cannot put a stale result on the queue.
            p.terminate()
            p.join()
            parse_timeouts += 1
            continue
        parsed, annotations = q.get()
        if not parsed:
            parse_errors += 1
            continue
        parsed_files += 1
        n_annotations += annotations
    print(
        "parsed_files:", parsed_files, "parse_errors:", parse_errors,
        "parse_timeouts:", parse_timeouts, "n_annotations:", n_annotations
    )

def parse(q, file_path):
    # Report (parsed flag, annotation count) back to the parent process.
    try:
        eafob = pympi.Elan.eaf_from_chat(file_path)
        q.put((1, len(eafob.annotations)))
    except Exception:
        q.put((0, 0))

if __name__ == "__main__":
    main()

Some tests still fail, but the failures do not appear to be caused by the changes in this PR.

dopefishh commented 1 year ago

Thank you so much!