slow matchfile parsing due to duplicate lines check

huispaty commented 1 year ago

Parsing match files takes much longer than in previous versions because of the duplicate lines check in importmatch.py. The attached test file (converted to .txt) takes more than 3 hours to load: beethoven_op026_mv3.txt

manoskary commented 1 year ago

I wrote the duplicate line check, but that does not sound reasonable. Are you sure the issue comes from the duplicate line check? I tried loading without the duplicate line validation and it still takes a long time (P.S. I didn't wait 3 hours, though).
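
One way to settle where the time goes is to profile the load rather than compare wall-clock runs. The following is a minimal sketch using the standard library's `cProfile`. The helper `profile_call` and the workload `slow_duplicate_check` are hypothetical stand-ins (the latter only imitates a list-based duplicate scan and is not the code in importmatch.py); in practice you would pass the real loading function instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical stand-in workload, NOT the actual partitura implementation.
def slow_duplicate_check(lines):
    seen = []
    for line in lines:
        # Membership test on a list scans it linearly, so the loop is quadratic.
        if line not in seen:
            seen.append(line)
    return seen


sample_lines = [f"line {i % 500}" for i in range(2000)]
result, report = profile_call(slow_duplicate_check, sample_lines)
print(report)
```

The function that dominates the cumulative-time column of the report is the one worth looking at first.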

huispaty commented 1 year ago

Yes, I'm sure the issue stems from that part. Without this check, the file takes ~27 s to load (it is quite large). With the check, it takes more than 3.5 hours. I noticed this mainly because I was working on different branches, one of which had not yet been merged into develop (and thus main) and did not yet include the duplicate line check. My proposed solution looks like this:

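    # Read the whole file as a list of lines (without trailing newlines)
    with open(filename) as f:
        raw_lines = f.read().splitlines()

    # The first line determines the match file version
    version = get_version(raw_lines[0])

    # Pick the parsing methods that correspond to the file version
    from_matchline_methods = FROM_MATCHLINE_METHODSV1
    if version < Version(1, 0, 0):
        from_matchline_methods = FROM_MATCHLINE_METHODSV0

    # Drop exact duplicate lines before parsing
    # (note: set() does not preserve the original line order)
    raw_lines = list(set(raw_lines))
    parsed_lines = [
        parse_matchline(line, from_matchline_methods, version) for line in raw_lines
    ]

    # Discard lines that could not be parsed
    parsed_lines = [pl for pl in parsed_lines if pl is not None]

    mf = MatchFile(lines=parsed_lines)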

Using this approach, the same file takes ~25 s to load. This currently lives only on my local branch; I have not pushed it yet because I would first like to address some other open issues that also relate to match file importing.
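
If any later step relies on the lines staying in file order (an assumption, the thread does not say whether it matters), the deduplication can be made order-preserving at the same cost. This is a minimal sketch, not the merged fix: `dedupe_preserving_order` is a hypothetical helper name, and the sample strings are placeholders rather than real match file lines.

```python
def dedupe_preserving_order(lines):
    """Drop exact duplicate lines, keeping the first occurrence in original order."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7),
    # so this runs in roughly linear time, like set(), but without shuffling lines.
    return list(dict.fromkeys(lines))


# Placeholder lines for illustration only
raw_lines = ["header", "line a", "line b", "line a", "header", "line c"]
unique_lines = dedupe_preserving_order(raw_lines)
print(unique_lines)
```

In the snippet above, this would replace the `list(set(raw_lines))` step; everything else stays the same.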

CPJKU / partitura

slow matchfile parsing due to duplicate lines check #306