Enable quoted string with whitespace in CIF file 'atom_site' category.

xvlaurent commented 1 month ago

I have to work with in-house CIF files that have label_atom_id including whitespaces (which is allowed by the official mmCIF dictionary), but I found that biotite does not expect whitespaces in atom_site category to optimize parsing performance.

I worked a bit on this performance issue and I can propose a function that should handle unquoting faster than the present regex implementation. I reused the benchmark from PR #619 to compare every solutions:

run number=1000000 times timeit.timeit('test_str_split(line)', number=number, globals=locals()) = 2.673100476997206 s
run number=1000000 times timeit.timeit('test_shlex_split(line)', number=number, globals=locals()) = 33.558549724999466 s
run number=1000000 times timeit.timeit('test_re_split(line)', number=number, globals=locals()) = 17.103288242000417 s
run number=1000000 times timeit.timeit('test_partitioned_split(line)', number=number, globals=locals()) = 4.747100908996799 s

Here is the testing code:

import re
import shlex

# Declare quoting patern and pre-compile regex once
single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"
double_quote_pattern = r'("(?:"(?! )|[^"])*")(?:\s|$)'
unquoted_pattern = r"([^\s]+)"

# Combine the patterns using alternation
combined_pattern = f"{single_quote_pattern}|{double_quote_pattern}|{unquoted_pattern}"
FINAL_REGEX = re.compile(combined_pattern)

def split_line(line):
    values = line.split()
    for k in range(len(values)):
        # Remove quotes
        if (values[k][0] == '"' and values[k][-1] == '"') or (
            values[k][0] == "'" and values[k][-1] == "'"
        ):
            values[k] = values[k][1:-1]

def _split_one_line(line):
    """
    Split a line into its fields.
    Supporting embedded quotes (' or "), like `'a dog's life'` to  `a dog's life`
    """
    # Special case of multiline value, where the line starts with ';'
    if line[0] == ";":
        return [line[1:]]
    # Find all matches
    matches = FINAL_REGEX.findall(line)

    # Extract non-empty groups from the matches
    fields = []
    for match in matches:
        field = next(group for group in match if group)
        if field[0] == field[-1] == "'" or field[0] == field[-1] == '"':
            field = field[1:-1]
        fields.append(field)
    return fields

def partitioned_split_line(line: str) -> list[str]:
    # Special case of multiline value, where the line starts with ';'
    if line[0] == ";":
        return [line[1:]]

    # Initialize loop variables
    res = []
    concat_flag = False
    quote = ""

    # Loop over the line
    while line:
        # Split the line on whitespace
        word, _, line = line.partition(" ")
        # Handle the case where the word strat with a quote...
        if not concat_flag and word.startswith(("'", '"')):
            quote = word[0]
            # ... and ends with a quote as well, meaning we have a single quoted word
            if word.endswith(quote):
                res.append(word[1:-1])
                continue
            # ... and does not end with a quote, meaning we are in a multiword
            # quoted string.
            # We will need to concatenated the words until we find the closing quote
            concat_flag = True
            to_concat = word[1:]
        # Handle the case where we are in the middle of a multiword quoted string
        elif concat_flag and to_concat and quote:
            # Check if the word ends with the matching closing quote
            if word.endswith(quote):
                concat_flag = False
                to_concat += " " + word[:-1]
                quote = ""
                res.append(to_concat)
            # If not, we continue to concatenate the words
            else:
                to_concat += " " + word
        # Handle the easiest case, where we have an unquoted word
        else:
            res.append(word)
    return res

line = """ATOM 1 N PRO A 1 '1.400' "10.00" 10.000 1.00 0.00 N """

def test_re_split(line):
    _split_one_line(line)

def test_str_split(line):
    split_line(line)

def test_shlex_split(line):
    shlex.split(line)

def test_partitioned_split(line):
    partitioned_split_line(line)

if __name__ == "__main__":
    import timeit

    number = 1000000
    print(
        f"run {number=} times {timeit.timeit('test_str_split(line)', number=number, globals=locals()) = } s"
    )
    print(
        f"run {number=} times {timeit.timeit('test_shlex_split(line)', number=number, globals=locals()) = } s"
    )
    print(
        f"run {number=} times {timeit.timeit('test_re_split(line)', number=number, globals=locals()) = } s"
    )
    print(
        f"run {number=} times {timeit.timeit('test_partitioned_split(line)', number=number, globals=locals()) = } s"
    )

I copied the function _split_one_line presently used in biotite to test the perf improvement of a pre-compiled regex vs standard findall.

Would it be possible with the better perfs of partitioned_split_line to use this whitespace safe function to read atom_site blocks by default? Or at least to expose a parameter in the reader to enable it?

padix-key commented 1 month ago

Hi, this looks intriguing. It would be amazing to get rid of this rather hacky expect_whitespace parameter, with no significant performance penalty. Could you create a PR for this? Then we can check if this handles all the edge case in the test suite and how much the overall performance impact is, using the benchmarks in the CI.

xvlaurent commented 2 weeks ago

Hello!

I reworked the function to improve it and test it with Biotite's tests.

Here is the PR: https://github.com/biotite-dev/biotite/pull/686

biotite-dev / biotite

Enable quoted string with whitespace in CIF file 'atom_site' category. #673