biotite-dev / biotite

A comprehensive library for computational molecular biology
https://www.biotite-python.org
BSD 3-Clause "New" or "Revised" License

Handle embedded quote in mmcif #619

Open 0ut0fcontrol opened 3 days ago

0ut0fcontrol commented 3 days ago

fix #570

Use three regex patterns to match the fields on a single line, so that embedded quotes in mmCIF files are handled:

  single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"
  double_quote_pattern = r'("(?:"(?! )|[^"])*")(?:\s|$)'
  unquoted_pattern = r"([^\s]+)"
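As a rough illustration of how the three patterns could work together, here is a minimal sketch. The combined `FIELD_REGEX`, the `split_fields` helper and the sample line are hypothetical and not taken from the PR diff; the actual Biotite code may combine the patterns differently.

```python
import re

single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"
double_quote_pattern = r'("(?:"(?! )|[^"])*")(?:\s|$)'
unquoted_pattern = r"([^\s]+)"

# Hypothetical combination: the quoted alternatives come first, so the
# unquoted pattern only applies when no quoted field starts at that position.
FIELD_REGEX = re.compile(
    "|".join([single_quote_pattern, double_quote_pattern, unquoted_pattern])
)


def split_fields(line):
    """Split one line into fields and strip the enclosing quotes."""
    fields = []
    for match in FIELD_REGEX.finditer(line):
        single, double, unquoted = match.groups()
        quoted = single or double
        if quoted is not None:
            # Drop only the outer quote characters; embedded quotes are kept
            fields.append(quoted[1:-1])
        else:
            fields.append(unquoted)
    return fields


fields = split_fields("ATOM 1 \"O5'\" 'a dog's life' N")
print(fields)
```

In this sketch the order of the alternatives matters: if the unquoted pattern came first, it would split a quoted field that contains spaces into several pieces.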

GPT-4's explanation of single_quote_pattern:

This regex single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)" is engineered to identify and extract substrings enclosed in single quotes from a larger text, with a particular sensitivity to handle internal apostrophes correctly. Let's dissect this expression to understand how it functions:

  1. ': This matches the opening single quote ' of the target substring.

  2. (?: ... ): This is a non-capturing group, which means it groups the contained pattern parts without storing the matched substring. This is used here mainly for grouping purposes without needing backreferences.

  3. '(?! ): This matches a single quote ' only if it is not immediately followed by a space; (?! ) is the negative lookahead assertion that performs this check. This allows the regex to match apostrophes within words (as in contractions such as don't) without treating them as the end of the quoted substring.

  4. |: The logical OR operator presents an alternative within the non-capturing group. It separates the negative lookahead for internal apostrophes from the next part of the pattern.

  5. [^']: This is a negated character class that matches any character except a single quote '. This part of the expression ensures that the regex consumes all characters within the quotes until it encounters the next single quote, which might signify the end of the quoted substring.

  6. *: This quantifier applies to the non-capturing group, allowing the contained pattern to repeat any number of times, including zero times, thus enabling the regex to match quoted substrings of any length.

  7. ': Matches the closing single quote of the substring.

  8. (?:\s|$): Another non-capturing group that operates as a condition for what follows the closing quote. It matches either:

    • \s: A whitespace character (space, tab or newline), ensuring that the quoted substring is followed by whitespace, or
    • $: The end of the string (or the end of each line if re.MULTILINE is set), allowing the quoted substring to appear at the end of the text.

The Key Points:

  • The pattern is designed to efficiently target substrings enclosed in single quotes within a larger string or document.
  • It smartly handles situations where an apostrophe is part of the enclosed text (like in contractions) without mistakenly recognizing it as the end of the quoted section.
  • By requiring the quoted substring to be followed by a space or the end of the text, it imposes a sensible boundary condition to identify discrete quoted substrings within a flow of text.
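To check the explanation above against actual behavior, the following sketch applies single_quote_pattern to a few hand-made inputs. The `first_quoted` helper and the test strings are made up for illustration. One detail it brings out: the lookahead `(?! )` tests for a literal space only, while the closing boundary accepts any whitespace via `\s`, so a quote followed by a tab is treated as embedded rather than closing. Whether that matters for real mmCIF files is left open here.

```python
import re

single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"


def first_quoted(text):
    """Return the quoted substring at the start of text, or None."""
    match = re.match(single_quote_pattern, text)
    return match.group(1) if match else None


# An apostrophe inside a word does not end the quoted field
apostrophe = first_quoted("'don't stop' next")

# A quote followed by a space ends the field
closed_early = first_quoted("'it' is' next")

# A quote followed by a tab is NOT treated as the end,
# because the lookahead only checks for a literal space
tab_case = first_quoted("'it'\tis' next")

# No closing quote: the pattern does not match at all
unterminated = first_quoted("'never closed")
```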

This regex could be particularly useful in text parsing applications where accurately distinguishing between quoted strings and regular text is crucial, such as in natural language processing tasks, data extraction, or in developing syntax highlighters for code editors.

0ut0fcontrol commented 3 days ago

@padix-key I'm sorry, I've been too busy with work and haven't had much time to delve into regex. Regex can be quite a headache.

Could you take a look at this solution? I'm not sure if the test covers all scenarios, and I'm thinking of adding more tests. Do you have any suggestions?

0ut0fcontrol commented 2 days ago

Thank you for your review. I will provide feedback as soon as possible. I will have time in the evening or during the weekend.