Open xvlaurent opened 1 month ago
Hi, this looks intriguing. It would be amazing to get rid of the rather hacky `expect_whitespace` parameter with no significant performance penalty. Could you create a PR for this? Then we can check whether it handles all the edge cases in the test suite and how large the overall performance impact is, using the benchmarks in the CI.
Hello!
I reworked the function to improve it and tested it against Biotite's test suite.
Here is the PR: https://github.com/biotite-dev/biotite/pull/686
I have to work with in-house CIF files whose label_atom_id values include whitespace (which is allowed by the official mmCIF dictionary), but I found that biotite assumes there is no whitespace in the atom_site category in order to optimize parsing performance.
I worked a bit on this performance issue and can propose a function that should handle unquoting faster than the present regex implementation. I reused the benchmark from PR #619 to compare all the solutions:
Here is the testing code:
I copied the function `_split_one_line` presently used in biotite to test the performance improvement of a pre-compiled regex vs. the standard `findall`.

Given the better performance of `partitioned_split_line`, would it be possible to use this whitespace-safe function to read atom_site blocks by default? Or at least to expose a parameter in the reader to enable it?
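
For readers without the PR at hand, here is a minimal sketch of what whitespace-safe splitting of a single CIF line can look like without a regex. It is not the `partitioned_split_line` from the PR: the name `split_quoted_line` is hypothetical, and the sketch only assumes the usual CIF quoting rule that a closing quote counts only when it is followed by whitespace or the end of the line.

```python
def split_quoted_line(line):
    """Split one CIF line into values, keeping quoted values intact.

    Hypothetical illustration only, not the function from the PR.
    """
    values = []
    rest = line.strip()
    while rest:
        quote = rest[0]
        if quote in ("'", '"'):
            # Search for the closing quote: it must be followed by
            # whitespace or be the last character of the line.
            end = 1
            while True:
                end = rest.find(quote, end)
                if end == -1:
                    raise ValueError("unterminated quoted value")
                if end + 1 == len(rest) or rest[end + 1].isspace():
                    break
                end += 1
            values.append(rest[1:end])
            rest = rest[end + 1:].lstrip()
        else:
            # Unquoted value: everything up to the next whitespace run.
            parts = rest.split(None, 1)
            values.append(parts[0])
            rest = parts[1] if len(parts) == 2 else ""
    return values


# A label_atom_id containing a space, and one containing a quote character
print(split_quoted_line("""ATOM 1 'C 1' "O5'" N"""))
```

An unquoted value such as `O5'` passes through unchanged, because quotes are only interpreted at the start of a value.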