OpenGreekAndLatin / First1KGreek

XML files for the works in the First Thousand Years of Greek Project. Please see our Wiki on how to contribute.
https://opengreekandlatin.github.io/First1KGreek/
Creative Commons Attribution Share Alike 4.0 International

match pattern issue #2777

Closed lcerrato closed 6 months ago

lcerrato commented 7 months ago

On further thought, the broader issue of the dots not being escaped in matchPatterns like this:

(.+).(.+)
(.+).(.+).
(.+).(.+).(.+)
(.+).(.+).(.+).(.+)

is problematic too but there are 334 cases of these.

The following Python demonstrates the problem.

>>> re.match(r"(.+).(.+)", "11.22").groups()
('11.', '2')

which is not intended. What we want is:

>>> re.match(r"(.+)\.(.+)", "11.22").groups()
('11', '22')
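
One way the unescaped separators might be fixed in bulk is sketched below. It assumes that a bare dot directly after a closing parenthesis is always meant as a literal separator, and that dots inside a group such as (.+) are meant as wildcards. The helper name `escape_separators` is hypothetical and not part of any existing tooling.

```python
import re

# Hypothetical helper: escape a bare "." that directly follows ")",
# i.e. the separator after a capture group. Dots inside a group,
# and dots that are already escaped as "\.", are left untouched.
_SEPARATOR = re.compile(r"(?<=\))\.")

def escape_separators(pattern: str) -> str:
    return _SEPARATOR.sub(r"\\.", pattern)

fixed = escape_separators("(.+).(.+)")
print(fixed)                               # (.+)\.(.+)
print(re.match(fixed, "11.22").groups())   # ('11', '22')
```

Because already-escaped dots are skipped, running this twice over the same pattern should be harmless, but the output would still need a manual check against the actual refsDecl files.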

James

On Mon, Feb 5, 2024 at 2:26 AM James Tauber [jtauber@jtauber.com](mailto:jtauber@jtauber.com) wrote:

In working on some of the plumbing, I wrote a quick validator for refsDecls and I think I found a small number of issues with the refsDecls in First1K.

There are two kinds of problems: one is impossible matchPatterns like:

(\w+)(\w+)
(.+)(.+)
(.+)(.+)(.+)

which I think are just missing dot separators. The problem is these won't ever match multi-segment refs because the first group will capture everything.

Note that the dot separators should be escaped as \. not just . (this is a broader problem that might need to be addressed).
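
A rough sketch of how this first kind of problem could be detected, assuming the matchPatterns are simple enough that a closing parenthesis immediately followed by an opening one always means two capture groups with no separator. The function name `has_adjacent_groups` is hypothetical and is not the validator mentioned above.

```python
import re

# Hypothetical check: flag a matchPattern where one capture group is
# immediately followed by another, with nothing separating them.
# Caveat: escaped literal parentheses like "\)\(" would be flagged too.
_ADJACENT_GROUPS = re.compile(r"\)\(")

def has_adjacent_groups(pattern: str) -> bool:
    return _ADJACENT_GROUPS.search(pattern) is not None

for p in [r"(\w+)(\w+)", "(.+)(.+)", "(.+)(.+)(.+)", r"(.+)\.(.+)"]:
    print(p, has_adjacent_groups(p))

# With no separator, the greedy first group takes all but one character:
print(re.match("(.+)(.+)", "1.2").groups())   # ('1.', '2')
```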

The second is cases where the number of capture groups in the matchPattern doesn't match the number of $-replacements in the replacementPattern.
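
A minimal sketch of this second check, assuming the replacements are written as $1, $2, and so on, and that the matchPattern is simple enough for Python's `re` to compile. The function name and the sample replacementPattern are illustrative only and are not taken from any of the files listed here.

```python
import re

def groups_match_replacements(match_pattern: str, replacement_pattern: str) -> bool:
    """True if the $N references are exactly $1..$n for n capture groups."""
    n_groups = re.compile(match_pattern).groups
    refs = {int(n) for n in re.findall(r"\$(\d+)", replacement_pattern)}
    return refs == set(range(1, n_groups + 1))

# Illustrative replacementPattern (an assumption, not copied from a real file).
replacement = "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1']/tei:div[@n='$2'])"

print(groups_match_replacements(r"(\w+)\.(\w+)", replacement))          # True
print(groups_match_replacements(r"(\w+)\.(\w+)\.(\w+)", replacement))   # False
```

Comparing sets rather than counts also catches a replacementPattern that skips a number or reuses one.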

Here's a list of files with the issues:

tlg0031.tlg002.1st1K-cop1.xml impossible match pattern
tlg0065.tlg001.1st1K-grc1.xml mismatched number of capture groups and replacements
tlg0086.tlg022.1st1K-grc1.xml impossible match pattern
tlg0363.tlg001.1st1K-grc1.xml impossible match pattern
tlg0565.tlg001.1st1K-grc1.xml mismatched number of capture groups and replacements
tlg0616.tlg001.1st1K-grc1.xml mismatched number of capture groups and replacements
tlg0643.tlg001.1st1K-grc1.xml mismatched number of capture groups and replacements
tlg1205.tlg002.perseus-grc2.xml mismatched number of capture groups and replacements
tlg1216.tlg001.opp-grc1.xml mismatched number of capture groups and replacements
tlg2001.tlg043.1st1K-lat1.xml mismatched number of capture groups and replacements
tlg2018.tlg002.1st1K-eng1.xml mismatched number of capture groups and replacements
tlg2022.tlg003.1st1K-grc1.xml impossible match pattern
tlg2058.tlg001.1st1K-grc1.xml mismatched number of capture groups and replacements
tlg2200.tlg00543.opp-grc1.xml impossible match pattern
tlg3118.tlg001.1st1K-grc1.xml mismatched number of capture groups and replacements
tlg4102.tlg011.1st1K-grc1.xml mismatched number of capture groups and replacements
tlg4102.tlg046.1st1K-grc1.xml impossible match pattern
tlg5022.tlg002.1st1K-grc1.xml impossible match pattern

Note I haven't checked these in Scaife but I can't see how they'd work, at least for some references.

lcerrato commented 6 months ago

changes to div structure (note only the 1st)

tlg0065.tlg001.1st1K-grc1.xml 4:115 (4:4) **
tlg0565.tlg001.1st1K-grc1.xml 4:86 (4:86)
tlg0616.tlg001.1st1K-grc1.xml 8:333 (8:333)
tlg0643.tlg001.1st1K-grc1.xml 39:1885 (39:1885)
tlg1205.tlg002.perseus-grc2.xml 25:126 (25:126)
tlg1216.tlg001.opp-grc1.xml 21:194 (21:194)
tlg2001.tlg043.1st1K-lat1.xml 4:411 (4:411)
tlg2018.tlg002.1st1K-eng1.xml 10:264 (10:264)
tlg2058.tlg001.1st1K-grc1.xml 12:194 (12:193)
tlg3118.tlg001.1st1K-grc1.xml 1:52 (1:52)
tlg4102.tlg011.1st1K-grc1.xml 10:143:758 (10:143:758)

lcerrato commented 6 months ago

Note that some of these seem to be displaying just fine in the viewer and changes did not result in varied container counts. @jtauber