Regex problem with Single-quote replacement

FrankensteinVariorum / fv-web

new web front-end for the Frankenstein Variorum project, working with Astro & React

https://frankensteinvariorum.org/

GNU Affero General Public License v3.0

Regex problem with Single-quote replacement #13

Closed ebeshero closed 10 months ago

ebeshero commented 11 months ago

We see this on C08_app78 with the regex replacement around ['i', 'feel']. We recognized that there is a problem with the ECMAScript regex search-and-replace pattern in the seg.tsx file.

We propose that the regex pattern is too complicated because it relies on negative lookahead and negative lookbehind. Instead we should search on these simple positive patterns only:

Search: ['
- Replace with: ["
Search: ',\s'
- Replace with: " "
Search: ']
- Replace with: "]
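
For reference, a minimal sketch of the three positive replacements listed above, assuming the input is a Python-style list string such as ['i', 'feel']. The function name `simpleQuoteSwap` is hypothetical and is not taken from seg.tsx:

```typescript
// Hypothetical helper for illustration only; not the actual seg.tsx code.
function simpleQuoteSwap(s: string): string {
  return s
    .replace(/\['/g, '["')     // [' becomes ["
    .replace(/',\s'/g, '" "')  // ', ' becomes " " (exactly as proposed above)
    .replace(/'\]/g, '"]');    // '] becomes "]
}

console.log(simpleQuoteSwap("['i', 'feel']")); // ["i" "feel"]
```

Note that the second replacement, as written in the proposal, drops the comma between tokens. If the result has to stay a comma-separated array, the replacement string would need to be `", "` instead. Apostrophes inside a token, as in Coleridge's, are left alone because none of the three patterns match them.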

@Yuying-Jin

Yuying-Jin commented 11 months ago

Solved by: n?.replace(/%q%/g, '\\"').replace(/([\[\]\s,<>])'/g, '$1"').replace(/'([\[\]\s<>,])/g, '"$1')
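
To illustrate how this chain behaves, here is a sketch that wraps it in a function. The name `normalizeQuotes` and the sample inputs are assumptions made for the example; only the three `.replace` calls are copied from the comment above:

```typescript
// Hypothetical wrapper; the three .replace calls are the ones quoted above.
function normalizeQuotes(n?: string): string | undefined {
  return n
    // Turn the %q% placeholder into an escaped double quote: \"
    ?.replace(/%q%/g, '\\"')
    // A single quote that FOLLOWS [, ], whitespace, comma, < or > becomes "
    .replace(/([\[\]\s,<>])'/g, '$1"')
    // A single quote that PRECEDES [, ], whitespace, <, > or comma becomes "
    .replace(/'([\[\]\s<>,])/g, '"$1');
}

console.log(normalizeQuotes("['i', 'feel']"));
// ["i", "feel"]
console.log(normalizeQuotes("['Coleridge's', 'poem']"));
// ["Coleridge's", "poem"]
console.log(normalizeQuotes("['%q%Ancient', 'Mariner.%q%']"));
// ["\"Ancient", "Mariner.\""]
```

Only single quotes that sit next to a delimiter are converted, so an apostrophe between letters survives. One caveat: an apostrophe at the end of a word inside a token (for example a plural possessive followed by a space) would still be matched by the last pattern.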

ebeshero commented 11 months ago

We think we fixed lots of these now, but we're concerned about double-quote replacement not happening properly in the normalized tokens representing <longToken> passages, such as __Coleridge's "Ancient Mariner" in MS C10:

<longToken><metamark>*</metamark> <anchor xml:id="c56-0049.01"/> 
<zone type="left_margin" sID="c56-0049__left_margin"/> 
<metamark>_______________</metamark> 
<milestone spanTo="#c56-0049.03" unit="tei:note"/> 
<metamark>*</metamark>Coleridge's "Ancient Mariner."  
<anchor xml:id="c56-0049.03"/>  
<zone eID="c56-0049__left_margin"/></longToken>

Python script replacement of double quotes currently is: normalized = re.sub(r'(“|”|")', '%q%', normalized)

ebeshero commented 11 months ago

NOTE: inside longToken, when we have <note resp="MWS">, the attribute-value quotation marks are properly replaced by %q%. So why is it NOT working when the quotation marks are in the flattened text node of a <longToken> passage?

Yuying-Jin commented 10 months ago

The problem comes from the post-processing XSLT file in the collationWorkspace repo. We had been replacing &quot; with " to fix a problem with &. Now we replace &quot; with %q% instead. https://github.com/FrankensteinVariorum/collationWorkspace/blob/dab9a224cd2f86fcab29e3836429b93b0a5fc6e6/xslt/postProcessing.xsl#L67