Closed RevantCI closed 3 years ago
I have fixed this issue in this commit.
Currently the punctuations handled are only as per the following regex.
const punctPattern = new RegExp('^[,./;:\'"`~!@#$%^&*(){}[}|]');
for those punctuation marks that are outside of this list, space may be added if they occur at a marker boundary, like before.
Issue is Fixed
I'm still having this issue when using usfm-grammar
on https://git.door43.org/unfoldingWord/en_ult/ exporting as CSV with relaxed
.
For comparison, I just tested unfolding Word's usfm-js
(https://github.com/unfoldingWord/usfm-js) and it works great on the English texts I've been working with.
@jcuenod I am not able to locate where the issue is occurring in the shared usfm files. Can you point out a small usfm snippet, for which you are getting the space issue around punctuations?
With the source file: https://git.door43.org/unfoldingWord/en_ult/src/branch/master/08-RUT.usfm
Running:
npx usfm-grammar -l relaxed 08-RUT.usfm -o csv > 08-RUT.csv
I get output like:
"RUT","1","8","Then Naomi said to her two daughters - in - law, “ Go, return, each woman to the house of her mother. May Yahweh act with you in covenant faithfulness as you have acted with the dead and with me."
I had a look at this for a while but I think the problem is actually the assumptions baked into your punctPattern
and this:
if (punctPattern.test(text)) {
verseTextPartial = verseTextPartial.trim();
verseTextPartial += text;
} else {
verseTextPartial += ` ${text}`;
}
I was giving this a bunch of thought and realised that you can't assume something about whether a space precedes/succeeds punctuation or not. It's dependent on context.
There are a number of possibilities:
I think the way usfm-js
handles this is by reading the whitespace from the usfm files but, tbh, I didn't try to make too much sense of what's going on.
To be honest, I spent a while messing around with the grammarOperations
in usfm-grammar
so if my output doesn't match yours, it's possible I've inadvertently edited the local copy I'm running. I'm pretty sure I had these problems in Ruth, though. But if you have trouble reproducing my issue, let me know because it may well be on my end.
I am confident that the invocation I am using above is how I arrived at my initial issues. So if you can't reproduce them using that line, that would be interesting to know...
@jcuenod
if you have trouble reproducing my issue, let me know because it may well be on my end.
I am able to reproduce the issue too for RUT 1:8 in the file.
I think the problem is actually the assumptions baked into your punctPattern and this:
I agree. For one, we don't have "-" listed as a punctuation in our pattern. The assumption was that "-" would be used to indicate compound words and it was not expected to be broken up into multiple \w
s or milestones. The same assumption was for '
coming as in Peter's
. The space addition/non-addition problem would arise only if it occurs at marker boundaries.
Even if we add -
to our list of punctuations we would get daughters- in- law
which is not exactly what we would need. As you said, there is no one rule, that would apply to all punctuations, not even for the same punctuation in different context.
I think the way usfm-js handles this is by reading the whitespace from the usfm files but, tbh, I didn't try to make too much sense of what's going on.
Looking into usfm-js output I think it adds a {"type":"text", "text":" "}
for newlines present in the input usfm. In the case of daughters-in-law, though they are split up into multiple markers, they are all in one line. Thus usfm-js do not add extra space text. But I don't think, having new lines between markers have any such meaning as per usfm spec(pls, correct me if I am wrong).
Let us think through this and see how to handle this case the best. Any more suggestions from your end is also welcome.
Since there isn't a one-size-fits-all for this problem and if baking these cases directly into the grammar is complex (and probably incorrect given varying patterns followed by different languages), we could think of a post-processing step that tries to iron-out such corner cases. The downside is another pass through the text. Relatedly, I wonder if there is a way to normalize such cases in the preprocessing step.
Let me try to sum up the problem we have here
Context
When we have multiple usfm components(markers) within a verse and the verse text is split across these, we extract and concatenate the verse text snippets and provide a verseText
property with complete clean verse text.
Normally all punctuations or spaces coming within text are left untouched and we add a space in between while concatenating two text snippets.
The issue
While concatenating, if there is a punctuation mark at the boundaries of text snippets, adding a space may not always work.
Listing some possible cases, which are already mentioned in above discussion
What we do now
We only check for case 1, ie., check if second snippet starts with a listed punctuation mark, and if yes, do not add a space. Other ways add a space by default.
How to handle it
It is not possible to do any preprocessing for this as the text is split accross markers and we would know which all markers have verse texts in them and how to join them only once it is passed through grammar and grammar operations.
If we do post processing, with regex matching and replacements, there are couple of risks in it. One, we may change some punctautions which are coming within continuous text. It would be better to leave them as it is, as usages may vary across languages as we have already mentioned. Two, it may not be possible to capture context like in cases 4 and 5 with a regex match, as one quote may span across verses and it may not be possible to detect if current quote indicates a start or end without understanding the text.
So I fear trying to fix these minor cases we may introduce more issues.
How about we do this?
const punctPattern1 = new RegExp('^[,.-—/;:!?@$%^)}\]>”»]'); //no space before
const punctPattern2 = new RegExp('[-—/`@^&({[<“«]$'); // no space after
// both lists exculde ', ", *, &, #, ~, |, +, _
if (punctPattern1.test(text) || punctPattern2.test(verseTextPartial)) {
verseTextPartial += text;
} else {
verseTextPartial += ` ${text}`;
}
And leave the rest of the cases to default, ie., add a space
(Disclaimer: I realise I have no say in this so I offer my perspective completely willing to have it rejected without interaction.)
My experience with usfm-js
has been that it, although I have had some issues that have required extra attention related to their data structure and some tagging that leaks into the output, the library works well with texts including those from other sources (like https://ebible.org mentioned in the OP).
It sounds as though the spec doesn't give clear guidance on how to handle whitespace. The decision in usfm-grammar
has been to make judgements based on punctuation matching. The decision in usfm-js
was to add whitespace at newlines (which has some precedent in the way browsers parse html). The fact that this decision seems to function better than punctuation matching and neither follows the spec per se inclines me to follow the unfoldingWord approach.
The main reason I am offering this suggestion is that, @kavitharaju, your punctuation suggestion above explicitly doesn't account for a number of types of punctuation that some texts contain (single and double quotes are the ones I'm thinking of).
@jcuenod I think I like your suggestion. This shouldn't break anything else. @kavitharaju, what do you think?
Just to clarify, you are suggesting that we check the whitespace usage in the usfm file to determine how to concatenate the parts of verseText. If there is newline between the markers, in text we add space and if not, join them without space. Is this correct?
We have a whitespace normalization step in preprocessing. We trim all texts(not just verseText) in grammar operations. And in rules we usually do not consider newLine
s and spaceCha
rs. So changing this might require changes through the code. I will have to try it out and see how much we will have to alter and how much the changes will affect the output. But we can try that.
@jcuenod, after looking into this we decided to go with the route @kavitharaju suggested above. The main reason is that the grammar is written from the ground-up to not look at the positioning of the tags and whitespaces/newlines between them. This makes sense as it more closely aligns with the spec. This may still, though, pose a problem with white spaces between text in the future which we will document. I will leave this issue open until we do that.
Aside from that, we are thinking of looking into an alternate back-end for USFM-Grammar that allows partial parses and faster output. This would be one of the things we would pay more attention to at that time.
Sounds good, thanks for taking a look and giving the feedback!
This is now merged and released in 2.0.0! You can find the note about this in the Whitespace inconsistancy
section of the Disclaimer.md.
On parsing the given usfm the output json has a space before the punctuation marks. 02-GENeng-asv.zip Source of usfm file - https://ebible.org/asv/ Click eng-asv_usfm.zip