Bridgeconn / usfm-grammar

An elegant USFM parser.
https://usfmgrammar.vachanengine.org/
MIT License
36 stars 14 forks source link

Parsing the usfm file gives the output with spaces in between #76

Closed RevantCI closed 3 years ago

RevantCI commented 4 years ago

On parsing the given usfm the output json has a space before the punctuation marks. 02-GENeng-asv.zip Source of usfm file - https://ebible.org/asv/ Click eng-asv_usfm.zip

kavitharaju commented 4 years ago

I have fixed this issue in this commit.

Currently the punctuations handled are only as per the following regex.

const punctPattern = new RegExp('^[,./;:\'"`~!@#$%^&*(){}[}|]');

for those punctuation marks that are outside of this list, space may be added if they occur at a marker boundary, like before.

RevantCI commented 4 years ago

Issue is Fixed

jcuenod commented 3 years ago

I'm still having this issue when using usfm-grammar on https://git.door43.org/unfoldingWord/en_ult/ exporting as CSV with relaxed.

jcuenod commented 3 years ago

For comparison, I just tested unfolding Word's usfm-js (https://github.com/unfoldingWord/usfm-js) and it works great on the English texts I've been working with.

kavitharaju commented 3 years ago

@jcuenod I am not able to locate where the issue is occurring in the shared usfm files. Can you point out a small usfm snippet, for which you are getting the space issue around punctuations?

jcuenod commented 3 years ago

With the source file: https://git.door43.org/unfoldingWord/en_ult/src/branch/master/08-RUT.usfm

Running:

npx usfm-grammar -l relaxed 08-RUT.usfm -o csv > 08-RUT.csv

I get output like:

"RUT","1","8","Then Naomi said to her two daughters - in - law, “ Go, return, each woman to the house of her mother. May Yahweh act with you in covenant faithfulness as you have acted with the dead and with me."
jcuenod commented 3 years ago

I had a look at this for a while but I think the problem is actually the assumptions baked into your punctPattern and this:

if (punctPattern.test(text)) {
    verseTextPartial = verseTextPartial.trim();
    verseTextPartial += text;
} else {
    verseTextPartial += ` ${text}`;
}

I was giving this a bunch of thought and realised that you can't assume something about whether a space precedes/succeeds punctuation or not. It's dependent on context.

There are a number of possibilities:

  1. Periods, commas, etc. seem to be succeeded by a space and not preceded by one.
  2. Single quotes and dashes have no spaces around them. E.g. "daughter - in - law" or "Peter's"
  3. Double quotes are preceded by a space when they begin a quotation but are succeeded by one if they end it. This only holds conditionally, though. Punctuation styles vary. Some systems have quotation marks following the last punctuation mark; others have them preceding it. It's also conceivable a text might use single quotes for a direct quote.

I think the way usfm-js handles this is by reading the whitespace from the usfm files but, tbh, I didn't try to make too much sense of what's going on.

jcuenod commented 3 years ago

To be honest, I spent a while messing around with the grammarOperations in usfm-grammar so if my output doesn't match yours, it's possible I've inadvertently edited the local copy I'm running. I'm pretty sure I had these problems in Ruth, though. But if you have trouble reproducing my issue, let me know because it may well be on my end.

I am confident that the invocation I am using above is how I arrived at my initial issues. So if you can't reproduce them using that line, that would be interesting to know...

kavitharaju commented 3 years ago

@jcuenod

if you have trouble reproducing my issue, let me know because it may well be on my end.

I am able to reproduce the issue too for RUT 1:8 in the file.

I think the problem is actually the assumptions baked into your punctPattern and this:

I agree. For one, we don't have "-" listed as a punctuation in our pattern. The assumption was that "-" would be used to indicate compound words and it was not expected to be broken up into multiple \ws or milestones. The same assumption was for ' coming as in Peter's. The space addition/non-addition problem would arise only if it occurs at marker boundaries.

Even if we add - to our list of punctuations we would get daughters- in- law which is not exactly what we would need. As you said, there is no one rule, that would apply to all punctuations, not even for the same punctuation in different context.

I think the way usfm-js handles this is by reading the whitespace from the usfm files but, tbh, I didn't try to make too much sense of what's going on.

Looking into usfm-js output I think it adds a {"type":"text", "text":" "} for newlines present in the input usfm. In the case of daughters-in-law, though they are split up into multiple markers, they are all in one line. Thus usfm-js do not add extra space text. But I don't think, having new lines between markers have any such meaning as per usfm spec(pls, correct me if I am wrong).

Let us think through this and see how to handle this case the best. Any more suggestions from your end is also welcome.

joelthe1 commented 3 years ago

Since there isn't a one-size-fits-all for this problem and if baking these cases directly into the grammar is complex (and probably incorrect given varying patterns followed by different languages), we could think of a post-processing step that tries to iron-out such corner cases. The downside is another pass through the text. Relatedly, I wonder if there is a way to normalize such cases in the preprocessing step.

kavitharaju commented 3 years ago

Let me try to sum up the problem we have here

Context When we have multiple usfm components(markers) within a verse and the verse text is split across these, we extract and concatenate the verse text snippets and provide a verseText property with complete clean verse text.

Normally all punctuations or spaces coming within text are left untouched and we add a space in between while concatenating two text snippets.

The issue

While concatenating, if there is a punctuation mark at the boundaries of text snippets, adding a space may not always work.

Listing some possible cases, which are already mentioned in above discussion

  1. "Hello dear" & ", good morning" >> no space required
  2. "Hello dear." & "Good morning" >> space required
  3. "daughter" , "-", "in", "-", "law" >> no space required at both joints
  4. "He said '" & "come to me'" >> no space required
  5. "'Come to me'" & "he said" >> space required

What we do now

We only check for case 1, ie., check if second snippet starts with a listed punctuation mark, and if yes, do not add a space. Other ways add a space by default.

How to handle it

It is not possible to do any preprocessing for this as the text is split accross markers and we would know which all markers have verse texts in them and how to join them only once it is passed through grammar and grammar operations.

If we do post processing, with regex matching and replacements, there are couple of risks in it. One, we may change some punctautions which are coming within continuous text. It would be better to leave them as it is, as usages may vary across languages as we have already mentioned. Two, it may not be possible to capture context like in cases 4 and 5 with a regex match, as one quote may span across verses and it may not be possible to detect if current quote indicates a start or end without understanding the text.

So I fear trying to fix these minor cases we may introduce more issues.

kavitharaju commented 3 years ago

How about we do this?


const punctPattern1 = new RegExp('^[,.-—/;:!?@$%^)}\]>”»]'); //no space before
const punctPattern2 = new RegExp('[-—/`@^&({[<“«]$'); // no space after
// both lists exculde ', ", *, &, #, ~, |, +, _ 

if (punctPattern1.test(text) || punctPattern2.test(verseTextPartial)) {
    verseTextPartial += text;
} else {
    verseTextPartial += ` ${text}`;
}

And leave the rest of the cases to default, ie., add a space

jcuenod commented 3 years ago

(Disclaimer: I realise I have no say in this so I offer my perspective completely willing to have it rejected without interaction.)

My experience with usfm-js has been that it, although I have had some issues that have required extra attention related to their data structure and some tagging that leaks into the output, the library works well with texts including those from other sources (like https://ebible.org mentioned in the OP).

It sounds as though the spec doesn't give clear guidance on how to handle whitespace. The decision in usfm-grammar has been to make judgements based on punctuation matching. The decision in usfm-js was to add whitespace at newlines (which has some precedent in the way browsers parse html). The fact that this decision seems to function better than punctuation matching and neither follows the spec per se inclines me to follow the unfoldingWord approach.

The main reason I am offering this suggestion is that, @kavitharaju, your punctuation suggestion above explicitly doesn't account for a number of types of punctuation that some texts contain (single and double quotes are the ones I'm thinking of).

joelthe1 commented 3 years ago

@jcuenod I think I like your suggestion. This shouldn't break anything else. @kavitharaju, what do you think?

kavitharaju commented 3 years ago

Just to clarify, you are suggesting that we check the whitespace usage in the usfm file to determine how to concatenate the parts of verseText. If there is newline between the markers, in text we add space and if not, join them without space. Is this correct?

kavitharaju commented 3 years ago

We have a whitespace normalization step in preprocessing. We trim all texts(not just verseText) in grammar operations. And in rules we usually do not consider newLines and spaceChars. So changing this might require changes through the code. I will have to try it out and see how much we will have to alter and how much the changes will affect the output. But we can try that.

joelthe1 commented 3 years ago

@jcuenod, after looking into this we decided to go with the route @kavitharaju suggested above. The main reason is that the grammar is written from the ground-up to not look at the positioning of the tags and whitespaces/newlines between them. This makes sense as it more closely aligns with the spec. This may still, though, pose a problem with white spaces between text in the future which we will document. I will leave this issue open until we do that.

Aside from that, we are thinking of looking into an alternate back-end for USFM-Grammar that allows partial parses and faster output. This would be one of the things we would pay more attention to at that time.

jcuenod commented 3 years ago

Sounds good, thanks for taking a look and giving the feedback!

joelthe1 commented 3 years ago

This is now merged and released in 2.0.0! You can find the note about this in the Whitespace inconsistancy section of the Disclaimer.md.