Closed marc-hughes closed 1 year ago
For reference, the input XML we get is:
<explanation>Because the atomic mass on the periodic table is very close to 39, you
can conclude that the most abundant of the 3 isotopes is K-39. <p>In fact the
percent abundances are: K-39 is 93.3%. K-41 is 6.7%. K-40 is 0.012%. </p> It
is common that elements have one isotope that is far more abundant than the
other (although not always the case). If the atomic mass on the periodic table
is very close to a whole number, it is safe to assume that if you rounded that
value to the nearest whole number you would have the mass number of the most
common isotope for that element. Like we see with potassium.</explanation>
For reference, the input XML we get is:
<explanation>Because the atomic mass on the periodic table is very close to 39, you can conclude that the most abundant of the 3 isotopes is K-39. <p>In fact the percent abundances are: K-39 is 93.3%. K-41 is 6.7%. K-40 is 0.012%. </p> It is common that elements have one isotope that is far more abundant than the other (although not always the case). If the atomic mass on the periodic table is very close to a whole number, it is safe to assume that if you rounded that value to the nearest whole number you would have the mass number of the most common isotope for that element. Like we see with potassium.</explanation>
Is it possible that the visual case that folks are complaining about is a special case of explanation (which should, but isn't, enforcing paragraphs in tag)? Or do we think this is showing up in multiple places?
Is it possible to implement your suggested fix ("If we were early in the process, I would advocate for swapping out the implementation of ensureParagraphs with wrapInlinesWithParagraphs for all content.") AFTER we ingest Chemistry? IE if this is generally the right answer, but is problematic in the chem process, should we consider that reimplementation for future ingestion?
Do we have a few examples of what the 14k changes might look like? just trying to see a few examples of potential outcomes....
@normanbier Yes, it's possible. So far, all of the changes I tracked down and looked at were visually identical before/after except for these feedback items. With so many potential changes, I'm cautious on claiming that's always the case.
I have since found that we are following this approach in some other targeted cases, such as list-items and question stems. (there's a similar function wrapLooseText
I had overlooked). That doesn't change the core issue, but probably lends some weight to the targeted-approach method.
We could certainly change how course-digest works for different courses over time.
Since normalization logic would effectively be doing the same thing as the proposed change here, would it not be better to just do it up front during course-digest so that we know what changes this might cause rather than wait until someone edits the page?
@eliknebel That's definitely another thing to consider. Right now, on most chemistry pages, every every time we open any page in authoring, we're getting a whole bunch of normalization rules firing and modifying things. I just opened a sample page and hit 14 of these top-level text rules fired.
Having a full-compliant output that didn't need any additional normalization would be a great goal and prevent any of the "I edited the doc, and this weird thing happened" bugs.
I first dealt with this issue in the context of question stems in chemistry, adding wrapLooseText
routine and applying wrapLooseText(ensureParagraphs(<content>))
. Later applied that also to fixup of list items when block content was added to list items. Because of that history, this "fixup" operates on the javascript content representation formed after common restructuring transformations of the XML DOM, and also uses the two routines rather than one. It might be more appropriate to do it as part of DOM restructuring, but that is minor -- still have to target all elements to which it must be applied. Beefing up the existing "ensureParagraphs" routine to do the "wrapLooseText" job as well seems like it would do the job.
BTW wrapLooseText
is instrumented to optionally trace every case where it makes a change, which I found very helpful when debugging, though can produce a lot of output.
Looks like wrapInlinesWithParagraphs
duplicates collectTextsIntoParagraphs
. Shouldn't we have only one? The former uses more accurate terminology -- more general "inline" as opposed to "text" -- but I think they have exactly the same effect.
@andersweinstein yes, I'm going to combine them. Didn't find wrapLooseText until my comment above mentioning it.
I did an export with option 3 above, where we run this for all ensureParagraph calls. I've been looking through all the differences in two ways. I scanned through the diff of hundreds of output files, and visually compared the rendered output of dozens of them.
Overall, I only found 3 types of unexpected visual changes and fixed those three. They all followed a pattern of putting in extra line breaks around elements that should have been inline and were obvious to catch in the diff scan I did (example attached).
I have more confidence in that option now. There still is a risk I missed something in my review, but that might be our best option now.
The reported bug was that there wasn't enough spacing between paragraphs here:
Upon investigation, I found that the content in that feedback consisted of a text node, a paragraph, and a text node. That is not valid slate-js content.
Side Note: upon opening it in authoring, the content would get normalized and fixed if the user made any change whatsoever to the page.
The feedback was being passed through
ensureParagraphs
in questions/common.ts which I think was meant to prevent things like this, butensureParagraphs
only catches the case where all the children are inline, not when they're partially inline and partially block elements.In this PR, there is a new
wrapInlinesWithParagraphs
that handles any combination of inline & block elements, wrapping all the inline, and keeping adjacent inlines in the same paragraph, but that's only used in one very specific case. It technically fixes this bug, and in the most surgical way possible.If we were early in the process, I would advocate for swapping out the implementation of ensureParagraphs with wrapInlinesWithParagraphs for all content.
But when I tried that, it resulted in over 14,000 changes in the chemistry course. The chances of this causing unexpected content changes is a near certainty.
A few options on where to go next: