Simon-Initiative / course-digest

Tool to produce a summary or digest of OLI course package contents
MIT License
2 stars 0 forks source link

[Bugfix] Handle mixed block and inline content in submit & compare feedback #174

Closed marc-hughes closed 1 year ago

marc-hughes commented 1 year ago

The reported bug was that there wasn't enough spacing between paragraphs here:

image

Upon investigation, I found that the content in that feedback consisted of a text node, a paragraph, and a text node. That is not valid slate-js content.

Side Note: upon opening it in authoring, the content would get normalized and fixed if the user made any change whatsoever to the page.

The feedback was being passed through ensureParagraphs in questions/common.ts which I think was meant to prevent things like this, but ensureParagraphs only catches the case where all the children are inline, not when they're partially inline and partially block elements.

In this PR, there is a new wrapInlinesWithParagraphs that handles any combination of inline & block elements, wrapping all the inline, and keeping adjacent inlines in the same paragraph, but that's only used in one very specific case. It technically fixes this bug, and in the most surgical way possible.

If we were early in the process, I would advocate for swapping out the implementation of ensureParagraphs with wrapInlinesWithParagraphs for all content.

But when I tried that, it resulted in over 14,000 changes in the chemistry course. The chances of this causing unexpected content changes is a near certainty.

A few options on where to go next:

  1. Only apply this new method as surgically as possible like this PR does as issues come up.
  2. Apply this more broadly, but selectively. Such as to all feedback, all hints, or some subset of elements.
  3. Apply it everywhere, and deal with the consequences.
marc-hughes commented 1 year ago

For reference, the input XML we get is:

<explanation>Because the atomic mass on the periodic table is very close to 39, you
                    can conclude that the most abundant of the 3 isotopes is K-39. <p>In fact the
                        percent abundances are: K-39 is 93.3%. K-41 is 6.7%. K-40 is 0.012%. </p> It
                    is common that elements have one isotope that is far more abundant than the
                    other (although not always the case). If the atomic mass on the periodic table
                    is very close to a whole number, it is safe to assume that if you rounded that
                    value to the nearest whole number you would have the mass number of the most
                    common isotope for that element. Like we see with potassium.</explanation>
normanbier commented 1 year ago

For reference, the input XML we get is:

<explanation>Because the atomic mass on the periodic table is very close to 39, you
                    can conclude that the most abundant of the 3 isotopes is K-39. <p>In fact the
                        percent abundances are: K-39 is 93.3%. K-41 is 6.7%. K-40 is 0.012%. </p> It
                    is common that elements have one isotope that is far more abundant than the
                    other (although not always the case). If the atomic mass on the periodic table
                    is very close to a whole number, it is safe to assume that if you rounded that
                    value to the nearest whole number you would have the mass number of the most
                    common isotope for that element. Like we see with potassium.</explanation>

Is it possible that the visual case that folks are complaining about is a special case of explanation (which should, but isn't, enforcing paragraphs in tag)? Or do we think this is showing up in multiple places?

Is it possible to implement your suggested fix ("If we were early in the process, I would advocate for swapping out the implementation of ensureParagraphs with wrapInlinesWithParagraphs for all content.") AFTER we ingest Chemistry? IE if this is generally the right answer, but is problematic in the chem process, should we consider that reimplementation for future ingestion?

Do we have a few examples of what the 14k changes might look like? just trying to see a few examples of potential outcomes....

marc-hughes commented 1 year ago

@normanbier Yes, it's possible. So far, all of the changes I tracked down and looked at were visually identical before/after except for these feedback items. With so many potential changes, I'm cautious on claiming that's always the case.

I have since found that we are following this approach in some other targeted cases, such as list-items and question stems. (there's a similar function wrapLooseText I had overlooked). That doesn't change the core issue, but probably lends some weight to the targeted-approach method.

We could certainly change how course-digest works for different courses over time.

eliknebel commented 1 year ago

Since normalization logic would effectively be doing the same thing as the proposed change here, would it not be better to just do it up front during course-digest so that we know what changes this might cause rather than wait until someone edits the page?

marc-hughes commented 1 year ago

@eliknebel That's definitely another thing to consider. Right now, on most chemistry pages, every every time we open any page in authoring, we're getting a whole bunch of normalization rules firing and modifying things. I just opened a sample page and hit 14 of these top-level text rules fired.

Having a full-compliant output that didn't need any additional normalization would be a great goal and prevent any of the "I edited the doc, and this weird thing happened" bugs.

andersweinstein commented 1 year ago

I first dealt with this issue in the context of question stems in chemistry, adding wrapLooseText routine and applying wrapLooseText(ensureParagraphs(<content>)). Later applied that also to fixup of list items when block content was added to list items. Because of that history, this "fixup" operates on the javascript content representation formed after common restructuring transformations of the XML DOM, and also uses the two routines rather than one. It might be more appropriate to do it as part of DOM restructuring, but that is minor -- still have to target all elements to which it must be applied. Beefing up the existing "ensureParagraphs" routine to do the "wrapLooseText" job as well seems like it would do the job.

BTW wrapLooseTextis instrumented to optionally trace every case where it makes a change, which I found very helpful when debugging, though can produce a lot of output.

andersweinstein commented 1 year ago

Looks like wrapInlinesWithParagraphs duplicates collectTextsIntoParagraphs. Shouldn't we have only one? The former uses more accurate terminology -- more general "inline" as opposed to "text" -- but I think they have exactly the same effect.

marc-hughes commented 1 year ago

@andersweinstein yes, I'm going to combine them. Didn't find wrapLooseText until my comment above mentioning it.

marc-hughes commented 1 year ago

I did an export with option 3 above, where we run this for all ensureParagraph calls. I've been looking through all the differences in two ways. I scanned through the diff of hundreds of output files, and visually compared the rendered output of dozens of them.

Overall, I only found 3 types of unexpected visual changes and fixed those three. They all followed a pattern of putting in extra line breaks around elements that should have been inline and were obvious to catch in the diff scan I did (example attached).

I have more confidence in that option now. There still is a risk I missed something in my review, but that might be our best option now.

image

image