WikiEducationFoundation / WikiEduDashboard

Wiki Education Foundation's Wikipedia course dashboard system
https://dashboard.wikiedu.org
MIT License

Article Viewer displays raw wikicode for "Health effects of electronic cigarettes" #5957

Open ragesoss opened 1 month ago

ragesoss commented 1 month ago

Visit here and wait for the authorship highlighting to load: https://dashboard.wikiedu.org/courses/UCSF/Foundations_II_(Summer_2024)/articles/edited?showArticle=52260526

Once it loads, the rendered article is replaced by highlighted wikicode:

Screenshot from 2024-09-10 12-12-23

Additional context

The ArticleViewer initially loads the parsed version of the current article and requests the authorship data from the wikiwho server. Once received, the wikiwho data (which is annotated wikicode) is processed by Dashboard code to add CSS classes on a per-author basis, and is then sent to mediawiki to parse. No explicit errors occur in this example, either in the JS console or in network requests, but the call to the mediawiki API parse action returns unparsed wikicode. One possible explanation is that the Dashboard code that operates on the wikiwho data is mishandling some particular aspect of this page's wikicode, resulting in a version that mediawiki can't parse properly.
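For anyone reproducing this outside the Dashboard, here is a minimal sketch of the last step (sending annotated wikicode to the parse action). The helper name `buildParseRequest` and the English Wikipedia endpoint are assumptions for illustration; the Dashboard's actual request may use different parameters.

```javascript
// Hypothetical helper, not the Dashboard's code: builds a POST request for
// the MediaWiki Action API "parse" action from a string of wikicode.
// The default endpoint is an assumption (English Wikipedia).
function buildParseRequest(wikicode, apiUrl = 'https://en.wikipedia.org/w/api.php') {
  const body = new URLSearchParams({
    action: 'parse',          // parse wikitext and return HTML
    format: 'json',
    contentmodel: 'wikitext', // treat the `text` parameter as wikitext
    prop: 'text',             // only the rendered HTML is needed
    text: wikicode,
  });
  return { url: apiUrl, options: { method: 'POST', body } };
}

// Usage (the network call is left to the reader):
// const { url, options } = buildParseRequest(annotatedWikicode);
// const response = await fetch(url, options);
```

Sending the same annotated wikicode by hand makes it possible to compare the parse result for an affected and an unaffected article without going through the ArticleViewer.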

ragesoss commented 2 weeks ago

Here's another example: https://dashboard.wikiedu.org/courses/UCSD/Introduction_to_Policy_Analysis_-_Summer_Session24_(Summer)/articles/edited?showArticle=1007667

empty-codes commented 1 week ago

Hello @ragesoss, I would like to try working on this!

ragesoss commented 1 week ago

@empty-codes go for it. This one may be a challenge, as I'm not sure which codebase is ultimately responsible for the error. There's almost certainly something about the wikicode for these example articles that is triggering the bug, so just knowing precisely what triggers it would be helpful.

Abishekcs commented 1 week ago

Hi @ragesoss and @empty-codes,

I hope you're doing well! I wanted to share an observation I made while looking into this bug earlier today.

It seems that for the articles where this issue occurs most frequently, WikiWho is rendering the sup tag as shown below:

a) Opening tag: &lt;ref&gt; (where it should be <sup>)
b) Closing tag: &lt;/ref&gt; (where it should be </sup>)
c) Self-closing tag: &lt;ref /&gt; (where it should be <sup />)

For example, in the actual HTML output, this appears as: <ref />.

Interestingly, in articles where this bug does not occur, WikiWho outputs the <sup> tag correctly as <sup>.
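A quick way to check a given WhoColor response for this pattern is to count the escaped ref tags against the real sup elements. This is only a rough heuristic sketch; `countRefMarkers` is a hypothetical helper name, and it assumes the HTML is available as a string.

```javascript
// Hypothetical diagnostic helper: counts HTML-escaped ref tags
// ("&lt;ref&gt;", "&lt;/ref&gt;", "&lt;ref name=... /&gt;") and real <sup> elements.
function countRefMarkers(html) {
  // \b keeps "&lt;references" from being counted as a ref tag.
  const escapedRefs = (html.match(/&lt;\/?ref\b/g) || []).length;
  const supTags = (html.match(/<sup\b/g) || []).length;
  return { escapedRefs, supTags };
}
```

A response with many escaped ref tags and few or no sup elements would match what is described above.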

Additionally, I've noticed that the bug tends to happen in articles where contributors have also added the citations. However, I did encounter at least one article where a citation didn't cause this issue, though I can't recall which article it was.

Please treat these as initial thoughts, as I'm still trying to understand how the WikiWho algorithm works. I hope my explanation was clear.

Good luck with solving the bug @empty-codes!

empty-codes commented 1 week ago

@Abishekcs Thank you so much for your insights; they were really helpful for getting started.
I still have not solved the bug, but these are my findings so far, @ragesoss (sorry, it's a bit long):

Firstly, I found that there are actually two different APIs involved:

  1. The WikiWho API, which parses historical revisions of Wikipedia articles and provides detailed provenance for each token (word): who added, removed, or reintroduced it across revisions.

  2. The WhoColor API, which is built on top of the WikiWho API and visualizes authorship data by color-coding tokens in the text based on their original authors. Wiki Education Foundation uses this to show authorship data on its Dashboard for students.

For this issue, the WhoColor API is the one we're concerned with.

Flow:

  1. Initially, the ArticleViewer component loads the parsed version of the article:

    <div id="article-scrollbox-id" className="article-scrollbox">
     {
       fetched ? <ParsedArticle highlightedHtml={highlightedHtml} whocolorHtml={whoColorHtml} parsedArticle={parsedArticle} /> : <Loading />
     }
    </div>

    The ParsedArticle component is defined in ParsedArticle.jsx:

    export const ParsedArticle = ({ highlightedHtml, whocolorHtml, parsedArticle }) => {
     let articleHTML = highlightedHtml || whocolorHtml || parsedArticle;

    The ParsedArticle component accepts highlightedHtml, whocolorHtml, and parsedArticle as props and displays one of them based on what is available.

  2. It then fetches authorship data from the WikiWho server.

  3. Once the authorship data is available, it replaces the initially rendered parsed article with the highlighted HTML (from whoColorHtml).
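The fallback order in step 1 explains why the article looks fine at first and then breaks. A minimal sketch of the same `||` chain, pulled out into a standalone function (`pickArticleHtml` is a hypothetical name, not something in the Dashboard code):

```javascript
// Same precedence as the `highlightedHtml || whocolorHtml || parsedArticle`
// line in ParsedArticle.jsx, extracted here only for illustration.
function pickArticleHtml({ highlightedHtml, whocolorHtml, parsedArticle }) {
  return highlightedHtml || whocolorHtml || parsedArticle;
}

// Before the authorship data arrives, only parsedArticle is set:
pickArticleHtml({ parsedArticle: '<p>rendered</p>' }); // '<p>rendered</p>'

// As soon as whocolorHtml is any non-empty string, it wins,
// even if its content is unparsed wikicode:
pickArticleHtml({ whocolorHtml: '{{cite journal}}', parsedArticle: '<p>rendered</p>' }); // '{{cite journal}}'
```

Because the chain only tests for truthiness, a broken WhoColor result replaces a good parsed article with no error raised.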

Here's the difference between the three props:

@Abishekcs identified that the problem seems to occur because WikiWho is incorrectly outputting <ref> tags instead of <sup> tags. This misrepresentation leads to HTML errors when the browser attempts to render the content since <ref> is not a recognized standard HTML tag.

Note that the MediaWiki parse action returns a parse.text property that correctly contains all the <sup> tags for both the affected and unaffected articles, which is why the page renders fine initially using the parsedArticle prop/state variable. So the problem is likely not from the MediaWiki API.

The fact that some articles display correctly while others do not suggests that there may be inconsistencies in how the WhoColor API processes certain revisions of articles.

So the questions are:

  1. Is it the whoColorHtml that replaces parts of the parsedArticle that are meant to be <sup> with <ref>, or is it the highlightedHtml that replaces the <sup> with <ref>?

  2. What exactly is it about the affected articles that is triggering the bug?

If it is the whoColorHtml replacing it, it is likely an issue with the WhoColor API; if it is the highlightedHtml, it is likely an issue with the highlightAuthors function logic in ArticleViewer.jsx.

Additionally, @Abishekcs also noticed that the bug tends to happen in articles where contributors have also added citations.

In this specific article: UCSF Foundations II, in the parse action response, there were parse warnings:

parsewarnings[ 
  "Script warning: <span style=\"color:#3a3\">One or more <code style=\"color: inherit; background: inherit; border: none; padding: inherit;\">&#123;{[[Template:cite journal|cite journal]]}}</code> templates have maintenance messages</span>; messages may be hidden ([[Help:CS1_errors#Controlling_error_message_display|help]])."
]

Since the bug appears more frequently in articles with numerous citations, it’s possible that these templates are not being parsed correctly. However, these parse warnings are inconsistent (and probably irrelevant) because they were present in another article that does not have this bug and also absent in another article that has this bug.

Keeping all of this in mind, I will continue investigating. Hopefully I can pinpoint a cause soon 😅

Abishekcs commented 1 week ago

> So the questions are:
>
>   1. Is it the whoColorHtml that replaces parts of the parsedArticle that are meant to be <sup> with <ref>, or is it the highlightedHtml that replaces the <sup> with <ref>?

@empty-codes, I believe it's the whoColorHtml. I'm mentioning this because 😅 I reviewed the raw HTML output for whoColorHtml, and here's a small screenshot below. However, it might be a good idea to cross-check just to be sure.

Screenshot from 2024-10-04 17-27-46

Screenshot from 2024-10-04 17-27-21

empty-codes commented 1 week ago

> @empty-codes, I believe it's the whoColorHtml. I'm mentioning this because 😅 I reviewed the raw HTML output for whoColorHtml, and here's a small screenshot below. However, it might be a good idea to cross-check just to be sure.

@Abishekcs That answers the question, thank you! I'll also crosscheck from my end.

empty-codes commented 1 week ago

I was stuck trying to find a lead for a while 😅 but I finally got somewhere (I think).

For the Health effects of electronic cigarettes article:

The NewPP limit report for the parsed version:

[image]

The NewPP limit report for the highlighted authorship version:

[image]

For the Hispanic and Latino Americans article:

The NewPP limit report for the parsed version:

[image]

The NewPP limit report for the highlighted authorship version:

[image]

The key issue seems to be that the templates are not being expanded, as indicated by the 0 bytes in the Post-expand include size and Template argument size fields, as well as the minimal expansion depth.

This page provides more context about the meaning of the terms.
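To compare the limit reports without reading them by eye, something like the sketch below could pull out the relevant fields. It assumes the report appears in the parsed HTML as lines of the form "name: used/limit"; the helper names and the sample numbers are made up for illustration and are not taken from the affected articles.

```javascript
// Hypothetical helper: extracts selected "used/limit" fields from a
// NewPP limit report embedded in parsed HTML.
function parseLimitReport(html) {
  const fields = {
    // "." tolerates either an ASCII hyphen or a Unicode hyphen in "Post-expand".
    postExpandIncludeSize: /Post.expand include size:\s*(\d+)\/(\d+)/,
    templateArgumentSize: /Template argument size:\s*(\d+)\/(\d+)/,
    highestExpansionDepth: /Highest expansion depth:\s*(\d+)\/(\d+)/,
  };
  const report = {};
  for (const [name, pattern] of Object.entries(fields)) {
    const match = html.match(pattern);
    report[name] = match ? { used: Number(match[1]), limit: Number(match[2]) } : null;
  }
  return report;
}

// Rough signal for "templates were not expanded": zero bytes post-expand.
function templatesLookUnexpanded(report) {
  const size = report.postExpandIncludeSize;
  return size !== null && size.used === 0;
}

// Made-up sample values, for illustration only.
const sampleReport = [
  '<!-- NewPP limit report',
  'Post-expand include size: 0/2097152 bytes',
  'Template argument size: 0/2097152 bytes',
  'Highest expansion depth: 2/100',
  '-->',
].join('\n');

templatesLookUnexpanded(parseLimitReport(sampleReport)); // true for this sample
```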

At this point, I would like to ask for further guidance @Abishekcs @ragesoss. What steps should I take from here, please?
Also, I found this file that seems to contain the parsing logic: markuppreparser.inc.php. Is this still in use, and would it be relevant to this issue?

Thank you in advance!

ragesoss commented 1 week ago

That's interesting. The template expansion seems like a good clue, but it's not obvious to me whether an expansion limit is involved or whether it's being misparsed for some other reason. The unparsed ref tags seem likely to be relevant.

I can't tell whether that whoCOLOR repository is indirectly used for this. The main repo for the wikiwho-api servers is https://github.com/wikimedia/wikiwho_api

empty-codes commented 6 days ago

Noted! I will update you on any new findings @ragesoss

empty-codes commented 2 hours ago

@ragesoss Sorry for the lack of updates; I've been recovering from a minor cold but I'm back to working on this.

I’ve been focusing on identifying what triggers the bug on the affected pages, but I haven’t made much progress yet. The main clue is that templates on these pages are not expanding correctly, which seems to lead to <sup> tags being misinterpreted as <ref> tags.

In examining the Hispanic article, I found that in the working Wikipedia edit page, one template is:

{{Redirect-multi|2|Latinas|Latinos|other uses|Latina (disambiguation){{!}}Latina|and|Latino (disambiguation){{!}}Latino}}

However, in the WhoColor HTML output, it appears as:

{{Redirect-multi|2|Latinas|Latinos|other uses|Latina (disambiguation){{!}}<span class="editor-token token-editor-1152308" id="token-53">Latina</span><span class="editor-token token-editor-22831189" id="token-54">|</span>...}}

The presence of editor-token elements in the template definition seems to indicate that they are being improperly injected into the template content, likely interfering with normal rendering.
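To check how often this happens in a given WhoColor response, a rough scan like the one below counts editor-token spans that open while inside `{{ ... }}`. It is a heuristic sketch under the assumption that the template braces themselves appear as literal text, as in the example above; `countTokensInsideTemplates` is a hypothetical helper name.

```javascript
// Hypothetical diagnostic: counts editor-token spans that start while the
// scanner is inside at least one pair of template braces.
function countTokensInsideTemplates(markup) {
  const tokenOpen = '<span class="editor-token';
  let depth = 0; // current nesting level of {{ ... }}
  let count = 0;
  let i = 0;
  while (i < markup.length) {
    if (markup.startsWith('{{', i)) {
      depth += 1;
      i += 2;
    } else if (markup.startsWith('}}', i)) {
      depth = Math.max(0, depth - 1); // ignore unbalanced closers
      i += 2;
    } else {
      if (depth > 0 && markup.startsWith(tokenOpen, i)) count += 1;
      i += 1;
    }
  }
  return count;
}
```

Comparing the count for an affected article with the count for an unaffected one would show whether spans inside templates really correlate with the bug.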

My current hypothesis is that, as seen in the wikiwho_api/api/tasks.py file, the tasks are constrained by soft time limits and cache timeouts, which could lead to incomplete template expansion for articles with many citations or heavy wikitext usage. If tokenization kicks in before the full expansion, it may cause improper injection of elements like editor-token.

I have set up the WikiWho API locally to test this hypothesis and will post an update with my findings. Another possible cause could be the pickling of the articles, but I haven't investigated that yet.