Open ragesoss opened 1 month ago
Hello @ragesoss, I would like to try working on this!
@empty-codes go for it. This one may be a challenge, as I'm not sure which codebase is ultimately responsible for the error. There's almost certainly something about the wikicode for these example articles that is triggering the bug, so just knowing precisely what triggers it would be helpful.
Hi @ragesoss and @empty-codes,
I hope you're doing well! I wanted to share an observation I made while looking into this bug earlier today.
It seems that for the articles where this issue occurs most frequently, WikiWho is rendering the sup tag as shown below:
a) Opening tag: <ref> (where it should be <sup>)
b) Closing tag: </ref> (where it should be </sup>)
c) Self-closing tag: <ref /> (where it should be <sup />)
For example, in the actual HTML output, this appears as: <ref />.
Interestingly, in articles where this bug does not occur, WikiWho outputs the <sup> tag correctly as <sup>.
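To check an article for the three variants listed above, a small helper along these lines could be run against the raw WhoColor HTML. This is only a sketch: findRefTags is a hypothetical name, and it assumes the tags appear literally in the string (not HTML-escaped or split across token spans).

```javascript
// Count opening, closing, and self-closing <ref> tags in an HTML string.
// Hypothetical helper; not part of the Dashboard or WikiWho code.
function findRefTags(html) {
  const found = { opening: 0, closing: 0, selfClosing: 0 };
  // Matches <ref>, <ref name="x">, </ref>, and <ref ... />.
  // The \b keeps <references /> from being counted.
  const refTag = /<(\/?)ref\b([^>]*)>/gi;
  for (const match of html.matchAll(refTag)) {
    if (match[1] === '/') {
      found.closing += 1;
    } else if (match[2].trim().endsWith('/')) {
      found.selfClosing += 1;
    } else {
      found.opening += 1;
    }
  }
  return found;
}
```

An article whose output contains only <sup> tags should give zero for all three counts.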
Additionally, I've noticed that the bug tends to happen in articles where contributors have also added citations.
However, I did encounter at least one article where a citation didn't cause this issue, though I can't recall which article it was.
Please consider these initial thoughts, as I'm still trying to understand how the WikiWho algorithm works. I hope my explanation was clear.
Good luck with solving the bug @empty-codes!
@Abishekcs Thank you so much for your insights; they were really helpful for getting started.
I still have not solved the bug, but these are my findings so far @ragesoss : (sorry it's a bit long)
Firstly, I found that there are actually two different APIs involved:
The WikiWhoAPI which is designed to parse historical revisions of Wikipedia articles, providing detailed provenance of each token (word) in terms of who added, removed, or reintroduced it across different revisions.
The WhoColorAPI which is built on top of the WikiWho API and allows for the visualization of authorship data by color-coding tokens in the text based on their original authors. Wiki Edu Foundation employs this to show authorship data on its dashboard for students.
For this issue, the WhoColorAPI is the one we're concerned with.
Initially, the ArticleViewer component loads the parsed version of the article:
<div id="article-scrollbox-id" className="article-scrollbox">
{
fetched ? <ParsedArticle highlightedHtml={highlightedHtml} whocolorHtml={whoColorHtml} parsedArticle={parsedArticle} /> : <Loading />
}
</div>
The ParsedArticle component is defined in ParsedArticle.jsx:
export const ParsedArticle = ({ highlightedHtml, whocolorHtml, parsedArticle }) => {
let articleHTML = highlightedHtml || whocolorHtml || parsedArticle;
The ParsedArticle component accepts highlightedHtml, whocolorHtml, and parsedArticle as props and displays the first one that is available, in that order.
It then fetches authorship data from the WikiWho server.
Once the authorship data is available, it replaces the initially rendered parsed article with the highlighted HTML (from whoColorHtml).
parsedArticle: This is the basic version of the article, fetched by the fetchParsedArticle method and rendered first. It is the plain article HTML without any authorship highlighting, obtained from a MediaWiki API call using the parsedArticleURL(lastRevisionId) method in the URLBuilder, which returns a URL of this format:
`${base}/w/api.php?action=parse&oldid=${lastRevisionId}&disableeditsection=true&format=json`;
whoColorHtml: This is the raw HTML returned directly by the WhoColor API. It includes the token-level spans that identify which editor added or modified specific parts of the text. It is obtained in articleviewer.jsx by calling the fetchWhocolorHtml() function, which calls the __wikiwhoColorURLTimedRequestPromise(timeout, lastRevisionId) function, which in turn uses the wikiwhoColorURL URLBuilder method to build a URL of the format below for the API call:
const url = `${WIKIWHO_DOMAIN}/${language}/whocolor/v1.0.0-beta/${encodeURIComponent(title)}/${revisionId}/`;
highlightedHtml: This is the processed version of whoColorHtml, where additional formatting and styling are applied to correctly display the authorship data (e.g., wrapping spans for tokens with additional attributes for styling). It is populated by the highlightAuthors function, which uses the whoColorHtml state:
// This takes the extended_html from the whoColor API, and replaces the span
// annotations with ones that are more convenient to style in React.
// The matching and replacing of spans is tightly coupled to the span format
// provided by the whoColor API: https://github.com/wikimedia/wikiwho_api
const highlightAuthors = () => {
let html = whoColorHtml;
// Replace each editor span for this user with one that includes their
// username and color class.
const prevHtml = html;
const colorClass = colors[i];
const styledAuthorSpan = `<span title="${user.name}" class="editor-token token-editor-${user.userid} ${colorClass}`;
const authorSpanMatcher = new RegExp(`<span class="editor-token token-editor-${user.userid}`, 'g');
html = html.replace(authorSpanMatcher, styledAuthorSpan);
// more logic and logic and logic
setHighlightedHtml(html); // highlightedHtml state variable populated here
setPendingRequest(false);
};
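The excerpt above leaves out the loop over users, so here is a self-contained sketch of the same span replacement for experimenting outside React. The function name applyAuthorHighlighting, the users and colors arguments, and the modulo on the color index are my assumptions for illustration; only the two template strings come from the excerpt.

```javascript
// Standalone version of the span replacement, for testing in Node or a console.
// `users` is assumed to be an array of { name, userid } objects and `colors`
// an array of CSS class names; both are stand-ins for the component state.
function applyAuthorHighlighting(whoColorHtml, users, colors) {
  let html = whoColorHtml;
  users.forEach((user, i) => {
    // Assumption: wrap around if there are more users than color classes.
    const colorClass = colors[i % colors.length];
    // The closing quote is deliberately missing: the remainder of the original
    // tag (`" id="token-N">`) follows the replaced prefix unchanged.
    const styledAuthorSpan = `<span title="${user.name}" class="editor-token token-editor-${user.userid} ${colorClass}`;
    const authorSpanMatcher = new RegExp(`<span class="editor-token token-editor-${user.userid}`, 'g');
    html = html.replace(authorSpanMatcher, styledAuthorSpan);
  });
  return html;
}
```

In this sketch only the opening of each editor-token span is rewritten, so any <ref> or <sup> tags in the input pass through untouched. If the real function behaves the same way, it would not be the step that turns <sup> into <ref>.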
@Abishekcs identified that the problem seems to occur because WikiWho is incorrectly outputting <ref> tags instead of <sup> tags. This misrepresentation leads to HTML errors when the browser attempts to render the content, since <ref> is not a recognized standard HTML tag.
Note that the MediaWiki parse action returns a parse.text property that correctly contains all the <sup> tags for both the affected and unaffected articles, which is why the page renders fine initially using the parsedArticle prop/state variable. So the problem is likely not from the MediaWiki API.
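For anyone who wants to repeat this check, the sketch below rebuilds the parse URL in the format quoted earlier and counts <sup> tags in a response. The helper names extractParsedHtml and countSupTags are hypothetical, and I am assuming the legacy JSON format, where parse.text is an object keyed by '*'.

```javascript
// Same URL format as the URLBuilder method quoted earlier in this comment;
// `base` would be something like https://en.wikipedia.org.
function parsedArticleURL(base, lastRevisionId) {
  return `${base}/w/api.php?action=parse&oldid=${lastRevisionId}&disableeditsection=true&format=json`;
}

// Pull the rendered HTML out of a parse response.
// Assumption: with format=json and no formatversion=2, parse.text is
// { '*': '<html string>' }; a plain string is handled as well, just in case.
function extractParsedHtml(response) {
  const text = response && response.parse && response.parse.text;
  if (typeof text === 'string') return text;
  return text ? text['*'] : '';
}

// Count opening <sup> tags (rendered citations show up as <sup ...> elements).
function countSupTags(html) {
  return (html.match(/<sup\b/gi) || []).length;
}
```

Fetching the URL for an affected and an unaffected revision and comparing the counts should show whether both contain <sup> tags at this stage.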
The fact that some articles display correctly while others do not suggests that there may be inconsistencies in how the WhoColor API processes certain revisions of articles.
So the questions are:
Is it the whoColorHtml that replaces parts of the parsedArticle that are meant to be <sup> with <ref>, or is it the highlightedHtml that replaces the <sup> with <ref>?
What exactly is it about the affected articles that is triggering the bug?
If it is the whoColorHtml replacing it, it is likely an issue with the WhoColor API; if it is the highlightedHtml, it is likely an issue with the highlightAuthors function logic in ArticleViewer.jsx.
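One way to answer the first question is to log the three strings and see which is the first to contain a <ref> tag. A minimal sketch, where firstStageWithRefTags is a hypothetical helper and the stage names are just labels:

```javascript
// Given [name, html] pairs in pipeline order, return the name of the first
// stage whose HTML contains a <ref> tag, or null if none does.
function firstStageWithRefTags(stages) {
  // Matches <ref>, </ref>, and <ref ... />, but not <references />.
  const refTag = /<\/?ref\b[^>]*>/i;
  for (const [name, html] of stages) {
    if (typeof html === 'string' && refTag.test(html)) return name;
  }
  return null;
}
```

Called with parsedArticle, whoColorHtml, and highlightedHtml in that order, a result of 'whoColorHtml' would point at the WhoColor API, while 'highlightedHtml' would point at highlightAuthors.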
Additionally, @Abishekcs also noticed that the bug tends to happen in articles where contributors have also added citations.
In this specific article: UCSF Foundations II, in the parse action response, there were parse warnings:
parsewarnings[
"Script warning: <span style=\"color:#3a3\">One or more <code style=\"color: inherit; background: inherit; border: none; padding: inherit;\">{{[[Template:cite journal|cite journal]]}}</code> templates have maintenance messages</span>; messages may be hidden ([[Help:CS1_errors#Controlling_error_message_display|help]])."
]
Since the bug appears more frequently in articles with numerous citations, it’s possible that these templates are not being parsed correctly. However, these parse warnings are inconsistent (and probably irrelevant) because they were present in another article that does not have this bug and also absent in another article that has this bug.
Keeping all these in mind, I will continue further investigation. Hopefully I can pinpoint a cause soon😅
So the questions are:
- Is it the whoColorHtml that replaces parts of the parsedArticle that are meant to be <sup> with <ref>, or is it the highlightedHtml that replaces the <sup> with <ref>?
@empty-codes, I believe it's the whoColorHtml. I'm mentioning this because 😅 I reviewed the raw HTML output for whoColorHtml, and here's a small screenshot below. However, it might be a good idea to cross-check just to be sure.
@Abishekcs That answers the question, thank you! I'll also crosscheck from my end.
I was stuck trying to find a lead for a while 😅 but I finally got somewhere (I think).
The key issue seems to be that the templates are not being expanded, as indicated by the 0 bytes in the Post-expand include size and Template argument size fields, as well as the minimal expansion depth.
This page provides more context about the meaning of the terms.
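To compare these fields across articles without reading each report by eye, something like the sketch below could extract them. It assumes the limit report is present in the HTML as lines of the form "Label: used/limit" (optionally followed by a unit), which is how I have seen it; parseLimitReport and templatesLookUnexpanded are hypothetical names.

```javascript
// Extract "Label: used/limit" lines from a parser limit report.
// Assumption: one entry per line, e.g. "Template argument size: 0/2097152 bytes".
function parseLimitReport(html) {
  const report = {};
  const entry = /^\s*([^:\n]+):\s*(\d+)\/(\d+)(?:\s+\w+)?\s*$/gm;
  for (const m of html.matchAll(entry)) {
    // Some labels use a Unicode hyphen (U+2010); normalise it to ASCII.
    const label = m[1].trim().replace(/\u2010/g, '-');
    report[label] = { used: Number(m[2]), limit: Number(m[3]) };
  }
  return report;
}

// True when both template-related sizes are reported as 0.
function templatesLookUnexpanded(report) {
  const include = report['Post-expand include size'];
  const args = report['Template argument size'];
  return Boolean(include && args && include.used === 0 && args.used === 0);
}
```

A page with no templates at all would also report zeros, so a true result is only meaningful for articles whose wikicode is known to contain templates.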
At this point, I would like to ask for further guidance @Abishekcs @ragesoss. What steps should I take from here, please?
Also, I found this file that seems to contain the parsing logic: markuppreparser.inc.php. Is this still in use, and would it be relevant to this issue?
Thank you in advance!
That's interesting. The template expansion seems like a good clue, but it's not obvious to me whether an expansion limit is involved or whether the wikicode is being misparsed for some other reason. The unparsed ref tags seem likely to be relevant.
I can't tell whether that whoCOLOR repository is indirectly used for this. The main repo for the wikiwho-api servers is https://github.com/wikimedia/wikiwho_api
Noted! I will update you on any new findings @ragesoss
@ragesoss Sorry for the lack of updates; I've been recovering from a minor cold but I'm back to working on this.
I’ve been focusing on identifying what triggers the bug on the affected pages, but I haven’t made much progress yet. The main clue is that templates on these pages are not expanding correctly, which seems to lead to <sup> tags being misinterpreted as <ref> tags.
In examining the Hispanic article, I found that in the working Wikipedia edit page, one template is:
{{Redirect-multi|2|Latinas|Latinos|other uses|Latina (disambiguation){{!}}Latina|and|Latino (disambiguation){{!}}Latino}}
However, in the WhoColor HTML output, it appears as:
{{Redirect-multi|2|Latinas|Latinos|other uses|Latina (disambiguation){{!}}<span class="editor-token token-editor-1152308" id="token-53">Latina</span><span class="editor-token token-editor-22831189" id="token-54">|</span>...}}
The presence of editor-token elements in the template definition seems to indicate that they are being improperly injected into the template content, likely interfering with normal rendering.
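To find every place where this happens in an article, a rough scan like the one below could list editor-token spans that sit inside {{ }} braces. It is only a sketch: findTokensInsideTemplates is a hypothetical helper, it tracks plain {{ and }} pairs only, and it will miss cases where the braces themselves are split by markup.

```javascript
// List editor-token span tags that appear while inside {{ ... }} braces.
function findTokensInsideTemplates(html) {
  const hits = [];
  const spanOpen = '<span class="editor-token';
  let depth = 0; // current template nesting depth
  let i = 0;
  while (i < html.length) {
    if (html.startsWith('{{', i)) {
      depth += 1;
      i += 2;
    } else if (html.startsWith('}}', i)) {
      depth = Math.max(0, depth - 1);
      i += 2;
    } else if (depth > 0 && html.startsWith(spanOpen, i)) {
      // Record the whole opening tag, up to and including its '>'.
      const end = html.indexOf('>', i);
      const tagEnd = end === -1 ? html.length : end + 1;
      hits.push({ depth, tag: html.slice(i, tagEnd) });
      i = tagEnd;
    } else {
      i += 1;
    }
  }
  return hits;
}
```

Running it over the WhoColor output of an affected and an unaffected article would show whether spans inside templates are specific to the broken pages or appear everywhere.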
My current hypothesis is that, as seen in the wikiwho_api\api\tasks.py file, the tasks are constrained by soft time limits and cache timeouts, which could lead to incomplete template expansion for articles with many citations or heavy wikitext usage. If tokenization kicks in before the full expansion, it may cause improper injection of elements like editor-token.
I have set up the WikiWho API locally to test this hypothesis and will report my findings. Another possible cause could be the pickling of the articles, but I haven't investigated that yet.
Visit here and wait for the authorship highlighting to load: https://dashboard.wikiedu.org/courses/UCSF/Foundations_II_(Summer_2024)/articles/edited?showArticle=52260526
Once it loads, the rendered article is replaced by highlighted wikicode:
Additional context
The ArticleViewer initially loads the parsed version of the current article, and requests the authorship data from the wikiwho server. Once received, the wikiwho data (which is annotated wikicode) gets processed by Dashboard code to add CSS classes on a per-author basis, and is then sent to mediawiki to parse. No explicit errors are occurring in this example, either in the JS console or in network requests, but the call to the mediawiki API parse action is returning unparsed wikicode. One possible explanation is that the Dashboard code that operates on the wikiwho data is mishandling some particular aspect of this page's wikicode, resulting in a version that can't be parsed properly by mediawiki.