Open ragesoss opened 2 months ago
Hello @ragesoss, I would like to try working on this!
@empty-codes go for it. This one may be a challenge, as I'm not sure which codebase is ultimately responsible for the error. There's almost certainly something about the wikicode for these example articles that is triggering the bug, so just knowing precisely what triggers it would be helpful.
Hi @ragesoss and @empty-codes,
I hope you're doing well! I wanted to share an observation I made while looking into this bug earlier today.
It seems that for the articles where this issue occurs most frequently, WikiWho is rendering the sup tag as shown below:
a) Opening tag: <ref> (where it should be <sup>)
b) Closing tag: </ref> (where it should be </sup>)
c) Self-closing tag: <ref /> (where it should be <sup />)
For example, in the actual HTML output, this appears as: <ref />.
Interestingly, in articles where this bug does not occur, WikiWho outputs the <sup> tag correctly as <sup>.
Additionally, I've noticed that the bug tends to happen in articles where contributors have also added citations.
However, I did encounter at least one article where a citation didn't cause this issue, though I can't recall which article it was.
Please consider these as initial thoughts, as I’ve been trying to understand how the WikiWho algorithm works. I hope my explanation was clear.
Good luck with solving the bug @empty-codes!
@Abishekcs Thank you so much for your insights; they were really helpful for getting started.
I still have not solved the bug, but these are my findings so far @ragesoss : (sorry it's a bit long)
Firstly, I found that there are actually two different APIs involved:
The WikiWhoAPI which is designed to parse historical revisions of Wikipedia articles, providing detailed provenance of each token (word) in terms of who added, removed, or reintroduced it across different revisions.
The WhoColorAPI which is built on top of the WikiWho API and allows for the visualization of authorship data by color-coding tokens in the text based on their original authors. Wiki Edu Foundation employs this to show authorship data on its dashboard for students.
For this issue, the WhoColorAPI is the one we're concerned with.
Initially, the ArticleViewer component loads the parsed version of the article:
<div id="article-scrollbox-id" className="article-scrollbox">
{
fetched ? <ParsedArticle highlightedHtml={highlightedHtml} whocolorHtml={whoColorHtml} parsedArticle={parsedArticle} /> : <Loading />
}
</div>
The ParsedArticle component is defined in ParsedArticle.jsx:
export const ParsedArticle = ({ highlightedHtml, whocolorHtml, parsedArticle }) => {
let articleHTML = highlightedHtml || whocolorHtml || parsedArticle;
The ParsedArticle component accepts highlightedHtml, whocolorHtml, and parsedArticle as props and displays one of them based on what is available.
It then fetches authorship data from the WikiWho server.
Once the authorship data is available, it replaces the initially rendered parsed article with the highlighted HTML (from whoColorHtml).
parsedArticle: This is the basic version of the article fetched by the fetchParsedArticle method that is initially rendered. This is just the plain article HTML without any authorship highlighting, obtained from the MediaWiki API call using the parsedArticleURL(lastRevisionId) method in the URLBuilder, which returns a URL of this format:
`${base}/w/api.php?action=parse&oldid=${lastRevisionId}&disableeditsection=true&format=json`;
whoColorHtml: This is the raw HTML returned directly by the WhoColor API. It includes the token-level spans that identify which editor added or modified specific parts of the text. This is obtained from articleviewer.jsx by calling the fetchWhocolorHtml() function, which in turn calls the __wikiwhoColorURLTimedRequestPromise(timeout, lastRevisionId) function, which uses the wikiwhoColorURL URLBuilder method to get the URL of the format below for the API call:
const url = `${WIKIWHO_DOMAIN}/${language}/whocolor/v1.0.0-beta/${encodeURIComponent(title)}/${revisionId}/`;
highlightedHtml: This is the processed version of whoColorHtml, where additional formatting and styling are applied to correctly display the authorship data (e.g., wrapping spans for tokens with additional attributes for styling). It is populated by the highlightAuthors function, which uses the whoColorHtml state:
// This takes the extended_html from the whoColor API, and replaces the span
// annotations with ones that are more convenient to style in React.
// The matching and replacing of spans is tightly coupled to the span format
// provided by the whoColor API: https://github.com/wikimedia/wikiwho_api
const highlightAuthors = () => {
let html = whoColorHtml;
// Replace each editor span for this user with one that includes their
// username and color class.
const prevHtml = html;
const colorClass = colors[i];
const styledAuthorSpan = `<span title="${user.name}" class="editor-token token-editor-${user.userid} ${colorClass}`;
const authorSpanMatcher = new RegExp(`<span class="editor-token token-editor-${user.userid}`, 'g');
html = html.replace(authorSpanMatcher, styledAuthorSpan);
// more logic and logic and logic
setHighlightedHtml(html); // highlightedHtml state variable populated here
setPendingRequest(false);
};
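For readers following the WhoColor side in Python, the span rewrite above can be sketched as below. This is only an illustration of the technique, not the Dashboard code: the function name, the shape of the `users` list, and the color class names are assumptions, and the opening-span format is taken from the snippet above.

```python
def highlight_authors(whocolor_html, users, colors):
    """Rewrite each editor's opening span so it carries a title and a color class.

    `users` is assumed to be a list of dicts with "userid" and "name" keys,
    and `colors` a parallel list of CSS class names (both hypothetical shapes).
    """
    html = whocolor_html
    for user, color_class in zip(users, colors):
        # Opening of the span as emitted by the WhoColor API, up to the user id.
        plain_span = '<span class="editor-token token-editor-{}'.format(user["userid"])
        # Same opening, with the username as a title and the color class appended.
        styled_span = '<span title="{}" class="editor-token token-editor-{} {}'.format(
            user["name"], user["userid"], color_class
        )
        html = html.replace(plain_span, styled_span)
    return html
```

Because this only touches the opening of `editor-token` spans, a step like this has no obvious way to turn `<sup>` into `<ref>`; it is tightly coupled to the span format and nothing else.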
@Abishekcs identified that the problem seems to occur because WikiWho is incorrectly outputting <ref> tags instead of <sup> tags. This misrepresentation leads to HTML errors when the browser attempts to render the content since <ref> is not a recognized standard HTML tag.
Note that the MediaWiki parse action returns a parse.text property that correctly contains all the <sup> tags for both the affected and unaffected articles, which is why the page renders fine initially using the parsedArticle prop/state variable. So the problem is likely not from the MediaWiki API.
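A minimal Python sketch of the same check, assuming the URL format quoted earlier. The function names are hypothetical, and no network call is made here; `extract_parse_text` simply accepts an already-decoded JSON response. With the default JSON format the HTML sits under `parse.text["*"]`, while `formatversion=2` returns it as a plain string, so both shapes are handled.

```python
from urllib.parse import urlencode


def parsed_article_url(base, revision_id):
    """Build the parse-action URL in the same shape the URLBuilder returns."""
    query = urlencode({
        "action": "parse",
        "oldid": revision_id,
        "disableeditsection": "true",
        "format": "json",
    })
    return "{}/w/api.php?{}".format(base, query)


def extract_parse_text(response):
    """Return the rendered HTML from a decoded parse-action response."""
    text = response["parse"]["text"]
    # formatversion=1 wraps the HTML in {"*": ...}; formatversion=2 does not.
    return text["*"] if isinstance(text, dict) else text
```

The returned HTML can then be searched for `<sup` to confirm that the citations were rendered.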
The fact that some articles display correctly while others do not suggests that there may be inconsistencies in how the WhoColor API processes certain revisions of articles.
So the questions are:
Is it the whoColorHtml that replaces parts of the parsedArticle that are meant to be <sup> with <ref>, or is it the highlightedHtml that replaces the <sup> with <ref>?
What exactly is it about the affected articles that is triggering the bug?
If it is the whoColorHtml replacing it, it is likely an issue with the WhoColor API; if it is the highlightedHtml, it is likely an issue with the highlightAuthors function logic in ArticleViewer.jsx.
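One way to answer the first question is to scan the HTML captured at each stage and report the first stage where a `ref` tag shows up. The sketch below assumes the three HTML strings have been saved by hand (for example from the browser's network tab); the function names are made up for illustration, and both literal and HTML-escaped `ref` tags are counted because the thread does not say which form appears in the output.

```python
import re

# A literal <ref>, </ref> or <ref ... /> tag.
LITERAL_REF = re.compile(r"</?ref[\s/>]", re.IGNORECASE)
# The same tag after HTML escaping, e.g. &lt;ref name="a" /&gt;
ESCAPED_REF = re.compile(r"&lt;/?ref(?:\s|/|&gt;)", re.IGNORECASE)
SUP_TAG = re.compile(r"</?sup[\s/>]", re.IGNORECASE)


def tag_counts(html):
    """Count sup tags and leftover ref tags in one HTML string."""
    refs = len(LITERAL_REF.findall(html)) + len(ESCAPED_REF.findall(html))
    return {"sup": len(SUP_TAG.findall(html)), "ref": refs}


def first_stage_with_ref(stages):
    """Return the name of the first stage whose HTML contains a ref tag.

    `stages` is an ordered list of (name, html) pairs; returns None if no
    stage contains a ref tag.
    """
    for name, html in stages:
        if tag_counts(html)["ref"] > 0:
            return name
    return None
```

If the first stage reported is whoColorHtml, the Dashboard's highlightAuthors step can be ruled out.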
Additionally, @Abishekcs also noticed that the bug tends to happen in articles where contributors have also added citations.
In this specific article: UCSF Foundations II, in the parse action response, there were parse warnings:
parsewarnings[
"Script warning: <span style=\"color:#3a3\">One or more <code style=\"color: inherit; background: inherit; border: none; padding: inherit;\">{{[[Template:cite journal|cite journal]]}}</code> templates have maintenance messages</span>; messages may be hidden ([[Help:CS1_errors#Controlling_error_message_display|help]])."
]
Since the bug appears more frequently in articles with numerous citations, it’s possible that these templates are not being parsed correctly. However, these parse warnings are inconsistent (and probably irrelevant) because they were present in another article that does not have this bug and also absent in another article that has this bug.
Keeping all these in mind, I will continue further investigation. Hopefully I can pinpoint a cause soon😅
So the questions are:
- Is it the whoColorHtml that replaces parts of the parsedArticle that are meant to be <sup> with <ref>, or is it the highlightedHtml that replaces the <sup> with <ref>?
@empty-codes, I believe it's the whoColorHtml. I'm mentioning this because 😅 I reviewed the raw HTML output for whoColorHtml, and here's a small screenshot below. However, it might be a good idea to cross-check just to be sure.
@Abishekcs That answers the question, thank you! I'll also crosscheck from my end.
I was stuck trying to find a lead for a while 😅 but I finally got somewhere (I think).
The key issue seems to be that the templates are not being expanded, as indicated by the 0 bytes in the Post-expand include size and Template argument size fields, as well as the minimal expansion depth.
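To compare these fields across affected and unaffected pages, the limit report can be read programmatically. The sketch below assumes the report is available as plain text with lines of the form `label: used/limit`; the sample values in the usage note are made up, and the helper names are hypothetical. Some MediaWiki output uses a typographic hyphen in "Post‐expand", so hyphens are normalized before matching.

```python
import re

# One report line, e.g. "Template argument size: 0/2097152 bytes".
LIMIT_LINE = re.compile(
    r"^[ \t]*(?P<label>[^:\n]+):[ \t]*(?P<used>\d+)/(?P<limit>\d+)", re.MULTILINE
)


def parse_limit_report(report_text):
    """Map each 'label: used/limit' line to a (used, limit) tuple of ints."""
    normalized = report_text.replace("\u2010", "-")
    return {
        m["label"].strip(): (int(m["used"]), int(m["limit"]))
        for m in LIMIT_LINE.finditer(normalized)
    }


def templates_look_unexpanded(report):
    """True when both template-related sizes are present and zero."""
    keys = ("Post-expand include size", "Template argument size")
    return all(key in report and report[key][0] == 0 for key in keys)
```

For example, a report containing `Post-expand include size: 0/2097152 bytes` and `Template argument size: 0/2097152 bytes` would be flagged, which matches the symptom described above.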
This page provides more context about the meaning of the terms.
At this point, I would like to ask for further guidance @Abishekcs @ragesoss. What steps should I take from here, please?
Also, I found this file that seems to contain the parsing logic: markuppreparser.inc.php. Is this still in use, and would it be relevant to this issue?
Thank you in advance!
That's interesting. The template expansion seems like a good clue, but it's not obvious to me whether an expansion limit is involved, or whether it's being misparsed for some other reason. The unparsed ref tags seem likely to be relevant.
I can't tell whether that whoCOLOR repository is indirectly used for this. The main repo for the wikiwho-api servers is https://github.com/wikimedia/wikiwho_api
Noted! I will update you on any new findings @ragesoss
@ragesoss I successfully set up the wikiwho_api locally by importing XML dumps and generating pickles for the relevant articles. While examining the Hispanic article, I noticed a discrepancy between a template in the wikitext outputs:
Correct:
{{Redirect-multi|2|Latinas|Latinos|other uses|Latina (disambiguation){{!}}Latina|and|Latino (disambiguation){{!}}Latino}}
In extended_html:
{{Redirect-multi|2|Latinas|Latinos|other uses|Latina (disambiguation){{!}}<span class="editor-token token-editor-1152308" id="token-53">Latina</span><span class="editor-token token-editor-22831189" id="token-54">|</span>...}}
I traced this issue to the parser logic in ~/wikiwho_api/env/lib/python3.9/site-packages/WhoColor/parser.py and ~/wikiwho_api/env/lib/python3.9/site-packages/WhoColor/special_markups.py. The parser fails to recognize nested templates, prematurely closing templates after encountering {{!}}.
To fix this, I modified the __parse_wiki_text method in parser.py by introducing a template depth counter self.template_depth = 0 and a template stack self.template_stack = [] to track which templates are currently open and ensure they are only closed once all nested templates are closed.
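The idea of tracking open templates with a stack can be sketched in isolation as follows. This is a standalone illustration, not the change made to WhoColor's parser.py: the function name is hypothetical, and triple-brace parameters such as `{{{1}}}` are not handled.

```python
def top_level_template_spans(wikitext):
    """Return (start, end) offsets of each outermost {{...}} template.

    A template is only considered closed once every template nested inside
    it (for example {{!}}) has been closed as well.
    """
    spans = []
    open_starts = []  # start offsets of templates that are currently open
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            open_starts.append(i)
            i += 2
        elif pair == "}}" and open_starts:
            start = open_starts.pop()
            i += 2
            if not open_starts:
                # Depth is back to zero, so the outermost template ends here.
                spans.append((start, i))
        else:
            i += 1
    return spans
```

Applied to the Redirect-multi example above, this yields a single span covering the whole template, so no token spans would be injected between its outer braces.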
While this change successfully eliminated the unwanted <span> tags in the templates ✔️, the <ref> tag bug persists.
Both the rev_text and wiki_text values use correctly formatted <ref> tags, but the bug occurs at the following point, because of the parser.extended_wiki_text that is generated. When I changed the argument of the convert_wiki_text_to_html function to wiki_text itself, the templates were properly expanded.
parser = WikiMarkupParser(wiki_text, whocolor_data['tokens'])
parser.generate_extended_wiki_markup()
extended_html = wp_rev_text_obj.convert_wiki_text_to_html(parser.extended_wiki_text)
I am currently investigating whether it is caused by the parser logic, the token insertions, or something else. A small note: I have just been editing this comment instead of creating a new one each time. Thank you for your patience!
Hello @ragesoss, Here is my current update.
Initially, I attempted to resolve the issue of the parser prematurely closing templates upon encountering {{!}} by modifying the __parse_wiki_text method in parser.py and introducing a template depth counter (self.template_depth = 0) and a template stack (self.template_stack = []), along with additional logic to track open templates.
However, this approach resulted in another bug involving duplicate {{ and }} template tags, prompting me to revert the changes. Instead, I added new markup in special_markups.py to specifically address template delimiters like {{!}}:
{
'type': 'single',
'start_regex': re.compile(r'{{!}}'),
'end_regex': None,
'no_spans': True,
'no_jump': False
},
Note: I am aware the changes here are not permanent because the parser and special_markups py files are actually site packages/dependencies in a path like so: /home/emptycodes/wikiwho_api/env/lib/python3.9/site-packages/WhoColor/parser.py.
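A quick way to sanity-check the pattern in that entry is to run it against the problematic template text. The sketch below copies the entry as written; `PIPE_MAGIC_WORD` and `matched_ranges` are hypothetical names, and how WhoColor actually consumes the `no_spans` and `no_jump` flags is not shown in this thread, so only the regex is exercised.

```python
import re

# Copy of the markup entry above, bound to a name for testing.
PIPE_MAGIC_WORD = {
    'type': 'single',
    'start_regex': re.compile(r'{{!}}'),
    'end_regex': None,
    'no_spans': True,
    'no_jump': False,
}


def matched_ranges(wikitext, markup):
    """Return the (start, end) offsets matched by the entry's start_regex."""
    return [m.span() for m in markup['start_regex'].finditer(wikitext)]
```

Python's `re` treats these braces as literals because they do not form a valid repeat count, so the pattern matches the five characters `{{!}}` exactly.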
This modification effectively eliminated the unwanted <span> tags injected within templates; however, the <ref> tag bug persists.
Both the rev_text and wiki_text values utilize correctly formatted <ref> tags. However, the bug manifests when generating extended HTML:
parser = WikiMarkupParser(wiki_text, whocolor_data['tokens'])
parser.generate_extended_wiki_markup()
extended_html = wp_rev_text_obj.convert_wiki_text_to_html(parser.extended_wiki_text)
By changing the argument of the convert_wiki_text_to_html function to wiki_text, the templates were expanded correctly. This suggests a potential issue with the extended_wiki_text generated by the parser.
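One check that could narrow this down is to strip the injected editor spans from extended_wiki_text and confirm the result is identical to wiki_text. The sketch below assumes the span format seen in the extended_html excerpt earlier in this thread and that token text never contains a nested span; the helper names are hypothetical.

```python
import re

# An injected token span, e.g.
# <span class="editor-token token-editor-1152308" id="token-53">Latina</span>
EDITOR_SPAN = re.compile(
    r'<span class="editor-token token-editor-[^"]*" id="token-\d+">(.*?)</span>',
    re.DOTALL,
)


def strip_editor_spans(extended_wiki_text):
    """Remove injected editor spans, keeping the token text they wrap."""
    return EDITOR_SPAN.sub(lambda m: m.group(1), extended_wiki_text)


def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))
```

If `first_difference(strip_editor_spans(parser.extended_wiki_text), wiki_text)` returns an index rather than None, the parser changed something other than adding spans, and the text around that index is a good place to look.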
The following are steps I have taken in investigating the bug:
I utilized the WikiTemplate UDL tool to compare the wikitexts (both the wikitext and the parser.extended_wiki_text) of affected and unaffected pages, confirming no significant differences in template formats.
I attempted the following actions without success:
- Raising default_task_soft_time_limit in deployment/celery_config.py from 120 to 300 seconds; no change was observed.
- Adjusting the handling of <ref> tags in the markup file, which had no impact.
I also utilized the Wikipedia Special:ExpandTemplates tool with both the wikitext and the expanded wikitext (including injected editor tokens) of a buggy page. The templates in the response expanded correctly with both of the input wikitexts, indicating that the injected editor tokens are not likely to be responsible for the problem.
In conclusion, I cannot pinpoint a cause because:
Would you recommend I continue working on this issue? I'm sure there's something I am missing but I cannot pinpoint exactly what it is. Thank you for your patience!
@empty-codes thanks! this is really useful documentation of your debugging work. I suggest leaving this one; hopefully we can find the next clue at a later time, but it's a relatively rare bug.
I just checked the second example with the Who Wrote That? tool on Wikipedia, and it also displays this buggy behavior (which makes sense based on your debugging, as it's clearly a problem with the WikiWho processing). So we can be pretty confident now that it's not a bug in our codebase.
One really useful way to wrap this up would be to open an issue on Phabricator against the Who-Wrote-That project, summarizing what you've learned about the likely source of the bug within the WikiWho parser. There are some other issues there already related to pages that don't work as expected, but I don't see any that are clearly the same issue here, and I didn't spot anything along the lines of what you've done here to narrow down the source.
@ragesoss I've created the Phabricator issue here: https://phabricator.wikimedia.org/T377898
While I wasn't able to completely pinpoint the source of the bug, I learned a lot throughout the process and I'm glad this documentation will be useful. Thanks for your guidance throughout this process! 🙏
Visit here and wait for the authorship highlighting to load: https://dashboard.wikiedu.org/courses/UCSF/Foundations_II_(Summer_2024)/articles/edited?showArticle=52260526
Once it loads, the rendered article is replaced by highlighted wikicode:
Additional context
The ArticleViewer initially loads the parsed version of the current article, and requests the authorship data from the wikiwho server. Once received, the wikiwho data (which is annotated wikicode) gets processed by Dashboard code to add CSS classes on a per-author basis, and it is then sent to mediawiki to parse. No explicit errors are occurring in this example, either in the JS console or in network requests, but the call to the mediawiki API parse action is returning unparsed wikicode. One possible explanation is that the Dashboard code that operates on the wikiwho data is mishandling some particular aspect of this page's wikicode, resulting in a version that can't be parsed properly by mediawiki.