Closed ckot closed 5 years ago
Hi @ckot, thanks for the detailed explanation and proposed solution, and apologies for the belated response. I'd started looking into this but never found any documentation on this "w template". Could you point me to a page that explains its usage? I just want to be sure that your solution doesn't have any unintended side-effects.
Hi @bdewilde, I've been meaning to get back to you on this, but have been a little busy. I'm not sure, but I think this might only be an issue for wikinews. I did some spot checking of the text from enwiki and didn't see problems, but spot checking a few articles on such a large corpus isn't exactly ideal.
I've been thinking about this (and considered it long ago, when I first proposed adding support for wikinews): perhaps we should rename the Wikipedia class to Wikimedia, make Wikipedia a subclass of it (simply 'pass' for now), and make Wikinews a subclass of it as well. That way any wikiproject-specific processing, and perhaps this is the first instance of something like this, could be done without having to worry about breaking processing of the other wikiprojects. What do you think?
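To make the proposal concrete, here is a minimal sketch of the hierarchy I have in mind. All names are illustrative, not textacy's actual API, and the `{{w}}` rewrite shown is just a stand-in for whatever project-specific preprocessing ends up being needed:

```python
import re

class Wikimedia:
    """Shared logic for parsing dumps from any Wikimedia project."""
    def parse_content(self, content):
        return content  # stand-in for the real markup parsing

class Wikipedia(Wikimedia):
    pass  # no project-specific processing needed (yet)

class Wikinews(Wikimedia):
    def parse_content(self, content):
        # project-specific step: rewrite {{w|...}} templates as plain wikilinks
        content = re.sub(r"\{\{w\|([^{}]+)\}\}", r"[[\1]]", content)
        return super().parse_content(content)
```

The point is simply that `Wikinews` can override one hook without touching the shared parsing path that `Wikipedia` uses.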
Anyway, let me know your thoughts on this, and I'll look up the 'w' template in the meantime.
Sub-classing a shared Wikimedia class for both Wikipedia and Wikinews probably isn't a bad idea; I'm just not sure how necessary it is. How different is the markup? Btw, I found documentation:
https://en.wikinews.org/wiki/Template:W
https://commons.wikimedia.org/wiki/Template:W
Hey @ckot, I've opened a PR that addresses your issue, albeit in a different way than we'd expected: #235. Any interest in trying it out?
An example of a broken sentence I located (there are LOTS of these):
The culprit is that templates aren't being expanded, which is quite understandable. The original marked up sentence (this is enwikinews) is:
Although I imagine you don't want to be in the business of expanding templates, I believe that the `{{w}}` template is so simple that it can be handled via regexp substitution. Perhaps it should be considered low-enough-hanging fruit that a one-liner preprocessing step could be applied to the page's content before passing it to mwparserfromhell.
Such a preprocessing step, even if the generated wikilinks aren't perfect, would vastly increase the number of grammatical sentences in the extracted text.
Expected Behavior
The marked up sentence:
Should produce:
Although templates aren't shared across the various MediaWiki projects, they can be copied. In this instance the 'w' template exists in both enwikinews and enwiki, but the two versions are different and have slightly different semantics. In wikinews, `{{w|link|optional label|optional other params}}` will produce a wikilink local to wikinews if the target exists there, or a link to wikipedia otherwise. In wikipedia, I believe it simply highlights the link as a currently non-existing link if the target doesn't exist.
For the purposes of this software, I think the slight difference in semantics can be ignored. A simple regexp substitution converting these templates to `[[\1]]` (basically stripping the leading `{{w|` and the trailing `}}`, and wrapping what's in between with `[[` and `]]`) prior to parsing will cause them to be parsed as wikilinks. Labels for wikilinks already in the traditional `[[link|label]]` format are currently preserved in the text. As I mentioned, the wikilinks generated this way aren't perfect (even if you were properly expanding the actual template, the parser wouldn't have information as to whether or not the link exists, either locally to the particular wiki or at all). But I imagine this is of less interest to your users, who I'm guessing are NLP and/or ML folks who care most about the extracted text and categories, and would be happy to trade a little noise in the list of wikilinks for reduced noise in the extracted text.
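The substitution described above can be sketched as follows. The pattern is my own approximation (it assumes the `{{w|...}}` body contains no nested braces), and the function name is illustrative:

```python
import re

# Approximate pattern for {{w|link|label|...}} templates; assumes no
# nested braces inside the template body.
W_TEMPLATE = re.compile(r"\{\{w\|([^{}]+)\}\}", re.IGNORECASE)

def expand_w_templates(wikitext):
    """Rewrite {{w|link|label|...}} as an ordinary [[link|label|...]] wikilink."""
    return W_TEMPLATE.sub(r"[[\1]]", wikitext)
```

For example, `expand_w_templates("head of the {{w|Supreme National Security Council}}")` yields `"head of the [[Supreme National Security Council]]"`, which mwparserfromhell would then treat as a normal wikilink.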
Current Behavior
Possible Solution
I believe that such a change should be harmless at worst.
Let me know what you think. I'd be happy to submit a PR, although unless I've missed something (due to not having tested), I believe it would consist of exactly the code listed above. Unfortunately, the only way I currently know how to test is to compare output from before/after the change.
Steps to Reproduce (for bugs)
Pass the following text to `_parse_content()`:
""" {{Iran nuclear program}}{{byline|date=November 15, 2004|location={{w|Tehran|TEHRAN}}}} {{w|Hassan Rowhani}}, head of the {{w|Supreme National Security Council}} for [[Iran]], announced Monday that the country would temporarily suspend conversion of {{w|uranium}} as of November 22. "Iran is planning to suspend uranium conversion activities from November 22," Rowhani said during a news conference.Uranium is a chemical element which can be used in {{w|nuclear weapon|nuclear weapons}} and {{w|nuclear power|nuclear power plants}}. The process of conversion modifies the uranium oxydes into {{w|uranium hexafluoride}}. The purpose of such a conversion process is usually an intermediate step in the production of {{w|nuclear fuel}}. Uranium hexafluoride cannot itself be used in a nuclear weapon, but can be {{w|uranium enrichment|enriched}} into weapons-grade uranium, or it can be converted into {{w|plutonium}} in a {{w|nuclear reactor technology|nuclear reactor}}. Iran has claimed to be using its nuclear program for only peaceful nuclear energy, rather than for nuclear weapons, but there are concerns in the [[European Union]] and the [[United States|USA]] as to whether they are being truthful.
{{haveyoursay}} == Related news == {{wikipedia|Nuclear program of Iran}} *{{wikinews|title=Iran close to decision on nuclear program|date=November 13, 2004}}
== External links == *[http://www.nrc.gov/materials/fuel-cycle-fac/ur-conversion.html Uranium Conversion]
{{publish}} {{archive}} {{PD-Article}}
[[Category:Iran]] [[Category:Politics and conflicts]] [[Category:Nuclear proliferation]] [[Category:Nuclear weapons]] [[Category:Nuclear power]] """
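To see the effect on the reproduction text, the substitution can be run over one fragment of it; note that it also handles a `{{w}}` template nested inside another template (the regexp is my own approximation, not code from textacy):

```python
import re

# Before/after demo on a fragment of the reproduction text above.
fragment = "{{byline|date=November 15, 2004|location={{w|Tehran|TEHRAN}}}}"
fixed = re.sub(r"\{\{w\|([^{}]+)\}\}", r"[[\1]]", fragment)
print(fixed)  # {{byline|date=November 15, 2004|location=[[Tehran|TEHRAN]]}}
```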
Context
Much of the text extracted from each page is ungrammatical, a problem which could be vastly improved by handling this simple template.
Your Environment
spacy version: 2.0.18
spacy models: ['en_core_web_md']
textacy version: 0.6.2