chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

lots of missing words in text extracted from wikimedia dumps #227

Closed ckot closed 5 years ago

ckot commented 5 years ago

An example of a broken sentence I located (there are LOTS of these):

Uranium is a chemical element which can be used in  and .

The culprit is that templates aren't being expanded, which is quite understandable. The original marked up sentence (this is enwikinews) is:

Uranium is a chemical element which can be used in {{w|nuclear weapon|nuclear weapons}} and {{w|nuclear power|nuclear power plants}}. 

Although I imagine you don't want to be in the business of expanding templates, I believe that the {{w}} template is so simple that it can be handled via regexp substitution, and is perhaps low-enough hanging fruit that a one-line preprocessing step on the page's content could be done prior to passing the content to mwparserfromhell.

Such a preprocessing step, even if the generated wikilinks aren't perfect, would vastly increase the number of grammatical sentences in the extracted text.

Expected Behavior

The marked up sentence:

Uranium is a chemical element which can be used in {{w|nuclear weapon|nuclear weapons}} and {{w|nuclear power|nuclear power plants}}. 

Should produce:

Uranium is a chemical element which can be used in nuclear weapons and nuclear power plants. 

Although templates aren't shared across the various mediawiki projects, they can be copied. In this instance the 'w' template exists in both enwikinews and enwiki, but the two versions differ and have slightly different semantics.

In wikinews, {{w|link|optional label|optional other params}} will produce a wikilink local to wikinews, if it exists, or, if not, to wikipedia. In wikipedia, I believe it simply highlights the link as a currently non-existing link if it doesn't exist.

For the purpose of this software, I think the slight difference in semantics can be ignored. A simple regexp substitution converting them to [[\1]] (basically stripping the leading {{w| and trailing }} and wrapping what's in between with [[ and ]]) prior to parsing will cause these to be parsed as wikilinks. Labels for wikilinks already in the traditional [[link|label]] format are already present in the text.
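To make the substitution concrete, here is a minimal sketch of the rewrite on the example sentence above (the pattern needs a capture group for the \1 backreference to work):

```python
import re

# Strip the leading "{{w|" and trailing "}}" and wrap what's in between
# with "[[" and "]]" so it reads as an ordinary wikilink.
W_TMPL = re.compile(r"\{\{w\|([^}]+?)\}\}")

text = (
    "Uranium is a chemical element which can be used in "
    "{{w|nuclear weapon|nuclear weapons}} and "
    "{{w|nuclear power|nuclear power plants}}."
)

converted = W_TMPL.sub(r"[[\1]]", text)
print(converted)
# Uranium is a chemical element which can be used in
# [[nuclear weapon|nuclear weapons]] and [[nuclear power|nuclear power plants]].
```

Since the character class excludes }, the match can never run past the first closing braces, so each {{w|...}} is rewritten independently.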

Although, as I mentioned, the wikilinks generated by this aren't perfect (even if you were properly expanding the actual template, the parser wouldn't have information as to whether or not the link exists, either locally to the particular wiki or at all), I imagine this is of less interest to your users, who I'm guessing are NLP and/or ML folks who care most about the extracted text and categories, and would be happy to trade a little noise in the list of wikilinks for reduced noise in the extracted text.

Current Behavior

Possible Solution

    # note: the capture group is required for the \1 backreference below
    w_tmpl = re.compile(r'\{\{w\|([^}]+?)\}\}')

    def _parse_content(self, content, parser, fast, include_headings):
        unicode_ = compat.unicode_
        # rewrite {{w|...}} templates as plain wikilinks before parsing
        content = w_tmpl.sub(r'[[\1]]', content)
        wikicode = parser.parse(content)

I believe that such a change should be harmless at worst.
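One case worth checking is nesting: the "Steps to Reproduce" text below contains a {{w|...}} inside another template. A quick sketch (assuming the same regexp as above) shows the inner template is rewritten while the enclosing one is left for mwparserfromhell to handle:

```python
import re

# Same pattern as in the proposed change; [^}] cannot cross a closing
# brace, so only the innermost {{w|...}} is matched and rewritten.
W_TMPL = re.compile(r"\{\{w\|([^}]+?)\}\}")

nested = "{{byline|date=November 15, 2004|location={{w|Tehran|TEHRAN}}}}"
print(W_TMPL.sub(r"[[\1]]", nested))
# {{byline|date=November 15, 2004|location=[[Tehran|TEHRAN]]}}
```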

Let me know what you think. I'd be happy to submit a PR, although unless I've missed something (I haven't tested), I believe it would consist of exactly the code listed above. Unfortunately, the only way I currently know how to test is to compare output from before and after the change.

Steps to Reproduce (for bugs)

Pass the following text to _parse_content(): """ {{Iran nuclear program}}{{byline|date=November 15, 2004|location={{w|Tehran|TEHRAN}}}} {{w|Hassan Rowhani}}, head of the {{w|Supreme National Security Council}} for [[Iran]], announced Monday that the country would temporarily suspend conversion of {{w|uranium}} as of November 22. "Iran is planning to suspend uranium conversion activities from November 22," Rowhani said during a news conference.

Uranium is a chemical element which can be used in {{w|nuclear weapon|nuclear weapons}} and {{w|nuclear power|nuclear power plants}}. The process of conversion modifies the uranium oxydes into {{w|uranium hexafluoride}}. The purpose of such a conversion process is usually an intermediate step in the production of {{w|nuclear fuel}}. Uranium hexafluoride cannot itself be used in a nuclear weapon, but can be {{w|uranium enrichment|enriched}} into weapons-grade uranium, or it can be converted into {{w|plutonium}} in a {{w|nuclear reactor technology|nuclear reactor}}. Iran has claimed to be using its nuclear program for only peaceful nuclear energy, rather than for nuclear weapons, but there are concerns in the [[European Union]] and the [[United States|USA]] as to whether they are being truthful.

{{haveyoursay}} == Related news == {{wikipedia|Nuclear program of Iran}} *{{wikinews|title=Iran close to decision on nuclear program|date=November 13, 2004}}

== External links == *[http://www.nrc.gov/materials/fuel-cycle-fac/ur-conversion.html Uranium Conversion]

{{publish}} {{archive}} {{PD-Article}}

[[Category:Iran]] [[Category:Politics and conflicts]] [[Category:Nuclear proliferation]] [[Category:Nuclear weapons]] [[Category:Nuclear power]] """

Context

Much of the text extracted from each page is ungrammatical, a problem which could be vastly reduced by handling this simple template.

Your Environment

bdewilde commented 5 years ago

Hi @ckot, thanks for the detailed explanation and proposed solution, and apologies for the belated response. I'd started looking into this but never found any documentation on this "w template". Could you point me to a page that explains its usage? I just want to be sure that your solution doesn't have any unintended side-effects.

ckot commented 5 years ago

Hi @bdewilde, I've been meaning to get back to you on this, but have been a little busy. I'm not sure, but I think this might only be an issue for wikinews. I did some spot checking of the text from enwiki and didn't see problems, but spot checking a few articles on such a large corpus isn't exactly ideal.

I've been thinking (as I did long ago, when I first proposed adding support for wikinews) that perhaps we should rename the Wikipedia class to Wikimedia, make Wikipedia a subclass of it (simply pass for now), and make Wikinews a subclass of it as well. That way any wikiproject-specific processing (and perhaps this is the first instance of something like this) could be done without having to worry about breaking the processing of other wikiprojects. What do you think?
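A rough sketch of what I mean (class and method names here are illustrative, not textacy's actual API): a shared base class exposes a per-project preprocessing hook, and only Wikinews overrides it.

```python
import re


class Wikimedia:
    """Hypothetical shared base for wikiproject dataset classes."""

    def _preprocess_content(self, content):
        # Default: no project-specific rewriting.
        return content

    def _parse_content(self, content, parser):
        content = self._preprocess_content(content)
        return parser.parse(content)


class Wikipedia(Wikimedia):
    pass


class Wikinews(Wikimedia):
    _w_tmpl = re.compile(r"\{\{w\|([^}]+?)\}\}")

    def _preprocess_content(self, content):
        # Rewrite wikinews-specific {{w|...}} templates as plain wikilinks.
        return self._w_tmpl.sub(r"[[\1]]", content)
```

This way the {{w}} fix only ever runs on wikinews content, and future project-specific quirks get their own override rather than a conditional in shared code.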

Anyway, let me know your thoughts on this, and I'll look up the 'w' template in the meantime.

bdewilde commented 5 years ago

Sub-classing a shared Wikimedia class for both Wikipedia and Wikinews probably isn't a bad idea, I'm just not sure how necessary it is. How different is the markup? Btw, I found documentation:

https://en.wikinews.org/wiki/Template:W
https://commons.wikimedia.org/wiki/Template:W

bdewilde commented 5 years ago

Hey @ckot , I've opened a PR that addresses your issue, albeit in a different way than we'd expected: #235 . Any interest in trying it out?