attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Missing numbers in output #189

Open johnPertoft opened 5 years ago

johnPertoft commented 5 years ago

I tried to use this for a dump of Swedish Wikipedia but in a lot of cases I noticed that numbers are often missing from the output files. After some cross referencing between the output json files and the source xml file it seems related to {{formatnum}} which is a Magic word.

(parts of) Example article in xml file with this problem:

283928507:I omgivningarna runt Alkali Lake Indian Reserve 4A växer i huvudsak
[[barrskog]].<ref name = "nasalandcover"/> Trakten runt Alkali Lake Indian
Reserve 4A är nära nog obefolkad, med mindre än två invånare per kvadratkilometer.<ref
name = "nasapop"/>  Trakten ingår i den [[hemiboreala klimatzonen]].<ref name
= "koppen"/> [[Årsmedeltemperatur]]en i trakten är {{formatnum:3}}&nbsp
[[Grad Celsius|°C]]. Den varmaste månaden är juli, då medeltemperaturen är
{{formatnum:18}} °C, och den kallaste är december, med
{{formatnum:-10}} °C.<ref name = "nasa"/>

Corresponding article in json output file:

936:{"id": "5898341", "url": "https://sv.wikipedia.org/wiki?curid=5898341", "title": "Alkali Lake
Indian Reserve 4A", "text": "Alkali Lake Indian Reserve 4A\n\nAlkali Lake Indian Reserve 4A är
ett reservat i Kanada. Det ligger i provinsen British Columbia, i den sydvästra delen av landet,
km väster om huvudstaden Ottawa.\n\nI omgivningarna runt Alkali Lake Indian Reserve 4A växer
i huvudsak barrskog. Trakten runt Alkali Lake Indian Reserve 4A är nära nog obefolkad, med
mindre än två invånare per kvadratkilometer. Trakten ingår i den hemiboreala klimatzonen.
Årsmedeltemperaturen i trakten är  °C. Den varmaste månaden är juli, då medeltemperaturen är
 °C, och den kallaste är december, med  °C.\n"}

In the articles without this problem, numbers are (as far as I've seen) written as plain text without any magic words.

Is there any way to avoid this and other similar missing words? This issue is probably related to #151 and #153.
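To illustrate what goes wrong: the extractor drops the whole `{{formatnum:...}}` construct instead of keeping its argument. A minimal sketch of the substitution one would expect (this is an illustration of the desired behaviour, not the tool's actual code path, which dispatches through its parserFunctions table):

```python
import re

def expand_formatnum(text):
    """Replace {{formatnum:N}} with the bare number N.

    A simplified stand-in for what a formatnum handler should do;
    grouping/locale handling is deliberately ignored here.
    """
    return re.sub(r'\{\{formatnum:([^{}|]+)\}\}', r'\1', text)

source = "då medeltemperaturen är {{formatnum:18}} °C"
print(expand_formatnum(source))  # -> då medeltemperaturen är 18 °C
```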

johnPertoft commented 5 years ago

I guess I need to add a parser function here?

johnPertoft commented 5 years ago

I've also noticed that in many cases with this parser function, the argument list is empty, which seems to be related to source text like this: ['formatnum:{{Stat/Sverige/Landskap/Befolkning|Götaland}}']. These are Swedish words, which leads me to believe that there must be some definition somewhere of how to interpret them.

Edit: This specific template is described here (in Swedish). Is this problem just a matter of including this template definition as the --templates argument?

DavidNemeskey commented 5 years ago

I am seeing missing numbers in both Hungarian and English WP outputs. In Hungarian, the template {{szám|384402}} is not expanded; an English example is {{cvt|384402|km|mi}} from the Moon's WP page. I preprocessed both dumps, and both Template:Cvt and Sablon:Szám are in the template listings.

It seems that when the code tries to expand szám or cvt, it either misparses the expanded macro (Hungarian case), or fails to even collect it properly (English case).

I attached two logs of what is happening. They are full of random stuff, for which I apologize, but they give an idea of what is going on.

en.txt hu.txt

johnPertoft commented 5 years ago

Thanks for responding. I didn't have time to look further into this, but do you know if this script should be able to properly parse and look up values if given the correct input files?

I've learnt that a Wikipedia dump consists of a lot of files, some of which are sql files to recreate databases which I'm guessing are needed to fill in certain values, like maybe my previous example {{Stat/Sverige/Landskap/Befolkning|Götaland}}. This should expand into the population size of a certain area of Sweden. Do you know if this script is able to retrieve this information from somewhere given the right input arguments?

DavidNemeskey commented 5 years ago

As I understand it, this script takes as input a single bz2 file, so I just gave it the full pages-articles bz2 dump. That one includes the template pages, and they are extracted into the templates file (--templates) too -- it is only the substitution that fails. So I guess we are stuck with this bug for now.

HjalmarrSv commented 4 years ago

I have noticed the same, especially on newer machine-generated articles, where it costs nothing to burden the text with a lot of markup, necessary or not.

Regarding numbers, they are described as formatting here: https://www.mediawiki.org/wiki/Help:Magic_words

There are a lot of possibilities for user-generated formatting. They could be regarded as too much work for this tool, unless they impact the uses of the text. Some projects recommend the json output, which may be the way to go if it has fewer errors. It may be easier to just remove key words from the json than to interpret a dump to text.

Has anyone fixed this problem either way, i.e. fixed the parsing or cleaning functions? Or are there better dumps or tools?

mpagli commented 4 years ago

Same here. Would love to find a way to fix that.

HjalmarrSv commented 4 years ago

I did a quick fix that may work. Since template expansion seems to be outdated, why not use already-expanded text: the CirrusSearch dumps. It seems to work. And there are many mirrors to download from.

I had to do a minor code update, and added a text-only option. Please feel free to copy: https://github.com/HjalmarrSv/wikiextractor/blob/master/cirrus-extract.py

It may be possible to integrate the cirrus reader into WikiExtractor, if there is such a need.

mpagli commented 4 years ago

Thx for sharing! I tried the same thing, it works well to get the text with numbers, but I also get all the references mixed with the text. I'm still not sure how I can filter those out properly.

HjalmarrSv commented 4 years ago

With two spaces before the caret, references were still there. Changing to one space removes all of them: text = re.sub(r' \^ .*$', '', text). The $ is not needed; the match is greedy, so it will run to the end even without the end anchor ($). Still some links left, though...
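A quick check of that substitution on a made-up cirrus-style line (the reference tail after the caret is invented for illustration):

```python
import re

# Hypothetical cirrus-dump line: article text followed by a " ^ " reference tail.
line = 'It is a theoretical language. ^ Chandy, K. Mani (1988). Parallel Program Design.'

# One space before the caret, as suggested above; everything from " ^ " onward is cut.
cleaned = re.sub(r' \^ .*$', '', line)
print(cleaned)  # -> It is a theoretical language.
```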

mpagli commented 4 years ago

This would only remove a tiny fraction of references. Also, I'm not sure what you want to do with your wiki corpus, but in the eventuality that you'd like to train some language model, here is what you will get from the cirrus dump:

UNITY is a programming language constructed by K. Mani Chandy and Jayadev Misra for their book Parallel Program Design: A Foundation. It is a theoretical language which focuses on what, instead of where, when or how. The language contains no method of flow control, and program statements run in a nondeterministic way until statements cease to cause changes during execution. This allows for programs to run indefinitely, such as auto-pilot or power plant safety systems, as well as programs that would normally terminate (which here converge to a fixed point). All statements are assignments, and are separated by #. A statement can consist of multiple assignments, of the form a,b,c := x,y,z, or a := x || b := y || c := z. You can also have a quantified statement list, <# x,y : expression :: statement>, where x and y are chosen randomly among the values that satisfy expression. A quantified assignment is similar. In <|| x,y : expression :: statement >, statement is executed simultaneously for all pairs of x and y that satisfy expression. Bubble sort the array by comparing adjacent numbers, and swapping them if they are in the wrong order. Using Θ ( n ) {\displaystyle \Theta (n)} expected time, Θ ( n ) {\displaystyle \Theta (n)} processors and Θ ( n 2 ) {\displaystyle \Theta (n^{2})} expected work. The reason you only have Θ ( n ) {\displaystyle \Theta (n)} expected time, is that k is always chosen randomly from { 0 , 1 } {\displaystyle \{0,1\}} . This can be fixed by flipping k manually. Program bubblesort declare n: integer, A: array [0..n-1] of integer initially n = 20 # <|| i : 0 <= i and i < n :: A[i] = rand() % 100 > assign <# k : 0 <= k < 2 :: <|| i : i % 2 = k and 0 <= i < n - 1 :: A[i], A[i+1] := A[i+1], A[i] if A[i] > A[i+1] > > end You can sort in Θ ( log ⁡ n ) {\displaystyle \Theta (\log n)} time with rank-sort. You need Θ ( n 2 ) {\displaystyle \Theta (n^{2})} processors, and do Θ ( n 2 ) {\displaystyle \Theta (n^{2})} work. 
Program ranksort declare n: integer, A,R: array [0..n-1] of integer initially n = 15 # <|| i : 0 <= i < n :: A[i], R[i] = rand() % 100, i > assign <|| i : 0 <= i < n :: R[i] := <+ j : 0 <= j < n and (A[j] < A[i] or (A[j] = A[i] and j < i)) :: 1 > > # <|| i : 0 <= i < n :: A[R[i]] := A[i] > end Using the Floyd–Warshall algorithm all pairs shortest path algorithm, we include intermediate nodes iteratively, and get Θ ( n ) {\displaystyle \Theta (n)} time, using Θ ( n 2 ) {\displaystyle \Theta (n^{2})} processors and Θ ( n 3 ) {\displaystyle \Theta (n^{3})} work. Program shortestpath declare n,k: integer, D: array [0..n-1, 0..n-1] of integer initially n = 10 # k = 0 # <|| i,j : 0 <= i < n and 0 <= j < n :: D[i,j] = rand() % 100 > assign <|| i,j : 0 <= i < n and 0 <= j < n :: D[i,j] := min(D[i,j], D[i,k] + D[k,j]) > || k := k + 1 if k < n - 1 end We can do this even faster. The following programs computes all pairs shortest path in Θ ( log 2 ⁡ n ) {\displaystyle \Theta (\log ^{2}n)} time, using Θ ( n 3 ) {\displaystyle \Theta (n^{3})} processors and Θ ( n 3 log ⁡ n ) {\displaystyle \Theta (n^{3}\log n)} work. Program shortestpath2 declare n: integer, D: array [0..n-1, 0..n-1] of integer initially n = 10 # <|| i,j : 0 <= i < n and 0 <= j < n :: D[i,j] = rand() % 10 > assign <|| i,j : 0 <= i < n and 0 <= j < n :: D[i,j] := min(D[i,j], <min k : 0 <= k < n :: D[i,k] + D[k,j] >) > end After round r {\displaystyle r} , D[i,j] contains the length of the shortest path from i {\displaystyle i} to j {\displaystyle j} of length 0 … r {\displaystyle 0\dots r} . In the next round, of length 0 … 2 r {\displaystyle 0\dots 2r} , and so on. K. Mani Chandy and Jayadev Misra (1988) Parallel Program Design: A Foundation.

As opposed to the same article extracted from the standard wiki dump (using wikiextractor):

UNITY is a programming language constructed by K. Mani Chandy and Jayadev Misra for their book "Parallel Program Design: A Foundation". It is a theoretical language which focuses on "what", instead of "where", "when" or "how". The language contains no method of flow control, and program statements run in a nondeterministic way until statements cease to cause changes during execution. This allows for programs to run indefinitely, such as auto-pilot or power plant safety systems, as well as programs that would normally terminate (which here converge to a fixed point). Section::::Description. All statements are assignments, and are separated by codice_1. A statement can consist of multiple assignments, of the form codice_2, or codice_3. You can also have a "quantified statement list", codice_4, where x and y are chosen randomly among the values that satisfy "expression". A "quantified assignment" is similar. In codice_5, "statement" is executed simultaneously for "all" pairs of codice_6 and codice_7 that satisfy "expression". Section::::Examples. Section::::Bubble sort. Bubble sort the array by comparing adjacent numbers, and swapping them if they are in the wrong order. Using formula_1 expected time, formula_1 processors and formula_3 expected work. The reason you only have formula_1 "expected" time, is that codice_8 is always chosen randomly from formula_5. This can be fixed by flipping codice_8 manually. Section::::Rank-sort. You can sort in formula_6 time with rank-sort. You need formula_3 processors, and do formula_3 work. Section::::Floyd–Warshall algorithm. Using the Floyd–Warshall algorithm all pairs shortest path algorithm, we include intermediate nodes iteratively, and get formula_1 time, using formula_3 processors and formula_11 work. We can do this even faster. The following programs computes all pairs shortest path in formula_12 time, using formula_11 processors and formula_14 work. 
After round formula_15, codice_10 contains the length of the shortest path from formula_16 to formula_17 of length formula_18. In the next round, of length formula_19, and so on.

It seems to me that working with the cirrus dump makes it really difficult to filter out annoying features such as formulas, references, ... I personally would feel more confident training on a corpus missing few numbers here and there rather than a corpus containing a significant fraction of noisy references and markup.

DavidNemeskey commented 4 years ago

Just to chime in with another alternative: I ended up using the Kiwix Wikipedia dumps. The dump consists of ZIM archives, which contain the pages in HTML format. This makes parsing the infoboxes and similar templates all but impossible, but for those of us who only need the text, it is actually easier to process than the abomination that Mediawiki markup is.

For those interested, I have created a repo for parsing these ZIM dumps into a very limited form of HTML (only <p> and list tags). I managed to parse both the Hungarian and English dumps with it.

HjalmarrSv commented 4 years ago

Great with alternatives! ZIM is new to me.

I do not like math. :-) Maybe filter out math articles... I have not thought about them. Will now!

HjalmarrSv commented 4 years ago

As johnPertoft pointed out above, there is a place where formatnum should be. These lines can be added there. As far as I know, nothing improves but code readability: if there is a function that simply drops the braces {{ }} around an unknown function, then registering the function may break that behaviour. Otherwise this is a step forward. The formatnum itself would be simple to parse, but there is the template connection.

from https://en.wikipedia.org/wiki/Help:Magic_words

    '#dateformat': lambda args: '', # not supported
    '#formatdate': lambda args: '', # not supported
    '#tag': lambda args: '', # not supported
    'formatnum': lambda args: '', # not supported
    'gender': lambda args: '', # not supported
    'plural': lambda args: '', # not supported
    'padleft': lambda args: '', # not supported
    'padright': lambda args: '', # not supported

I would also like to put 'as of' here, but find no support for this assumption. I proposed another temporary solution in the As of thread: 'as of': lambda *args: '', # not supported. Note: this may be the wrong place in the code for this function.
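A sketch of how these "not supported" stubs behave compared with the keep-the-argument fix discussed later in this thread: mapping a parser function to an empty string avoids a crash but still drops the number, which is exactly the symptom in the output above.

```python
# Two candidate handlers for 'formatnum', as discussed in this thread.
drop = lambda *args: ''                     # the "not supported" stub: value is lost
keep = lambda extr, string, *rest: string   # later fix: keep the raw argument

# Simulated dispatch for {{formatnum:18}}; the first argument stands in
# for the Extractor instance, which these handlers ignore.
print(repr(drop(None, '18')))   # -> '' : the value disappears from the output
print(repr(keep(None, '18')))   # -> '18'
```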

lovasoa commented 4 years ago

This bug seems to be the root cause of most invalid sentences on http://voice.mozilla.org.

HjalmarrSv commented 4 years ago

In the above thread, {{date de naissance-|17|octobre|1973}} is mentioned as not handled by WikiExtractor, which I guess is a {{dateformat}} or {{formatdate}} parser tag.

There is a proposal for making a centralised function for handling parser tags in multiple languages. In this proposal there is also a list of a number of problems the lack of a central function is causing.

Two of the problems identified are relevant to this thread:

"For people editing in different languages, templates make translation harder. When translating a page, templates are much harder to handle than the article text (“prose”), whether the translation is done manually or with Content Translation. Users often have to skip the template or to correct it after the article was published. This also causes abandoning translations in progress, because template translation looks intimidating."

"Content Translation has a template adaptation feature, which automates some parts of this process, but it works only if a corresponding template exists in both languages, and if all the parameters were meticulously mapped by the template maintainers. This must be done for each template in each language separately and manually, and continuously maintained when the source template changes. This happens even though the templates’ function across the languages is the same most of the time."

https://www.mediawiki.org/wiki/Global_templates/Proposed_specification [This page was last edited on 19 January 2020, at 12:03]

Some problems that have been identified must reasonably be solved at the level where they are created. Automatic template expansion and parsing, primarily with tag names in English, may be the furthest this project can possibly go. That could mean supporting the idea of pre-parsing the .xml code to translate local tags into non-local ones.

Edit: Ideally the "pre-parsing" would be done by the template maintainers in the templates themselves. Otherwise, pending a central function, a translator would have to be written by anyone interested in a specific language.

HjalmarrSv commented 4 years ago

'formatnum': lambda extr, string, *rest: string,

seems to work for me (disambig also suddenly works, so this may also depend on something else I did/reverted). However, the decimal separator is a dot (.), so languages using a comma are left out in the cold with this version. Maybe use strip (edit: re.split?) if only the text is interesting? Swedish lake areas like "area på 0.0518 kvadratkilometer" will then become 0 sqkm in many robot texts. What will "robot intelligence" "think" about point objects and the physical properties of this world? :-)

this should work on other parser tags where keeping the value is enough.

HjalmarrSv commented 4 years ago

If someone else can replicate the result, we could call this problem closed.

In the same way as with 'int': lambda extr, string, rest: text_type(int(string)), it is possible to do an operation on the string. With import decimal there is not even a need to convert from string in order to use rounding up (to avoid zero values) if decimal markers are to be avoided.

If a formatnum fails now, it is probably because a template was not expanded before formatnum was parsed, in which case the template is possibly still in the text.

I have not tried it, but if you are not interested in the numbers themselves, you could replace the string argument with "number" or "nummer" or your choice. Could work.

A function could be called that localized the decimal point (. or ,) to user locale or article language.

I see no reason why it should not work to copy "formatnum ..." for other languages as well, thus not getting stuck on the "template expansion and translation" problem above.

I will try dateformat and formatdate with re.sub, or equivalent. You can of course use the formatnum example and get the unprocessed argument (with pipe (|) symbols). Here too you can use any language as the parser tag, to be able to process wikis in other languages.

HjalmarrSv commented 4 years ago

If you want comma then this works for me:

'formatnum': lambda extr, string, *rest: re.sub(r'\.', ',', string), #comma is decimal separator!

A pre-formatted number may turn to rubbish: 1.000.000,00 would become 1,000,000,00, and 1,000,000.00 would also become 1,000,000,00. A function for proper parsing is needed, even though error rates may be small and even acceptable.
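A sketch of what that "proper parsing" could look like: strip the grouping separators first, then swap only the decimal point. The assumed input format (dot as decimal separator, comma or nothing for grouping) matches the examples in this thread, but real formatnum arguments vary; this is not the code in any pull request here.

```python
def format_number(raw, decimal_sep=','):
    """Re-render a formatnum argument with the given decimal separator.

    Assumes raw uses '.' as the decimal separator and ',' (or nothing)
    for digit grouping, which covers the cases quoted in this thread.
    """
    raw = raw.replace(',', '')          # drop grouping commas first
    if decimal_sep != '.':
        raw = raw.replace('.', decimal_sep)
    return raw

print(format_number('1,000,000.00'))   # -> 1000000,00 (not 1,000,000,00)
print(format_number('0.0518'))         # -> 0,0518
```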

The entire list looks like this. Note that tag, dateformat and formatdate are experimental; I have not found them in my .xml.

parserFunctions = {
    '#expr': sharp_expr,
    '#if': sharp_if,
    '#ifeq': sharp_ifeq,
    '#iferror': sharp_iferror,
    '#ifexpr': lambda *args: '',  # not supported
    '#ifexist': lambda extr, title, ifex, ifnex: extr.expand(ifnex),  # assuming title is not present
    '#rel2abs': lambda *args: '',  # not supported
    '#switch': sharp_switch,
    '#language': lambda *args: '',  # not supported
    '#time': lambda *args: '',  # not supported
    '#timel': lambda *args: '',  # not supported
    '#titleparts': lambda *args: '',  # not supported
    '#dateformat': lambda extr, string, *rest: re.sub(r'\|', ' ', string),  # raw: lambda extr, string, *rest: string
    '#formatdate': lambda extr, string, *rest: re.sub(r'\|', ' ', string),  # raw: lambda extr, string, *rest: string
    '#tag': lambda extr, string, *rest: string,

    # This function is used in some pages to construct links
    # http://meta.wikimedia.org/wiki/Help:URL
    'urlencode': lambda extr, string, *rest: quote(string.encode('utf-8')),

    'lc': lambda extr, string, *rest: string.lower() if string else '',
    'lcfirst': lambda extr, string, *rest: lcfirst(string),
    'uc': lambda extr, string, *rest: string.upper() if string else '',
    'ucfirst': lambda extr, string, *rest: ucfirst(string),
    'int': lambda extr, string, *rest: text_type(int(string)),
    'formatnum': lambda extr, string, *rest: re.sub(r'\.', ',', string),  # comma as decimal separator! for dot use: lambda extr, string, *rest: string
    'gender': lambda *args: '',  # not supported; not relevant here ("inform the user on his/her talk page"), not ns0
    'plural': lambda *args: '',  # not supported
    'padleft': lambda *args: '',  # not supported
    'padright': lambda *args: '',  # not supported
    'as of': lambda *args: '',  # not supported; this may be the wrong place in code for this function
    'grammar': lambda *args: '',  # not supported
}

For tags in other languages, copy and add -- in theory. Test! It may actually be a template tag and may not work.

'date de naissance-': lambda extr, string, *rest: re.sub(r'\|', ' ', string),
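A quick check of that pipe-to-space substitution on the French date example from earlier in this thread, assuming the parser hands the handler the argument string with the template name already stripped:

```python
import re

# Argument string as the handler would receive it for
# {{date de naissance-|17|octobre|1973}} (template name already stripped).
arg = '17|octobre|1973'
expanded = re.sub(r'\|', ' ', arg)
print(expanded)  # -> 17 octobre 1973
```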

mpagli commented 4 years ago

So in the end you managed to recover the missing numbers?

I personally went on with the solution of @DavidNemeskey, I get the html files from the zim archive using his code, then I have my own code to extract the text from the html. I end up with a relatively clean content, including all numbers. The html markup allows a lot of room to preprocess the data in the way you want.

HjalmarrSv commented 4 years ago

I'll check those out. Luckily disk space is affordable. Only time is lacking.

Yes, the Formatnum works. -The idea of using formatnum is kind of silly. If my locale is US and I read a Swedish text - why would I want US dot decimal separator when swedes use comma as decimal separator? And if it only is markup for autotranslation, then do you really want translation that needs all kinds of markup to work, especially when most text lacks but basic markup? Not counting about a million articles of robot text, having latin titles, or foreign language hill names with all kinds of markup, with the same sentence structure, not so useful for language studies.

I made a pull request of the code, since I am not sure everyone checks out the forks. The formatnum fix consists of three parts: the function (optional), the call to the function in the parser tag list (a mandatory lambda), and the command line option (optional) if you want to switch between comma and dot when calling WikiExtractor.

AdaCheng commented 4 years ago

Hi, I also came across this problem. Is there any way to fix it?

Example in wiki: The mean body mass of the wolf is 40 kg (88 lb), the smallest specimen recorded at 12 kg (26 lb) and the largest at 79.4 kg (175 lb).

output: The mean body mass of the wolf is , the smallest specimen recorded at and the largest at

HjalmarrSv commented 4 years ago

Yes!

Try adding the following: 'formatnum': lambda extr, string, *rest: re.sub(r'\.', ',', string), #comma is decimal separator! on a separate row after the row with parserFunctions = { (which is on line 1872 in the code in WikiExtractor.py), as discussed above. (If you put it on line 1911 it will be the last entry in the dict.)

Depending on the language, you must decide on comma or dot as the decimal separator. If the problem persists, then it is not formatnum that is the problem. It may be that someone has defined their own template for numbers, and the template expansion does not catch it because of errors in the template.

As explained above, you can choose the ZIM or cirrus archives as an alternative. They have the templates already expanded, not always correctly, but good enough, I guess. You need to use another parser in that case.

Hope this works for you!

AdaCheng commented 4 years ago

Thanks for your reply. I found that the reason for my problem is that all these numbers are contained in a template, {{convert|xx|..}, which has been filtered out.
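For the Convert case, a handler in the same spirit as the formatnum one could keep the value and its source unit rather than dropping the whole template. A minimal sketch (the real Convert template also handles ranges, rounding and output-unit conversion, none of which is attempted here):

```python
import re

def expand_convert(text):
    """Replace {{convert|value|unit|...}} with 'value unit'.

    A simplified illustration only; extra parameters after the
    source unit are discarded.
    """
    return re.sub(r'\{\{convert\|([^|{}]+)\|([^|{}]+)[^{}]*\}\}', r'\1 \2', text)

print(expand_convert('The mean body mass of the wolf is {{convert|40|kg|lb}}.'))
# -> The mean body mass of the wolf is 40 kg.
```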

miguelwon commented 1 year ago

Any update on this issue?