johnPertoft opened 5 years ago
I guess I need to add a parser function here?
I've also noticed that in many cases with this parser function the argument list is empty, which seems to be related to source text like this: ['formatnum:{{Stat/Sverige/Landskap/Befolkning|Götaland}}']. These are Swedish words, leading me to believe that there should be some definition of how to interpret them somewhere?
Edit: This specific template is described here (in Swedish). Is this problem just a matter of including this template definition via the --templates argument?
I am seeing missing numbers in both Hungarian and English WP outputs. In Hungarian, the template {{szám|384402}} is not expanded; an English example is {{cvt|384402|km|mi}} from the Moon's WP page. I preprocessed both dumps, and both Template:Cvt and Sablon:Szám are in the template listings.
It seems that when the code tries to expand szám or cvt, it either misparses the expanded macro (Hungarian case) or fails to even collect it properly (English case).
I attached two logs of what is happening. They are full of random stuff, for which I apologize, but it gives an idea of what is happening.
Thanks for responding. I didn't have time to look further into this, but do you know if this script should be able to properly parse and look up values if given the correct input files?
I've learnt that a Wikipedia dump consists of a lot of files, some of which are SQL files to recreate databases, which I'm guessing are needed to fill in certain values, like my previous example {{Stat/Sverige/Landskap/Befolkning|Götaland}}. This should expand into the population size of a certain area of Sweden. Do you know if this script is able to retrieve this information from somewhere, given the right input arguments?
As I understand it, this script takes a single bz2 file as input, so I just gave it the full pages-articles bz2 dump. That one includes the template pages, and they are extracted into the templates file (--templates) too; it is only the substitution that fails. So I guess we are stuck with this bug for now.
I have noticed the same, especially in newer machine-generated articles, where it costs nothing to burden the text with a lot of markup, necessary or not.
Regarding numbers, they are described under formatting here: https://www.mediawiki.org/wiki/Help:Magic_words
There are a lot of possibilities for user-generated formatting. They could be regarded as too much work for this tool, unless they impact the uses of the text. Some projects recommend the JSON output, which may be the way to go if it has fewer errors. It may be easier to just remove keywords from the JSON than to interpret a dump to text.
Has anyone fixed this problem either way, i.e. fixed the parsing or cleaning functions? Or are there better dumps or tools?
Same here. Would love to find a way to fix that.
I did a quick fix that may work. Since template expansion seems to be outdated, why not use already expanded text: the CirrusSearch dumps. It seems to work, and there are many mirrors to download from.
I had to do a minor code update, and added a text-only option. Please feel free to copy: https://github.com/HjalmarrSv/wikiextractor/blob/master/cirrus-extract.py
It may be possible to integrate the cirrus reader into WikiExtractor, if there is such a need.
Thanks for sharing! I tried the same thing; it works well for getting the text with numbers, but I also get all the references mixed in with the text. I'm still not sure how I can filter those out properly.
With two spaces before the caret, references were still there. Changing to one space removes all of them: text = re.sub(r' \^ .*$', '', text). The $ is not needed: .* will match all the way to the end of the line even without the end anchor. Still some links left, though...
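A minimal sketch of that substitution, applied per line (the function name and the sample citation line are illustrative, not from the tool):

```python
import re

def strip_references(line: str) -> str:
    # Remove everything from a " ^ " citation marker to the end of the line.
    # The trailing $ from the thread is redundant: ".*" already stops at the
    # end of the line, since "." does not match "\n" without re.DOTALL.
    return re.sub(r' \^ .*', '', line)

print(strip_references("The Moon is 384402 km away. ^ Smith 2001"))
# The Moon is 384402 km away.
```

As noted above, this only catches references rendered with the ` ^ ` marker; other link residue needs separate handling.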
This would only remove a tiny fraction of references. Also, I'm not sure what you want to do with your wiki corpus, but in the eventuality that you'd like to train some language model, here is what you will get from the cirrus dump:
UNITY is a programming language constructed by K. Mani Chandy and Jayadev Misra for their book Parallel Program Design: A Foundation. It is a theoretical language which focuses on what, instead of where, when or how. The language contains no method of flow control, and program statements run in a nondeterministic way until statements cease to cause changes during execution. This allows for programs to run indefinitely, such as auto-pilot or power plant safety systems, as well as programs that would normally terminate (which here converge to a fixed point). All statements are assignments, and are separated by #. A statement can consist of multiple assignments, of the form a,b,c := x,y,z, or a := x || b := y || c := z. You can also have a quantified statement list, <# x,y : expression :: statement>, where x and y are chosen randomly among the values that satisfy expression. A quantified assignment is similar. In <|| x,y : expression :: statement >, statement is executed simultaneously for all pairs of x and y that satisfy expression. Bubble sort the array by comparing adjacent numbers, and swapping them if they are in the wrong order. Using Θ ( n ) {\displaystyle \Theta (n)} expected time, Θ ( n ) {\displaystyle \Theta (n)} processors and Θ ( n 2 ) {\displaystyle \Theta (n^{2})} expected work. The reason you only have Θ ( n ) {\displaystyle \Theta (n)} expected time, is that k is always chosen randomly from { 0 , 1 } {\displaystyle \{0,1\}} . This can be fixed by flipping k manually. Program bubblesort declare n: integer, A: array [0..n-1] of integer initially n = 20 # <|| i : 0 <= i and i < n :: A[i] = rand() % 100 > assign <# k : 0 <= k < 2 :: <|| i : i % 2 = k and 0 <= i < n - 1 :: A[i], A[i+1] := A[i+1], A[i] if A[i] > A[i+1] > > end You can sort in Θ ( log n ) {\displaystyle \Theta (\log n)} time with rank-sort. You need Θ ( n 2 ) {\displaystyle \Theta (n^{2})} processors, and do Θ ( n 2 ) {\displaystyle \Theta (n^{2})} work. 
Program ranksort declare n: integer, A,R: array [0..n-1] of integer initially n = 15 # <|| i : 0 <= i < n :: A[i], R[i] = rand() % 100, i > assign <|| i : 0 <= i < n :: R[i] := <+ j : 0 <= j < n and (A[j] < A[i] or (A[j] = A[i] and j < i)) :: 1 > > # <|| i : 0 <= i < n :: A[R[i]] := A[i] > end Using the Floyd–Warshall algorithm all pairs shortest path algorithm, we include intermediate nodes iteratively, and get Θ ( n ) {\displaystyle \Theta (n)} time, using Θ ( n 2 ) {\displaystyle \Theta (n^{2})} processors and Θ ( n 3 ) {\displaystyle \Theta (n^{3})} work. Program shortestpath declare n,k: integer, D: array [0..n-1, 0..n-1] of integer initially n = 10 # k = 0 # <|| i,j : 0 <= i < n and 0 <= j < n :: D[i,j] = rand() % 100 > assign <|| i,j : 0 <= i < n and 0 <= j < n :: D[i,j] := min(D[i,j], D[i,k] + D[k,j]) > || k := k + 1 if k < n - 1 end We can do this even faster. The following programs computes all pairs shortest path in Θ ( log 2 n ) {\displaystyle \Theta (\log ^{2}n)} time, using Θ ( n 3 ) {\displaystyle \Theta (n^{3})} processors and Θ ( n 3 log n ) {\displaystyle \Theta (n^{3}\log n)} work. Program shortestpath2 declare n: integer, D: array [0..n-1, 0..n-1] of integer initially n = 10 # <|| i,j : 0 <= i < n and 0 <= j < n :: D[i,j] = rand() % 10 > assign <|| i,j : 0 <= i < n and 0 <= j < n :: D[i,j] := min(D[i,j], <min k : 0 <= k < n :: D[i,k] + D[k,j] >) > end After round r {\displaystyle r} , D[i,j] contains the length of the shortest path from i {\displaystyle i} to j {\displaystyle j} of length 0 … r {\displaystyle 0\dots r} . In the next round, of length 0 … 2 r {\displaystyle 0\dots 2r} , and so on. K. Mani Chandy and Jayadev Misra (1988) Parallel Program Design: A Foundation.
As opposed to the same article extracted from the standard wiki dump (using wikiextractor):
UNITY is a programming language constructed by K. Mani Chandy and Jayadev Misra for their book "Parallel Program Design: A Foundation". It is a theoretical language which focuses on "what", instead of "where", "when" or "how". The language contains no method of flow control, and program statements run in a nondeterministic way until statements cease to cause changes during execution. This allows for programs to run indefinitely, such as auto-pilot or power plant safety systems, as well as programs that would normally terminate (which here converge to a fixed point). Section::::Description. All statements are assignments, and are separated by codice_1. A statement can consist of multiple assignments, of the form codice_2, or codice_3. You can also have a "quantified statement list", codice_4, where x and y are chosen randomly among the values that satisfy "expression". A "quantified assignment" is similar. In codice_5, "statement" is executed simultaneously for "all" pairs of codice_6 and codice_7 that satisfy "expression". Section::::Examples. Section::::Bubble sort. Bubble sort the array by comparing adjacent numbers, and swapping them if they are in the wrong order. Using formula_1 expected time, formula_1 processors and formula_3 expected work. The reason you only have formula_1 "expected" time, is that codice_8 is always chosen randomly from formula_5. This can be fixed by flipping codice_8 manually. Section::::Rank-sort. You can sort in formula_6 time with rank-sort. You need formula_3 processors, and do formula_3 work. Section::::Floyd–Warshall algorithm. Using the Floyd–Warshall algorithm all pairs shortest path algorithm, we include intermediate nodes iteratively, and get formula_1 time, using formula_3 processors and formula_11 work. We can do this even faster. The following programs computes all pairs shortest path in formula_12 time, using formula_11 processors and formula_14 work. 
After round formula_15, codice_10 contains the length of the shortest path from formula_16 to formula_17 of length formula_18. In the next round, of length formula_19, and so on.
It seems to me that working with the cirrus dump makes it really difficult to filter out annoying features such as formulas, references, ... I personally would feel more confident training on a corpus missing few numbers here and there rather than a corpus containing a significant fraction of noisy references and markup.
Just to chime in with another alternative: I ended up using the Kiwix Wikipedia dumps. The dump consists of ZIM archives, which contain the pages in HTML format. This makes parsing the infoboxes and similar templates all but impossible, but for those of us who only need the text, it is actually easier to process than the abomination that Mediawiki markup is.
For those interested, I have created a repo for parsing these ZIM dumps into a very limited form of HTML (only <p> and list tags). I managed to parse both the Hungarian and English dumps with it.
Great with alternatives! ZIM is new to me.
I do not like math. :-) Maybe filter out math articles... I have not thought about them. Will now!
As johnPertoft pointed out above, there is a place where formatnum should be. These lines can be added there. As far as I know, nothing improves but code readability. If there is a function that just drops the {{ }} tags when the function is unknown, then registering the function may ruin that; otherwise this is a step forward. The formatnum in itself would be simple to parse, but there is the template connection.
'#dateformat': lambda *args: '', # not supported
'#formatdate': lambda *args: '', # not supported
'#tag': lambda *args: '', # not supported
'formatnum': lambda *args: '', # not supported
'gender': lambda *args: '', # not supported
'plural': lambda *args: '', # not supported
'padleft': lambda *args: '', # not supported
'padright': lambda *args: '', # not supported
I would also like to put 'as of' here, but I find no support for this assumption. I proposed another temporary solution in the "As of" thread.
'as of': lambda *args: '', # not supported. Note: this may be the wrong place in the code for this function
This bug seems to be the root cause of most invalid sentences on http://voice.mozilla.org.
In the above thread, {{date de naissance-|17|octobre|1973}} is mentioned as not handled by WikiExtractor, which I guess is a {{dateformat}} or {{formatdate}} parser tag.
There is a proposal for making a centralised function for handling parser tags in multiple languages. In this proposal there is also a list of a number of problems the lack of a central function is causing.
Two of the problems identified are relevant to this thread:
"For people editing in different languages, templates make translation harder. When translating a page, templates are much harder to handle than the article text (“prose”), whether the translation is done manually or with Content Translation. Users often have to skip the template or to correct it after the article was published. This also causes abandoning translations in progress, because template translation looks intimidating."
"Content Translation has a template adaptation feature, which automates some parts of this process, but it works only if a corresponding template exists in both languages, and if all the parameters were meticulously mapped by the template maintainers. This must be done for each template in each language separately and manually, and continuously maintained when the source template changes. This happens even though the templates’ function across the languages is the same most of the time."
https://www.mediawiki.org/wiki/Global_templates/Proposed_specification [This page was last edited on 19 January 2020, at 12:03]
Some of the problems that have been identified must reasonably be solved at the level where they are created. Automatic template expansion and parsing, primarily with English tag names, may be the furthest this project can possibly go. That could mean supporting the idea of pre-parsing the .xml code to translate local tags into non-local ones.
Edit: Ideally, the pre-parsing would be done by the template maintainers in the templates themselves. Otherwise, pending a central function, anyone interested in a specific language would need to write a translator.
'formatnum': lambda extr, string, *rest: string,
seems to work for me (disambig also works suddenly, so it may also depend on something else I did/reverted). However, the decimal separator is a dot (.); languages using a comma are left out in the cold with this version. Maybe use strip (edit: re.split?) if only the text is interesting? Swedish lake areas like "area på 0.0518 kvadratkilometer" will then become 0 sqkm in many robot texts. What will "robot intelligence" "think" about point objects and the physical properties of this world? :-)
This should work on other parser tags where keeping the value is enough.
If someone else can replicate the result, we could call this problem closed.
In the same way as with 'int': lambda extr, string, *rest: text_type(int(string)), it is possible to do an operation on the string. With import decimal there is not even a need to convert from string to be able to use round-up (to avoid zero values), if decimal markers are to be avoided.
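A minimal sketch of that decimal idea (the function name is my own, not WikiExtractor code): Decimal accepts the argument string directly, and rounding up with a ceiling avoids collapsing small values like 0.0518 to zero.

```python
from decimal import Decimal, ROUND_CEILING

def roundup_formatnum(arg: str) -> str:
    # Decimal parses the string directly, so no float conversion is needed.
    # Ceiling rounding keeps small positive values from becoming zero.
    return str(Decimal(arg).quantize(Decimal('1'), rounding=ROUND_CEILING))

print(roundup_formatnum("0.0518"))  # 1
print(roundup_formatnum("79.4"))    # 80
```

Whether rounding every number up is acceptable depends entirely on what the corpus is used for; it trades accuracy for avoiding spurious zeros.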
If a formatnum fails now, it is probably because a template was not expanded before formatnum was parsed, in which case the template is possibly still in the text.
I have not tried it, but if you are not interested in the numbers themselves, replace the string argument with "number" or "nummer" or your choice. Could work.
A function could be called that localizes the decimal point (. or ,) to the user locale or the article language.
I see no reason why it should not work to copy the formatnum entry for other languages as well, thus not getting stuck on the "template expansion and translation" problem above.
I will try dateformat and formatdate with re.sub or equivalent. You can of course use the formatnum example and get the unprocessed argument (with pipe (|) symbols). Here too you can use any language as the parser tag, to be able to process wikis in other languages.
If you want comma then this works for me:
'formatnum': lambda extr, string, *rest: re.sub(r'\.', ',', string), #comma is decimal separator!
A pre-formatted number may turn to rubbish: 1.000.000,00 would become 1,000,000,00, and 1,000,000.00 would also become 1,000,000,00. A function for proper parsing is needed, even though error rates may be small and even acceptable.
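A minimal sketch of such a parse (my own heuristic, not WikiExtractor code): whichever of "." or "," occurs last is taken as the decimal mark, unless it occurs more than once, in which case it must be a thousands separator (as in "1.000.000"). Inputs like "1.000" remain genuinely ambiguous under any such rule.

```python
import re

def normalize_number(raw: str, decimal_sep: str = ',') -> str:
    # Pick the candidate decimal mark: the separator that occurs last.
    last_dot, last_comma = raw.rfind('.'), raw.rfind(',')
    dec = '.' if last_dot > last_comma else ','
    if raw.count(dec) > 1:
        dec = None  # repeated => it is a thousands separator, not a decimal mark
    int_part, frac = raw, ''
    if dec is not None and dec in raw:
        int_part, frac = raw.rsplit(dec, 1)
    digits = re.sub(r'[.,]', '', int_part)  # drop thousands separators
    return digits + (decimal_sep + frac if frac else '')

print(normalize_number("1.000.000,00"))  # 1000000,00
print(normalize_number("1,000,000.00"))  # 1000000,00
print(normalize_number("79.4"))          # 79,4
```

Both pre-formatted variants now normalize to the same unambiguous string, instead of the garbled 1,000,000,00 produced by a blind dot-to-comma substitution.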
The entire list looks like this. Note that tag, dateformat and formatdate are experimental. I have not found them in my .xml.
parserFunctions = {
'#expr': sharp_expr,
'#if': sharp_if,
'#ifeq': sharp_ifeq,
'#iferror': sharp_iferror,
'#ifexpr': lambda *args: '', # not supported
'#ifexist': lambda extr, title, ifex, ifnex: extr.expand(ifnex), # assuming title is not present
'#rel2abs': lambda *args: '', # not supported
'#switch': sharp_switch,
'#language': lambda *args: '', # not supported
'#time': lambda *args: '', # not supported
'#timel': lambda *args: '', # not supported
'#titleparts': lambda *args: '', # not supported
'#dateformat': lambda extr, string, *rest: re.sub(r'\|', ' ', string), #raw: '#dateformat': lambda extr, string, *rest: string
'#formatdate': lambda extr, string, *rest: re.sub(r'\|', ' ', string), #raw: '#formatdate': lambda extr, string, *rest: string
'#tag': lambda extr, string, *rest: string,
# This function is used in some pages to construct links
# http://meta.wikimedia.org/wiki/Help:URL
'urlencode': lambda extr, string, *rest: quote(string.encode('utf-8')),
'lc': lambda extr, string, *rest: string.lower() if string else '',
'lcfirst': lambda extr, string, *rest: lcfirst(string),
'uc': lambda extr, string, *rest: string.upper() if string else '',
'ucfirst': lambda extr, string, *rest: ucfirst(string),
'int': lambda extr, string, *rest: text_type(int(string)),
'formatnum': lambda extr, string, *rest: re.sub(r'\.', ',', string), #comma is decimal separator! for dot use: 'formatnum': lambda extr, string, *rest: string
'gender': lambda *args: '', # not supported #HjalmarrSv: not relevant here, because "such as in "inform the user on his/her talk page", not ns0.
'plural': lambda *args: '', # not supported
'padleft': lambda *args: '', # not supported
'padright': lambda *args: '', # not supported
'as of': lambda *args: '', # not supported #HjalmarrSv #this may be the wrong place in code for this function
'grammar': lambda *args: '', # not supported
}
For tags in other languages: copy and add, in theory. Test! It may actually be a template tag and may not work.
'date de naissance-': lambda extr, string, *rest: re.sub(r'\|', ' ', string),
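As a sanity check outside WikiExtractor, entries like these can be exercised directly. This is a sketch with illustrative copies of the two entries above; the extractor instance is unused by these lambdas, so None stands in, and it assumes the raw argument string reaches the lambda unsplit (with its pipes intact).

```python
import re

# Illustrative copies of the two parser-function entries discussed above.
parserFunctions = {
    'formatnum': lambda extr, string, *rest: re.sub(r'\.', ',', string),
    'date de naissance-': lambda extr, string, *rest: re.sub(r'\|', ' ', string),
}

# Called the way WikiExtractor calls them: extractor first, arguments after.
print(parserFunctions['formatnum'](None, '79.4'))                      # 79,4
print(parserFunctions['date de naissance-'](None, '17|octobre|1973'))  # 17 octobre 1973
```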
So in the end you managed to recover the missing numbers?
I personally went with the solution of @DavidNemeskey: I get the HTML files from the ZIM archive using his code, then I have my own code to extract the text from the HTML. I end up with relatively clean content, including all numbers. The HTML markup leaves a lot of room to preprocess the data the way you want.
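This is not @DavidNemeskey's code, just a minimal stdlib sketch of the same idea: collect the text of <p> elements from an HTML page, ignoring navigation and other markup.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text content of <p> elements from an HTML page."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting level inside <p> tags
        self.paragraphs = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.paragraphs.append(''.join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self.depth:
            self._buf.append(data)

def extract_paragraphs(html: str) -> list:
    parser = ParagraphExtractor()
    parser.feed(html)
    return parser.paragraphs
```

For example, feeding it `<p>The mean body mass of the wolf is 40 kg (88 lb).</p><div>nav</div>` keeps the sentence, numbers included, and drops the surrounding chrome.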
I'll check those out. Luckily disk space is affordable. Only time is lacking.
Yes, the formatnum fix works. The idea of using formatnum is kind of silly, though: if my locale is US and I read a Swedish text, why would I want a US dot decimal separator when Swedes use a comma as the decimal separator? And if it is only markup for autotranslation, do you really want translation that needs all kinds of markup to work, especially when most text lacks all but basic markup? Not to mention about a million articles of robot text, with Latin titles or foreign-language hill names, full of markup and sharing the same sentence structure; not so useful for language studies.
I made a pull request of the code, since I am not sure everyone checks out the forks. The formatnum fix consists of three parts: the function (optional), the call to the function in the parser tag list (a mandatory lambda), and the command line option (optional) if you want to switch between comma and dot when calling WikiExtractor.
Hi, I also came across this problem. Is there any way to fix it?
Example in wiki:
The mean body mass of the wolf is 40 kg (88 lb), the smallest specimen recorded at 12 kg (26 lb) and the largest at 79.4 kg (175 lb).
output:
The mean body mass of the wolf is , the smallest specimen recorded at and the largest at
Yes!
Try adding the following: 'formatnum': lambda extr, string, *rest: re.sub(r'\.', ',', string), #comma is decimal separator! in a separate row after the row with parserFunctions = { (which is on line 1872 in the code in WikiExtractor.py), as discussed above. (If you put it on line 1911 it will be the last statement in the clause.)
Depending on the language, you must decide on comma or dot as the decimal separator. If the problem persists, then it is not formatnum that is the problem. It may be that someone has defined their own template for numbers and the template translation does not catch it, because of errors in the template.
As explained above, you can choose the ZIM or cirrus archives as an alternative. They have already expanded the templates, not always correctly, but good enough I guess. You need to use another parser in that case.
Hope this works for you!
Thanks for your reply. I found that the cause of my problem is that all these numbers are contained in a template {{convert|xx|..}} which has been filtered out.
Any update on this issue?
I tried to use this for a dump of Swedish Wikipedia, but in a lot of cases I noticed that numbers are often missing from the output files. After some cross-referencing between the output json files and the source xml file, it seems related to
{{formatnum}}
which is a Magic word. (Parts of an) example article in the xml file with this problem:
Corresponding article in json output file:
In the articles without this problem, numbers are (as far as I've seen) written as plain text without any magic words.
Is there any way to avoid this and other similar missing words? This issue is probably related to #151 and #153.