I suspect you had an error running the extractor, which failed to collect the definition for macro {{w}}. Please download version 2.73, remove the saved template files, and try again.
OK. Does the latest commit fix the broken links as well?
It does not fix the problem of the first sentence either, since macro {{w}} uses the unsupported parser function #ifexist.
Try now.
Hello,
I re-ran the commands I listed in the first post and the result is exactly the same. To re-initialize the script, I just ran
rm -r wikiextractor
git clone https://github.com/attardi/wikiextractor.git
Is it enough, or do I have to delete other files? Please note that the extracted files were in the wikiextractor folder.
FYI, thanks to your suggestion about the broken {{w}} template, I made a workaround by running sed over the original dump file to turn it into something readable by WikiExtractor. I just ran:
sed 's/{{w|\([^}]*\)}}/[[\1]]/g' enwikinews-latest-pages-meta-current.xml > filtered-enwikinews-latest-pages-meta-current.xml
This way, {{w}} links are converted to "normal" Wikipedia links, for example (from the example page linked above):
{{w|Oslo|OSLO}} — The 2004 [[Nobel Peace Prize]] was awarded today to {{w|Wangari Maathai|Dr Wangari Maathai}} from [[Kenya]].
Becomes
[[Oslo|OSLO]] — The 2004 [[Nobel Peace Prize]] was awarded today to [[Wangari Maathai|Dr Wangari Maathai]] from [[Kenya]].
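To sanity-check the substitution before running it over the whole dump, you can try it on a single line first (using the sentence above as a test case):
echo '{{w|Oslo|OSLO}} — The 2004 [[Nobel Peace Prize]] was awarded today to {{w|Wangari Maathai|Dr Wangari Maathai}} from [[Kenya]].' | sed 's/{{w|\([^}]*\)}}/[[\1]]/g'
This should print the converted sentence shown above.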
So, I ran the extractor on the resulting file:
./WikiExtractor.py -o extractedWithLinks -l filtered-enwikinews-latest-pages-meta-current.xml
The resulting XML is well formed:
<doc id="1637" url="https://en.wikinews.org/wiki?curid=1637" title="Nobel Peace Prize awarded to Kenyan environmental activist">
Nobel Peace Prize awarded to Kenyan environmental activist
<a href="Oslo">OSLO</a> — The 2004 <a href="Nobel%20Peace%20Prize">Nobel Peace Prize</a> was awarded today to <a href="Wangari%20Maathai">Dr Wangari Maathai</a> from <a href="Kenya">Kenya</a>. She is the first <a href="Africa">African</a> woman to win the Peace prize, and the 12th woman to win the prize since its inception in 1901. The Nobel committee cited "her contribution to sustainable development, democracy and peace" as the reasons for awarding the prize. It is the first Peace prize awarded to an environmentalist.
Dr Maathai is a member of parliament in Kenya, the country's deputy environmental minister, and holds a <a href="Ph.D.">Ph.D.</a> in <a href="anatomy">anatomy</a> from the University of Nairobi. For seven years she was the director of the <a href="Red%20Cross">Red Cross</a> in Kenya, and is most known for founding the <a href="Green%20Belt%20Movement">Green Belt Movement</a> — a non-governmental organization dedicated to environmental conservation and protecting forests. Since its founding in 1997, the organization claims to have planted over 30 million trees, in the process employing thousands of women — offering them empowerment, education and even family planning.
...
</doc>
It's a very, very, very dirty solution but it seems to work.
Sorry, the {{w}} issue has been solved by fixing the loading of templates. You probably still have an incomplete template file: please remove it and create it again.
So which files should I remove? Isn't removing the WikiExtractor folder and re-cloning the repo enough?
You should remove the file that was given as argument for the --templates option.
As I wrote, I called the script with just the -l option, and I did not provide any template file to the extractor. Still, the generated XML appears to be broken. Am I missing something?
The template {{w}} is indeed defined in enwikinews-latest-pages-articles.xml.bz2. Try with that source file.
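You can verify that quickly with bzcat and grep (the capitalization of Template:W is an assumption here; MediaWiki normally uppercases the first letter of page titles):
bzcat enwikinews-latest-pages-articles.xml.bz2 | grep -m 1 '<title>Template:[Ww]</title>'
If the dump defines the template, this prints the matching <title> line.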
Hello,
sorry to bother you again, but I have had time to dig deeper into the project, and I still have problems. Following your suggestions, I ran
$ python WikiExtractor.py -o extractedWithLinks --templates
../enwikinews-lastest-pages-articles.xml.bz2 ../enwikinews-latest-pages-articles.xml.bz2
The result is still not what I expected. Take for example this page. The output of the extractor is:
<doc id="817" url="https://en.wikinews.org/wiki?curid=817" title="Pope John Paul II meets Iraq's Ambassador">
Pope John Paul II meets Iraq's Ambassador
</doc>
So, basically, the extractor wipes away all the content of the page. What could be the problem?
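If it helps to reproduce, the raw wikitext of that page can be pulled straight out of the dump to see which templates wrap the body (the 40-line window is an arbitrary guess at the article length):
bzcat enwikinews-latest-pages-articles.xml.bz2 | grep -A 40 "<title>Pope John Paul II meets Iraq's Ambassador</title>"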
I had a similar issue on Wikinews myself; the way I solved it might help you (here I remove the links completely, but the way I find them may help you modify them instead):
The problem was that I had:
{"id": "736", "revid": "70202", "url": "https://en.wikinews.org/wiki?curid=736", "title": "President of China lunches with Brazilian President", "text": ", the of the People's Republic of China had lunch today with the of Brazil, , at the \"Granja do Torto\", the President's country residence in the . Lunch was a traditional Brazilian with different kinds of meat. \nSome Brazilian ministers were present at the event: (Economy), (), (Agriculture), (Development), (), (Mines and Energy). Also present were ( company president) and Eduardo Dutra (, government oil company, president).\nThis meeting is part of a new agreement between Brazil and China where Brazil has recognized mainland China's status, and China has promised to buy more ."}
instead of:
{"id": "736", "revid": "70202", "url": "https://en.wikinews.org/wiki?curid=736", "title": "President of China lunches with Brazilian President", "text": "Hu Jintao, the President of the People's Republic of China had lunch today with the President of Brazil, Luiz In\u00e1cio Lula da Silva, at the \"Granja do Torto\", the President's country residence in the Brazilian Federal District. Lunch was a traditional Brazilian barbecue with different kinds of meat. \nSome Brazilian ministers were present at the event: Antonio Palocci (Economy), Eduardo Campos (Science and Technology), Roberto Rodrigues (Agriculture), Luiz Fernando Furlan (Development), Celso Amorim (Exterior Relations), Dilma Rousseff (Mines and Energy). Also present were Roger Agnelli (Vale do Rio Doce company president) and Eduardo Dutra (Petrobras, government oil company, president).\nThis meeting is part of a new political economy agreement between Brazil and China where Brazil has recognized mainland China's market economy status, and China has promised to buy more Brazilian products."}
pip install wikiextractor
sed -e 's/{{w|[^|}]*|\([^|}]*\)}}/\1/g' enwikinews-20230920-pages-meta-current.xml | sed -e 's/\[\[[^]|]*|\([^]|]*\)\]\]/\1/g' > enwikinews-20230920-pages-meta-current_parsed.xml
wikiextractor -b 100M -o en enwikinews-20230920-pages-meta-current_parsed.xml
(You can add --json to get a JSON file instead; there are other options, e.g. for links.)
This way, this raw Wikinews code:
Emails exchanged among {{w|United States Air Force|United States Air Force}} officials regarding a USD$23 billion dollar deal with aircraft manufacturer {{w|Boeing|Boeing}} have been entered into the public record. {{w|Senator|Senator}} {{w|John McCain|John McCain}} ({{w|United States Republican Party|R}}-[[Arizona|AZ]]) entered them into the {{w|Congressional Record|Congressional Record}} during a speech last week against the now-cancelled deal to lease 100 {{w|Aerial refueling|mid-air tanker}} aircraft from Boeing.
will be rendered as:
Emails exchanged among United States Air Force officials regarding a USD$23 billion dollar deal with aircraft manufacturer Boeing have been entered into the public record. Senator John McCain (R-AZ) entered them into the Congressional Record during a speech last week against the now-cancelled deal to lease 100 mid-air tanker aircraft from Boeing.
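As a quick check of the two sed passes on a single line (note that both patterns require a | separator, so single-argument forms such as {{w|Oslo}} or [[Kenya]] pass through unchanged):
echo 'Senator {{w|John McCain|John McCain}} ({{w|United States Republican Party|R}}-[[Arizona|AZ]])' | sed -e 's/{{w|[^|}]*|\([^|}]*\)}}/\1/g' -e 's/\[\[[^]|]*|\([^]|]*\)\]\]/\1/g'
This should print: Senator John McCain (R-AZ)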
Hello,
I'm using WikiExtractor for an academic project and I need to extract the pages from WikiNews while keeping the links. My problem is that the script, when called with the -l option, removes links instead of preserving them. Take as an example this news item, titled Nobel Peace Prize awarded to Kenyan environmental activist. I download the latest dump, then I run the script as follows:
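(The commands were presumably along the lines of the invocation quoted earlier in the thread; the dump URL and output directory name below are guesses:)
wget https://dumps.wikimedia.org/enwikinews/latest/enwikinews-latest-pages-meta-current.xml.bz2
bunzip2 enwikinews-latest-pages-meta-current.xml.bz2
./WikiExtractor.py -o extracted -l enwikinews-latest-pages-meta-current.xml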
I look for the file containing the text of the page:
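(Presumably with a recursive grep over the output directory, since WikiExtractor writes numbered files such as extracted/AA/wiki_00:)
grep -rl 'Nobel Peace Prize awarded to Kenyan environmental activist' extracted/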
If I look at the XML extracted by WikiExtractor it looks like this:
As you can see, the first sentence of the page is missing:
{{w|Oslo|OSLO}} — The 2004 [[Nobel Peace Prize]] was awarded today to {{w|Wangari Maathai|Dr Wangari Maathai}} from [[Kenya]].
And some of the links in the following sentences are missing as well. The extracted text is:
While the original text reads (the missing links are in bold):
So: am I missing something in the configuration of WikiExtractor? Is it a bug? Are WikiNews dumps for some reason not supported, even though they should be identical in structure to the usual Wikipedia ones?