I suspect you had an error running the extractor, which failed to collect the definition for macro {{w}}. Please download version 2.73, remove the saved template files, and try again.
OK. Does the latest commit fix the broken links as well?
It does not fix the problem of the first sentence either, since macro {{w}} uses the unsupported parser function #ifexist.
Try now.
Hello,
I re-ran the commands I listed in the first post and the result is exactly the same. To re-initialize the script, I just ran
rm -r wikiextractor
git clone https://github.com/attardi/wikiextractor.git
Is it enough, or do I have to delete other files? Please note that the extracted files were in the wikiextractor folder.
FYI, thanks to your suggestion about the broken {{w}} template, I made a workaround by running sed over the original dump file to turn it into something readable by WikiExtractor. I just ran:
sed 's/{{w|\([^}]*\)}}/[[\1]]/g' enwikinews-latest-pages-meta-current.xml > filtered-enwikinews-latest-pages-meta-current.xml
This way, {{w}} links are converted to "normal" Wikipedia links, for example (from the example page linked above):
{{w|Oslo|OSLO}} — The 2004 [[Nobel Peace Prize]] was awarded today to {{w|Wangari Maathai|Dr Wangari Maathai}} from [[Kenya]].
Becomes
[[Oslo|OSLO]] — The 2004 [[Nobel Peace Prize]] was awarded today to [[Wangari Maathai|Dr Wangari Maathai]] from [[Kenya]].
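To sanity-check the substitution before running it over the whole dump, you can try it on a single line first (using the sentence above as a test case):
echo '{{w|Oslo|OSLO}} — The 2004 [[Nobel Peace Prize]] was awarded today to {{w|Wangari Maathai|Dr Wangari Maathai}} from [[Kenya]].' | sed 's/{{w|\([^}]*\)}}/[[\1]]/g'
This should print the converted sentence shown above.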
So, I ran the extractor on the resulting file:
./WikiExtractor.py -o extractedWithLinks -l filtered-enwikinews-latest-pages-meta-current.xml
The resulting XML is well formed:
<doc id="1637" url="https://en.wikinews.org/wiki?curid=1637" title="Nobel Peace Prize awarded to Kenyan environmental activist">
Nobel Peace Prize awarded to Kenyan environmental activist
<a href="Oslo">OSLO</a> — The 2004 <a href="Nobel%20Peace%20Prize">Nobel Peace Prize</a> was awarded today to <a href="Wangari%20Maathai">Dr Wangari Maathai</a> from <a href="Kenya">Kenya</a>. She is the first <a href="Africa">African</a> woman to win the Peace prize, and the 12th woman to win the prize since its inception in 1901. The Nobel committee cited "her contribution to sustainable development, democracy and peace" as the reasons for awarding the prize. It is the first Peace prize awarded to an environmentalist.
Dr Maathai is a member of parliament in Kenya, the country's deputy environmental minister, and holds a <a href="Ph.D.">Ph.D.</a> in <a href="anatomy">anatomy</a> from the University of Nairobi. For seven years she was the director of the <a href="Red%20Cross">Red Cross</a> in Kenya, and is most known for founding the <a href="Green%20Belt%20Movement">Green Belt Movement</a> — a non-governmental organization dedicated to environmental conservation and protecting forests. Since its founding in 1997, the organization claims to have planted over 30 million trees, in the process employing thousands of women — offering them empowerment, education and even family planning.
...
</doc>
It's a very, very, very dirty solution but it seems to work.
Sorry, the {{w}} issue has been solved by fixing the loading of templates. You probably still have an incomplete template file: please remove it and create it again.
So which files should I remove? Isn't removing the WikiExtractor folder and re-cloning the repo enough?
You should remove the file that was given as argument for the --templates option.
As I wrote, I called the script with just the -l option, and I did not provide any template file to the extractor. Still, the generated XML appears to be broken. Am I missing something?
The template {{w}} is indeed defined in enwikinews-latest-pages-articles.xml.bz2. Try with that source file.
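You can verify that quickly with bzcat and grep (the capitalization of Template:W is an assumption here; MediaWiki normally uppercases the first letter of page titles):
bzcat enwikinews-latest-pages-articles.xml.bz2 | grep -m 1 '<title>Template:[Ww]</title>'
If the dump defines the template, this prints the matching <title> line.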
Hello,
sorry to bother you again, but I have had time to dig deeper into the project, and I still have problems. Following your suggestions, I ran
$ python WikiExtractor.py -o extractedWithLinks --templates
../enwikinews-lastest-pages-articles.xml.bz2 ../enwikinews-latest-pages-articles.xml.bz2
The result is still not what I expected. Take for example this page. The output of the extractor is:
<doc id="817" url="https://en.wikinews.org/wiki?curid=817" title="Pope John Paul II meets Iraq's Ambassador">
Pope John Paul II meets Iraq's Ambassador
</doc>
So, basically, the extractor wipes away all the content of the page. What could be the problem?
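If it helps to reproduce, the raw wikitext of that page can be pulled straight out of the dump to see which templates wrap the body (the 40-line window is an arbitrary guess at the article length):
bzcat enwikinews-latest-pages-articles.xml.bz2 | grep -A 40 "<title>Pope John Paul II meets Iraq's Ambassador</title>"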
I had a similar issue on Wikinews myself; the way I solved it might help you (here I remove the links completely, but the way I find them may help you modify them instead):
The problem was that I had:
{"id": "736", "revid": "70202", "url": "https://en.wikinews.org/wiki?curid=736", "title": "President of China lunches with Brazilian President", "text": ", the of the People's Republic of China had lunch today with the of Brazil, , at the \"Granja do Torto\", the President's country residence in the . Lunch was a traditional Brazilian with different kinds of meat. \nSome Brazilian ministers were present at the event: (Economy), (), (Agriculture), (Development), (), (Mines and Energy). Also present were ( company president) and Eduardo Dutra (, government oil company, president).\nThis meeting is part of a new agreement between Brazil and China where Brazil has recognized mainland China's status, and China has promised to buy more ."}
instead of:
{"id": "736", "revid": "70202", "url": "https://en.wikinews.org/wiki?curid=736", "title": "President of China lunches with Brazilian President", "text": "Hu Jintao, the President of the People's Republic of China had lunch today with the President of Brazil, Luiz In\u00e1cio Lula da Silva, at the \"Granja do Torto\", the President's country residence in the Brazilian Federal District. Lunch was a traditional Brazilian barbecue with different kinds of meat. \nSome Brazilian ministers were present at the event: Antonio Palocci (Economy), Eduardo Campos (Science and Technology), Roberto Rodrigues (Agriculture), Luiz Fernando Furlan (Development), Celso Amorim (Exterior Relations), Dilma Rousseff (Mines and Energy). Also present were Roger Agnelli (Vale do Rio Doce company president) and Eduardo Dutra (Petrobras, government oil company, president).\nThis meeting is part of a new political economy agreement between Brazil and China where Brazil has recognized mainland China's market economy status, and China has promised to buy more Brazilian products."}
pip install wikiextractor
sed -e 's/{{w|[^|}]*|\([^|}]*\)}}/\1/g' enwikinews-20230920-pages-meta-current.xml | sed -e 's/\[\[[^]|]*|\([^]|]*\)\]\]/\1/g' > enwikinews-20230920-pages-meta-current_parsed.xml
wikiextractor -b 100M -o en enwikinews-20230920-pages-meta-current_parsed.xml
(You can add --json to get a JSON file instead; there are other options, e.g. for links.)
This way, this raw Wikinews code:
Emails exchanged among {{w|United States Air Force|United States Air Force}} officials regarding a USD$23 billion dollar deal with aircraft manufacturer {{w|Boeing|Boeing}} have been entered into the public record. {{w|Senator|Senator}} {{w|John McCain|John McCain}} ({{w|United States Republican Party|R}}-[[Arizona|AZ]]) entered them into the {{w|Congressional Record|Congressional Record}} during a speech last week against the now-cancelled deal to lease 100 {{w|Aerial refueling|mid-air tanker}} aircraft from Boeing.
will be rendered as:
Emails exchanged among United States Air Force officials regarding a USD$23 billion dollar deal with aircraft manufacturer Boeing have been entered into the public record. Senator John McCain (R-AZ) entered them into the Congressional Record during a speech last week against the now-cancelled deal to lease 100 mid-air tanker aircraft from Boeing.
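As a quick check of the two sed passes on a single line (note that both patterns require a | separator, so single-argument forms such as {{w|Oslo}} or [[Kenya]] pass through unchanged):
echo 'Senator {{w|John McCain|John McCain}} ({{w|United States Republican Party|R}}-[[Arizona|AZ]])' | sed -e 's/{{w|[^|}]*|\([^|}]*\)}}/\1/g' -e 's/\[\[[^]|]*|\([^]|]*\)\]\]/\1/g'
This should print: Senator John McCain (R-AZ)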
Hello,
I'm using WikiExtractor for an academic project and I need to extract the pages from WikiNews while keeping the links. My problem is that the script, when called with the -l option, removes links instead of preserving them. Take as an example this news item, titled Nobel Peace Prize awarded to Kenyan environmental activist. I download the latest dump, then I run the script as follows:
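(The commands were presumably along the lines of the invocation quoted earlier in the thread; the dump URL and output directory name below are guesses:)
wget https://dumps.wikimedia.org/enwikinews/latest/enwikinews-latest-pages-meta-current.xml.bz2
bunzip2 enwikinews-latest-pages-meta-current.xml.bz2
./WikiExtractor.py -o extracted -l enwikinews-latest-pages-meta-current.xml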
I look for the file containing the text of the page:
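(Presumably with a recursive grep over the output directory, since WikiExtractor writes numbered files such as extracted/AA/wiki_00:)
grep -rl 'Nobel Peace Prize awarded to Kenyan environmental activist' extracted/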
If I look at the XML extracted by WikiExtractor it looks like this:
As you can see, the first sentence of the page is missing:
{{w|Oslo|OSLO}} — The 2004 [[Nobel Peace Prize]] was awarded today to {{w|Wangari Maathai|Dr Wangari Maathai}} from [[Kenya]].
And some of the links in the following sentences are missing as well. The extracted text is:
While the original text reads (the missing links are in bold):
So: am I missing something in the configuration of WikiExtractor? Is it a bug? Are WikiNews dumps for some reason not supported, even though they should be identical in structure to the usual Wikipedia ones?