Open roeeaharoni opened 9 years ago
Hi there, can you upload the updated script? I seem to be having some issues:

saad@Arc:~/Desktop/wiki$ ./build-corpus.sh en enwiki-20151102 > titles.txt
Target language code: en
Using enwiki-20151102-langlinks.sql.gz
Using enwiki-20151102-page.sql.gz
Reading page data from enwiki-20151102-page.sql.gz... read 37804388 documents
Reading langlinks data from enwiki-20151102-langlinks.sql.gz... read 0 documents

saad@Arc:~/Desktop/wiki$ ./build-corpus.sh ur urwiki-20151123 > titles.txt
Target language code: ur
Using urwiki-20151123-langlinks.sql.gz
Using urwiki-20151123-page.sql.gz
Reading page data from urwiki-20151123-page.sql.gz... read 401280 documents
Reading langlinks data from urwiki-20151123-langlinks.sql.gz... read 0 documents
@roeeaharoni @redpony @wammar
Hi Saad,
Try the following command: ./build-corpus.sh en urwiki-20151123. The first argument is the target language code and the dump prefix names the source-language wiki, so the two should refer to different languages; langlinks only point to other languages, which is why your runs read 0 documents.
Waleed
Hi Waleed, Thank you so much! :+1: @wammar
Hey @wammar, hope you're well. I also want to extract entire articles in their English and target-language (Urdu) versions. Could you please point me in the right direction? I haven't programmed in Perl before, so I'm a little confused.
Sorry Saad, but I can hardly write Perl myself.
Hey @roeeaharoni, I also want to extract entire articles in their English and target-language (Urdu) versions. Could you please point me in the right direction? I haven't programmed in Perl before, so I'm a little confused.
if you're getting the following error:
iconv: conversion from utf8 unsupported
iconv: try 'iconv -l' to get the list of supported encodings
Go to scripts/extract.pl and change the two instances of "utf8" to "utf-8" (on lines 29 and 53). It will work.
PS: this change worked on both Ubuntu and macOS.
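For reference, a minimal sketch of that substitution (the sample iconv invocation below is illustrative, not the exact text of extract.pl):

```shell
# Demo of the utf8 -> utf-8 substitution described above, applied to a sample
# iconv invocation like the ones on lines 29 and 53 of scripts/extract.pl.
echo 'iconv -f utf8 -t utf8' | sed 's/utf8/utf-8/g'
# prints: iconv -f utf-8 -t utf-8
```

Applied to the repository itself, the equivalent in-place edit would be `sed -i.bak 's/utf8/utf-8/g' scripts/extract.pl`, run from the repo root (the .bak suffix keeps a backup); this assumes "utf8" appears nowhere else in the script, so check the diff before committing.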
@randinterval @wammar Please help, I am getting the error below ---
@imrrahul You're using both the wrong path and the wrong files (enwikivoyage).
Hi!
I tried to use the tool according to the README file, on macOS, with Hebrew as the source language and Arabic as the target. When I executed the following command (after installing the dependencies):
./build-corpus.sh ar hewiki-20141102 > titles_he_ar.txt
I got the following output:
Target language code: ar
Using hewiki-20141102-langlinks.sql.gz
Using hewiki-20141102-page.sql.gz
Reading page data from hewiki-20141102-page.sql.gz... iconv: conversion from utf8 unsupported
iconv: try 'iconv -l' to get the list of supported encodings
read 0 documents
Reading langlinks data from hewiki-20141102-langlinks.sql.gz... iconv: conversion from utf8 unsupported
iconv: try 'iconv -l' to get the list of supported encodings
read 0 documents
I tried to fix this by changing the Perl scripts that call iconv with the parameter 'utf8' to call it with 'utf-8' instead, and it seems to work fine now.
Best regards, Roee