clab / wikipedia-parallel-titles

Tools for extracting parallel corpora from article titles across languages in Wikipedia

scripts not working with unicode input on mac #1

Open roeeaharoni opened 9 years ago

roeeaharoni commented 9 years ago

Hi!

I tried to use the tool according to the README, on Mac OS X, with Hebrew as the source language and Arabic as the target. When I executed the following command (after installing the dependencies):

./build-corpus.sh ar hewiki-20141102 > titles_he_ar.txt

I got the following output:

Target language code: ar
Using hewiki-20141102-langlinks.sql.gz
Using hewiki-20141102-page.sql.gz
Reading page data from hewiki-20141102-page.sql.gz...
iconv: conversion from utf8 unsupported
iconv: try 'iconv -l' to get the list of supported encodings
read 0 documents
Reading langlinks data from hewiki-20141102-langlinks.sql.gz...
iconv: conversion from utf8 unsupported
iconv: try 'iconv -l' to get the list of supported encodings
read 0 documents

I tried to fix this by changing the Perl scripts that called iconv with the parameter 'utf8' to call it with 'utf-8' instead, and it seems to work fine now.

Best regards, Roee
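The error message itself points at `iconv -l` for the list of supported names. A quicker check is to ask iconv directly whether it accepts a given spelling. The sketch below does that; `iconv_accepts` is a hypothetical helper written for this illustration and is not part of the repository.

```shell
# Hypothetical helper (not part of the repository): succeed only if the
# local iconv accepts the given name as a source encoding.
iconv_accepts() {
  # Convert empty input; iconv fails immediately on an unknown encoding name.
  printf '' | iconv -f "$1" -t UTF-8 >/dev/null 2>&1
}

# Report which spellings this machine's iconv understands.
for name in utf8 utf-8 UTF-8; do
  if iconv_accepts "$name"; then
    echo "$name: supported"
  else
    echo "$name: not supported"
  fi
done
```

On a system showing the error above, the `utf8` line would be expected to report "not supported", which is consistent with the fix of switching to `utf-8`.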

randinterval commented 8 years ago

Hi there,

Can you upload the updated script? I seem to be having some issues.

saad@Arc:~/Desktop/wiki$ ./build-corpus.sh en enwiki-20151102 > titles.txt
Target language code: en
Using enwiki-20151102-langlinks.sql.gz
Using enwiki-20151102-page.sql.gz
Reading page data from enwiki-20151102-page.sql.gz...
read 37804388 documents
Reading langlinks data from enwiki-20151102-langlinks.sql.gz...
read 0 documents

saad@Arc:~/Desktop/wiki$ ./build-corpus.sh ur urwiki-20151123 > titles.txt
Target language code: ur
Using urwiki-20151123-langlinks.sql.gz
Using urwiki-20151123-page.sql.gz
Reading page data from urwiki-20151123-page.sql.gz...
read 401280 documents
Reading langlinks data from urwiki-20151123-langlinks.sql.gz...
read 0 documents

randinterval commented 8 years ago

@roeeaharoni @redpony @wammar

wammar commented 8 years ago

Hi Saad,

Try the following command: ./build-corpus.sh en urwiki-20151123

Waleed
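Judging by the "Target language code" and "Using ..." lines in the output earlier in this thread, the first argument is the target language code and the second names the dump of the source-language wiki. Both runs that reported "read 0 documents" for langlinks used the same language on both sides (`en` with `enwiki`, `ur` with `urwiki`). A small guard like the one below could catch that before a long run. It is only a sketch: `check_language_pair` is a hypothetical helper, not something `build-corpus.sh` provides, and it assumes dump names of the form `<lang>wiki-<date>`.

```shell
# Hypothetical guard (not part of build-corpus.sh): warn when the target
# language code equals the language prefix of the dump name.
check_language_pair() {
  target="$1"
  dump="$2"
  # Strip everything from "wiki-" onwards: "urwiki-20151123" becomes "ur".
  source_lang="${dump%%wiki-*}"
  if [ "$target" = "$source_lang" ]; then
    echo "target '$target' matches the dump language '$source_lang'" >&2
    return 1
  fi
  return 0
}

check_language_pair en urwiki-20151123 && echo "ok: ur -> en"
check_language_pair ur urwiki-20151123 || echo "same language on both sides"
```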

randinterval commented 8 years ago

Hi Waleed, Thank you so much! :+1: @wammar

randinterval commented 8 years ago

Hey @wammar, hope you're well. I also want to extract entire articles in both their English and target-language (Urdu) versions. Could you please point me in the right direction? I haven't programmed in Perl before, so I'm a little confused.

wammar commented 8 years ago

Sorry Saad, but I can hardly write Perl myself.

randinterval commented 8 years ago

Hey @roeeaharoni, I also want to extract entire articles in both their English and target-language (Urdu) versions. Could you please point me in the right direction? I haven't programmed in Perl before, so I'm a little confused.

ghost commented 7 years ago

If you're getting the following error:

iconv: conversion from utf8 unsupported
iconv: try 'iconv -l' to get the list of supported encodings

Open scripts/extract.pl and change the two instances of "utf8" to "utf-8" (lines 29 and 53). It will work.

PS: This change worked on both Ubuntu and Mac.
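For anyone who would rather not edit the file by hand, the same change can be sketched with sed. This is an illustration under assumptions: `fix_iconv_encoding` is a hypothetical helper, and the sample lines in the demo are made up rather than copied from extract.pl. The rewrite is limited to lines that mention iconv so that any unrelated use of "utf8" in a Perl script is left alone.

```shell
# Hypothetical helper: print FILE with "utf8" rewritten to "utf-8", but only
# on lines that mention iconv, so other uses of "utf8" stay untouched.
fix_iconv_encoding() {
  sed '/iconv/s/utf8/utf-8/g' "$1"
}

# Demo on a made-up sample, NOT the real contents of extract.pl.
sample="$(mktemp)"
printf '%s\n' \
  'use utf8;' \
  'my $cmd = "iconv -f utf8 -t utf8";' > "$sample"
fix_iconv_encoding "$sample"
rm -f "$sample"
```

To apply it for real, one option is to run `fix_iconv_encoding scripts/extract.pl > extract.pl.new`, compare the two files with `diff`, and only then copy the new contents over the original. The pattern does not match "utf-8", so running it twice is harmless.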

imrrahul commented 6 years ago

@randinterval @wammar Please help, I am getting the error below: [screenshot from 2018-06-03 00-29-13]

twielfaert commented 6 years ago

@imrrahul You're using both the wrong path and the wrong files (enwikivoyage).