amittai / cynical

Cynical data selection
MIT License

jaded.all_other_data.en is empty #1

Closed surafelml closed 6 years ago

surafelml commented 6 years ago

Hi @amittai After running the algorithm using toy files (representative_sentences.en, .all_other_data.en), the execution finishes with an empty Jaded text file. Any hint on what's going on?

Many Thanks!

amittai commented 6 years ago

Hi -- That shouldn't be the case, so let's try to figure out why. Could you tell me what's in the .stderr file? Is .stdout empty? Cheers, ~amittai

surafelml commented 6 years ago

For instance, I tried to run the algorithm using "representative_sentences.en" (an English corpus with 200K sentences) and "representative_sentences.en" (a separate English corpus with 400K sentences); here are the outputs:

*.stdout

```
boring000 910
boring0 991
dubious 165
boring0000 620
boring__00 979
```

*.stderr:

I have tried to see what's going on at line 273 of amittai-cynical-selection.pl, and it doesn't enter the loop, which I believe is responsible for populating the jaded file with sorted sentences:

```perl
print STDERR "running max $debuggingcounter iterations!";
while ($debuggingcounter > 0){
    ## get best word from our sorted list
    last unless (@word_gain_estimates);
```

Thanks!!

amittai commented 6 years ago

It looks like you have set the REPRESENTATIVE corpus to be the same as the UNADAPTED corpus. Is that the intention? That forces all of the probability ratios to be 1, which means that the method thinks there is no discriminative information in the vocabulary.

In the domain adaptation scenario (most common), the REPRESENTATIVE corpus is the data you wish you were good at (often fairly small -- sounds like your 200k set), and the UNADAPTED corpus is the data you actually have available for training (often much larger, sounds like your 400k set). Normally also UNADAPT = AVAILABLE.
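(For intuition, here is a minimal Python sketch of the ratio computation described above; this is not the repo's actual code, and the function name `prob_ratios` is made up for illustration. It shows why using the same file for both corpora forces every ratio to 1.)

```python
from collections import Counter

def prob_ratios(task_corpus, unadapted_corpus):
    """Ratio of each word's relative frequency in TASK vs UNADAPTED."""
    task_counts = Counter(w for line in task_corpus for w in line.split())
    unad_counts = Counter(w for line in unadapted_corpus for w in line.split())
    task_total = sum(task_counts.values())
    unad_total = sum(unad_counts.values())
    # Only words shared by both corpora have a defined ratio.
    return {w: (task_counts[w] / task_total) / (unad_counts[w] / unad_total)
            for w in task_counts if w in unad_counts}

corpus = ["the cat sat", "the dog ran"]
# Same data on both sides: every ratio is exactly 1.0, so no word
# tells the method anything about which sentences to prefer.
print(prob_ratios(corpus, corpus))
```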

surafelml commented 6 years ago

Apologies for the typo in my previous comment. I specified the 400k set as "representative_sentences.en", when it was actually "all_other_data.en". As you mentioned, I have double-checked this, but I still end up with the same kind of issue (an empty "jaded.representative_sentences.en" file).

amittai commented 6 years ago

Hullo -- It's actually the info in *.stdout that's making me suspicious, not the typo! STDOUT is the entire vocabulary that the selection method is using. According to that file, absolutely every element of the lexicon appears with the same probability in your REPRESENTATIVE and UNADAPTED corpora (all words have been replaced with special tokens that mean "the ratio is the same"). That's really really unusual, and i'm used to seeing it when i accidentally use the same file twice :)
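(A hedged sketch of the "squishing" idea, in Python rather than the repo's Perl: words whose probability ratio is close to 1 carry no discriminative information, so they can be collapsed into generic placeholder tokens. The threshold and the bucket name `boring` here are illustrative assumptions; the real script's bucketing scheme differs.)

```python
import math

def squish(ratios, threshold=0.1):
    """Replace words whose log-ratio is near zero with a generic token.

    Hypothetical illustration: a word with ratio ~ 1 appears with the
    same relative frequency in both corpora, so it cannot help rank
    candidate sentences.
    """
    squished = {}
    for word, ratio in ratios.items():
        if abs(math.log(ratio)) < threshold:
            squished[word] = "boring"   # ratio ~ 1: no signal
        else:
            squished[word] = word       # keep informative words
    return squished

ratios = {"privatisation": 0.02, "the": 1.01, "Kosovo": 0.001}
print(squish(ratios))
# If TASK == UNADAPTED, every ratio is exactly 1, so *every* word
# gets squished -- which matches the all-"boring" stdout above.
```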

1) do you get the same result if you delete absolutely everything (vocab files, counts, fix, squish, jaded* etc etc) and try again?
2) If so, what's in the vocab.ratios file? Could you attach the first few lines of that?
3) what's the provenance of the two corpora? are you able to attach the first few lines of each?
4) may i see the filled-out template script?

surafelml commented 6 years ago
  1. Yes, I get the same result after deleting and trying again - I have tried this several times even using different representative and unadapted datasets.

  2. vocab.ratios first few lines

    sub-orbital     202.2608208575  0.0001985999    4       0.0000009819    4
    Tourist 202.2566400521  0.0002482498    5       0.0000012274    5
    briquettes      202.2493437132  0.0004468497    9       0.0000022094    9
    year-and-a-half 202.2403258656  0.0000496500    1       0.0000002455    1
    Woz     202.2403258656  0.0000496500    1       0.0000002455    1
    water-treatment 202.2403258656  0.0000496500    1       0.0000002455    1
  3. Both files are speech transcriptions (e.g. TED talks)

REPRESENTATIVE

Thank you so much , Chris . And it 's truly a great honor to have the opportunity to come to this stage twice ; I 'm extremely grateful .
I have been blown away by this conference , and I want to thank all of you for the many nice comments
about what I had to say the other night . And I say that sincerely ,
partly because I need that .
I flew on Air Force Two for eight years .
Now I have to take off my shoes or boots to get on an airplane !

UNADAPTED

Kosovo's privatisation process is under scrutiny
Kosovo is taking a hard look at its privatisation process in light of recurring complaints.
By Muhamet Brajshori for Southeast European Times in Pristina -- 21/03/12
Feronikel was privatised five years ago, and is still in business, but operates amid concerns for workers’ safety. [Reuters]
On paper at least, it looks like a great idea.
The government sells a business, gets out from under the yoke of management, and money from the sale helps to fund the state budget.
But in Kosovo, critics say the legal process involved with privatisation is both complex and politically charged, which will have a long-term impact on the economy.
They say some owners or employees can take advantage of specific loopholes while others get almost nothing.
There is also an ethnic component, since individuals from various communities can say that discrimination -- either ongoing or previous -- affected their ability to benefit from privatisation.
Esat Berisha is one such example.
  4. template script (I have both the REPRESENTATIVE and UNADAPTED data in the cynical_data-selection path)

```bash
data_dir=/path/to/cynical_data-selection
code_path=/path/to/cynical_data-selection

## these two are just used to compute the vocab stats that define the
## probability distributions for the language used in each corpus.
task_distribution_file="representative_sentences.en" ## data used to define the task
unadapted_distribution_file="all_other_data.en" ## what the rest of the data looks like

## these two are corpora.
seed_corpus_file=$task_distribution_file #$"" ## anything already translated from available_corpus
available_corpus_file=$unadapted_distribution_file ## candidate sentence pool
available_corpus_file=$task_distribution_file ## candidate sentence pool

## this is the output file
jaded_file=jaded.$available_corpus_file

## batchmode selects log(k) sentences per iteration. much faster, much
## more approximate. disabled by default, but essential for huge
## corpora. set to 1 to enable.
batchmode=1

## ignore words that appear fewer than $mincount times. default 3.
mincount=3

## set keep_boring to 1 if you'd like to NOT squish words with a vocab
## ratio close to 1. this is not common.
keep_boring=0

## set to 1 to lowercase the data. helps reduce lexicon further.
needs_lowercasing=0

## TO-DO: add a $verbose var to turn off all the logging.
verbose=0

working_dir=$data_dir
mkdir -p $working_dir

## mark all tokens that start with double underscores, because we use
## __ to indicate special information later.
for file in $task_distribution_file $unadapted_distribution_file $seed_corpus_file $available_corpus_file; do
    if [ ! -f $file.fix ]; then
        ## "match at least two underscores preceded by space or
        ## start-of-line, and mark them"
        ## also need to nuke non-breaking whitespace:
        ## s/\s+/ /g;  ## in utf8 strings, \s matches non-breaking space
        ## perl -CS : declare STD{IN,OUT,ERR} to be UTF-8
        cat $file \
            | perl -pe 's/(\s|^)(__+)/$1\@$2\@/g;' \
            | perl -CS -pe 's/\s+/ /g; s/ $//; $_.="\n"' \
            > $file.fix
    fi;
done;

for file in $task_distribution_file $unadapted_distribution_file $seed_corpus_file $available_corpus_file; do
    echo -n " * compute vocab stats for $file ... "
    input=$data_dir/$file.fix
    output=$working_dir/vocab.$file.fix
    input_tmp=$data_dir/$file.tmp
    if [ ! -f $output ]; then
        if [ "$needs_lowercasing" -eq "1" ]; then
            ## perl -CS : declare STD{IN,OUT,ERR} to be UTF-8
            ## see http://perldoc.perl.org/perlrun.html#*-C-[_number/list_]*
            cat $input | perl -CS -pe '$_=lc($_);' > $input_tmp
        else
            ln -s $input $input_tmp
        fi

        $code_path/amittai-vocab-compute-counts.pl \
            --corpus=$input_tmp                     \
            --vcb_file=$output
        rm $input_tmp
    fi
    echo "...done"
done;

## compute relative vocab stats for the corpora
echo -n " * compute relative statistics between corpora ... "
output=$working_dir/vocab.ratios.task-unadapted
if [ ! -f $output ]; then
    $code_path/amittai-vocab-ratios.pl \
        --model1=$working_dir/vocab.$task_distribution_file.fix \
        --model2=$working_dir/vocab.$unadapted_distribution_file.fix \
        | sort --general-numeric-sort --reverse --key=2,2 \
        > $output
fi
echo "...done"

## read in relative vocab statistics
input=$output
echo " * tmp_message: calling perl script ... "

## stdout/stderr contain useful info for debugging. the actual
## selected data appears in the $jaded file.
stdoutput=$working_dir/$available_corpus_file.cynical
output=$working_dir/$jaded_file
flags=" --mincount=$mincount ";
if [ "$batchmode" -gt "0" ]; then
    ## set batchmode flag
    flags="${flags} --batchmode "
fi
if [ "$keep_boring" -gt "0" ]; then
    ## set keep_boring flag
    flags="${flags} --keep_boring "
fi

if [ ! -f $working_dir/$jaded_file ]; then
    ## CALL PERL SCRIPTS FOR SELECTION
    $code_path/amittai-cynical-selection.pl \
        --task=$data_dir/$task_distribution_file.fix \
        --unadapted=$data_dir/$unadapted_distribution_file.fix \
        --available=$data_dir/$available_corpus_file.fix \
        --seed_vocab=$working_dir/vocab.$seed_corpus_file.fix \
        --working_dir=$working_dir \
        --stats=$input \
        --jaded=$output $flags \
        > $stdoutput.stdout 2> $stdoutput.stderr
fi
echo "...done"

## Note that the jaded.*.txt file will contain double_underscores marked as "@__@".
exit;
```



Thanks!
amittai commented 6 years ago

Thanks for the information! I think this is the key:

```
task_distribution_file="representative_sentences.en" ## data used to define the task
unadapted_distribution_file="all_other_data.en" ## what the rest of the data looks like
seed_corpus_file=$task_distribution_file ## anything already translated from available_corpus
available_corpus_file=$task_distribution_file ## candidate sentence pool
```

This setup says: "Pick sentences from TASK, and add them to TASK, in order to build a better system for TASK". This is probably not what you're trying to do! For one, that would give you a total training corpus that is just two copies of TASK concatenated together. For two, it means that UNADAPT isn't being used for anything!

TASK is the kind of data you know you want to build a system for. this is data that you do believe is good. UNADAPT is the data that you currently have and *do not know* whether it is good. SEED is the data you want to add to. if it's empty then it means you want to build a system starting from scratch (because you have some reason for not concatenating the selected data with the TASK data, such as you want to build a 2-model system). it's ok to have SEED=TASK; that means that you intend to concatenate the selected data to the TASK data and train one system that is bigger than if you only had TASK data, but smaller than if you used all of AVAIL. AVAILABLE is the data you are going to pick from. nearly always: AVAIL=UNADAPT.

If i wanted to: "pick sentences from UNADAPT to build a system tailored to TASK, but don't include the sentences from TASK in the new system" then i'd use:

```
task_distribution_file="representative_sentences.en" ## data used to define the task
unadapted_distribution_file="all_other_data.en" ## what the rest of the data looks like
seed_corpus_file="" ## anything already translated from available_corpus
available_corpus_file=$unadapted_distribution_file ## candidate sentence pool
```

If i wanted to: "pick some sentences from UNADAPT to add to TASK so that i can build a bigger and better system than if i only used the TASK data by itself" then i'd use:

```
task_distribution_file="representative_sentences.en" ## data used to define the task
unadapted_distribution_file="all_other_data.en" ## what the rest of the data looks like
seed_corpus_file=$task_distribution_file ## anything already translated from available_corpus
available_corpus_file=$unadapted_distribution_file ## candidate sentence pool
```

Please let me know if this works (or not)! Also, please pull the latest commit from last week. It's just bug fixes, but some of them include clearer error messages and sanity checking. My apologies for not having written a clearer explanation from the beginning. Releasing more documentation is on my to-do list, and I'll bear this in mind. Thanks for your help!

surafelml commented 6 years ago

Thank you for the detailed walk-through! I have tried both configurations (adding / not adding the TASK data while picking the new sentences from UNADAPTED). However, I am again ending up with an empty "jaded.all_other_data.en". I'm sure everything on your side is working perfectly, but something is not yet clear to me.

I am also going through your paper again to figure out how the selection script works: https://arxiv.org/pdf/1709.02279.pdf

I will do further experiments with the new pull and will add more on the outcomes.

Thanks!

amittai commented 6 years ago

Please keep me posted!