Closed surafelml closed 6 years ago
Hi -- That shouldn't be the case, so let's try to figure out why. Could you tell me what's in the .stderr file? Is .stdout empty? Cheers, ~amittai
For instance, I tried to run the algorithm using "representative_sentences.en" --> an English corpus with 200K sentences, and "representative_sentences.en" --> a separate English corpus with 400K sentences. Here are the outputs:
*.stdout
*.stderr:
I have tried to see what's going on at line 273 of amittai-cynical-selection.pl, and execution never enters the loop that I believe is responsible for populating the jaded file with sorted sentences:
```perl
print STDERR "running max $debuggingcounter iterations!";
while ($debuggingcounter > 0){
    last unless (@word_gain_estimates);
```
Thanks!!
It looks like you have set the REPRESENTATIVE corpus to be the same as the UNADAPTED corpus. Is that the intention? That forces all of the probability ratios to be 1, which means the method thinks there is no discriminative information in the vocabulary.
In the domain adaptation scenario (most common), the REPRESENTATIVE corpus is the data you wish you were good at (often fairly small -- sounds like your 200k set), and the UNADAPTED corpus is the data you actually have available for training (often much larger, sounds like your 400k set). Normally also UNADAPT = AVAILABLE.
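To see why identical corpora are degenerate, here is a toy sketch (my own illustration, not the actual amittai-vocab-ratios.pl logic; `rep.txt` and `unadapt.txt` are made-up filenames): when REPRESENTATIVE and UNADAPTED are the same data, every word's unigram probability ratio comes out to exactly 1, so no word looks more task-like than any other.

```shell
## Toy illustration: identical corpora give a probability ratio of 1 for
## every word, so the selection method has nothing to discriminate on.
printf 'the cat sat\nthe dog ran\n' > rep.txt
cp rep.txt unadapt.txt          ## simulate REPRESENTATIVE == UNADAPTED

word_prob () {                  ## unigram probability of word $1 in file $2
  awk -v w="$1" '{for(i=1;i<=NF;i++){n++; if($i==w)c++}} END{print c/n}' "$2"
}

p_rep=$(word_prob the rep.txt)
p_una=$(word_prob the unadapt.txt)
awk -v a="$p_rep" -v b="$p_una" 'BEGIN{print a/b}'   ## prints 1
```

The same ratio of 1 falls out for every word in the vocabulary, which matches the "all words replaced with special tokens" behavior described below.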
Apologies for the typo in my previous comment: I specified the 400k set as "representative_sentences.en", when it was actually "all_other_data.en". As you mentioned, I have double-checked this, but I still end up with the same kind of issue (an empty "jaded.representative_sentences.en" file).
Hullo -- It's actually the info in *.stdout that's making me suspicious, not the typo! STDOUT is the entire vocabulary that the selection method is using. According to that file, absolutely every element of the lexicon appears with the same probability in your REPRESENTATIVE and UNADAPTED corpora (all words have been replaced with special tokens that mean "the ratio is the same"). That's really really unusual, and i'm used to seeing it when i accidentally use the same file twice :)
1) do you get the same result if you delete absolutely everything (vocab files, counts, fix, squish, jaded* etc etc) and try again?
2) If so, what's in the vocab.ratios file? Could you attach the first few lines of that?
3) what's the provenance of the two corpora? are you able to attach the first few lines of each?
4) may i see the filled-out template script?
Yes, I get the same result after deleting and trying again - I have tried this several times even using different representative and unadapted datasets.
The first few lines of vocab.ratios:
sub-orbital 202.2608208575 0.0001985999 4 0.0000009819 4
Tourist 202.2566400521 0.0002482498 5 0.0000012274 5
briquettes 202.2493437132 0.0004468497 9 0.0000022094 9
year-and-a-half 202.2403258656 0.0000496500 1 0.0000002455 1
Woz 202.2403258656 0.0000496500 1 0.0000002455 1
water-treatment 202.2403258656 0.0000496500 1 0.0000002455 1
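For reference, the columns here appear to be: word, probability ratio, probability and count in the first model, probability and count in the second model (this is my reading of the output, inferred from the numbers rather than from documentation). A quick check that column 2 really is column 3 divided by column 5:

```shell
## Verify that the ratio column (2) equals prob1/prob2 (columns 3 and 5)
## on two lines copied from the vocab.ratios output above.
cat > ratios.sample <<'EOF'
sub-orbital 202.2608208575 0.0001985999 4 0.0000009819 4
Tourist 202.2566400521 0.0002482498 5 0.0000012274 5
EOF
awk '{printf "%s listed=%.4f computed=%.4f\n", $1, $2, $3/$5}' ratios.sample
```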
Both files are speech transcriptions (e.g., TED talks).
REPRESENTATIVE
Thank you so much , Chris . And it 's truly a great honor to have the opportunity to come to this stage twice ; I 'm extremely grateful .
I have been blown away by this conference , and I want to thank all of you for the many nice comments
about what I had to say the other night . And I say that sincerely ,
partly because I need that .
I flew on Air Force Two for eight years .
Now I have to take off my shoes or boots to get on an airplane !
UNADAPTED
Kosovo's privatisation process is under scrutiny
Kosovo is taking a hard look at its privatisation process in light of recurring complaints.
By Muhamet Brajshori for Southeast European Times in Pristina -- 21/03/12
Feronikel was privatised five years ago, and is still in business, but operates amid concerns for workers’ safety. [Reuters]
On paper at least, it looks like a great idea.
The government sells a business, gets out from under the yoke of management, and money from the sale helps to fund the state budget.
But in Kosovo, critics say the legal process involved with privatisation is both complex and politically charged, which will have a long-term impact on the economy.
They say some owners or employees can take advantage of specific loopholes while others get almost nothing.
There is also an ethnic component, since individuals from various communities can say that discrimination -- either ongoing or previous -- affected their ability to benefit from privatisation.
Esat Berisha is one such example.
data_dir=/path/to/cynical_data-selection
code_path=/path/to/cynical_data-selection

task_distribution_file="representative_sentences.en"   ## data used to define the task
unadapted_distribution_file="all_other_data.en"        ## what the rest of the data looks like
seed_corpus_file=$task_distribution_file #$""          ## anything already translated from available_corpus
available_corpus_file=$task_distribution_file          ## candidate sentence pool
jaded_file=jaded.$available_corpus_file

batchmode=1
mincount=3
keep_boring=0
needs_lowercasing=0
verbose=0

working_dir=$data_dir
mkdir -p $working_dir

for file in $task_distribution_file $unadapted_distribution_file $seed_corpus_file $available_corpus_file; do
    if [ ! -f $file.fix ]; then
        ## find double-underscores after whitespace or at start-of-line, and mark them
        ## also need to nuke non-breaking whitespace:
        ##   s/\s+/ /g; ## in utf8 strings, \s matches non-breaking space
        ## perl -CS : declare STD{IN,OUT,ERR} to be UTF-8
        cat $file \
        | perl -pe 's/(\s|^)(__+)/$1\@$2\@/g;' \
        | perl -CS -pe 's/\s+/ /g; s/ $//; $_.="\n"' \
        > $file.fix
    fi;
done;

for file in $task_distribution_file $unadapted_distribution_file $seed_corpus_file $available_corpus_file; do
    echo -n " * compute vocab stats for $file ... "
    input=$data_dir/$file.fix
    output=$working_dir/vocab.$file.fix
    input_tmp=$data_dir/$file.tmp
    if [ ! -f $output ]; then
        if [ "$needs_lowercasing" -eq "1" ]; then
            ## see http://perldoc.perl.org/perlrun.html#*-C-[_number/list_]*
            cat $input | perl -CS -pe '$_=lc($_);' > $input_tmp
        else
            ln -s $input $input_tmp
        fi
        $code_path/amittai-vocab-compute-counts.pl \
            --corpus=$input_tmp \
            --vcb_file=$output
        rm $input_tmp
    fi
    echo "...done"
done;

echo -n " * compute relative statistics between corpora ... "
output=$working_dir/vocab.ratios.task-unadapted
if [ ! -f $output ]; then
    $code_path/amittai-vocab-ratios.pl \
        --model1=$working_dir/vocab.$task_distribution_file.fix \
        --model2=$working_dir/vocab.$unadapted_distribution_file.fix \
    | sort --general-numeric-sort --reverse --key=2,2 \
    > $output
fi
echo "...done"

input=$output
echo " * tmp_message: calling perl script ... "
stdoutput=$working_dir/$available_corpus_file.cynical
output=$working_dir/$jaded_file
flags=" --mincount=$mincount ";
if [ "$batchmode" -gt "0" ]; then
    flags="${flags} --batchmode "
fi
if [ "$keep_boring" -gt "0" ]; then
    flags="${flags} --keep_boring "
fi

if [ ! -f $working_dir/$jaded_file ]; then
    ## CALL PERL SCRIPT FOR SELECTION
    $code_path/amittai-cynical-selection.pl \
        --task=$data_dir/$task_distribution_file.fix \
        --unadapted=$data_dir/$unadapted_distribution_file.fix \
        --available=$data_dir/$available_corpus_file.fix \
        --seed_vocab=$working_dir/vocab.$seed_corpus_file.fix \
        --working_dir=$working_dir \
        --stats=$input \
        --jaded=$output $flags \
        > $stdoutput.stdout 2> $stdoutput.stderr
fi
echo "...done"

## Note that the jaded.*.txt file will contain double_underscores marked as "@__@".
exit;
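Given the symptom in this thread, one possible addition to the template above is a pre-flight check (my own suggestion, not part of the released script; toy filenames and contents below are just for the demo): bail out early if TASK and UNADAPTED are byte-identical, since identical corpora force every vocabulary ratio to 1 and leave the jaded file empty.

```shell
## Suggested pre-flight check: refuse to run if TASK and UNADAPTED are
## the same data. Toy files are created here so the sketch is runnable.
task=representative_sentences.en          ## hypothetical filename
unadapted=all_other_data.en               ## hypothetical filename
printf 'example task sentence\n' > "$task"
printf 'example unadapted sentence\n' > "$unadapted"

if cmp -s "$task" "$unadapted"; then
    echo "ERROR: TASK and UNADAPTED corpora are identical files" >&2
    exit 1
fi
echo "ok: corpora differ"
```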
Thanks!
Thanks for the information! I think this is the key:
task_distribution_file="representative_sentences.en"  ## data used to define the task
unadapted_distribution_file="all_other_data.en"       ## what the rest of the data looks like
seed_corpus_file=$task_distribution_file              ## anything already translated from available_corpus
available_corpus_file=$task_distribution_file         ## candidate sentence pool
This setup says: "Pick sentences from TASK, and add them to TASK, in order to build a better system for TASK". This is probably not what you're trying to do! For one, that would give you a total training corpus that is just two copies of TASK concatenated together. For two, it means that UNADAPT isn't being used for anything!
TASK is the kind of data you know you want to build a system for; this is data that you do believe is good. UNADAPT is the data that you currently have and do **not** know whether it is good. SEED is the data you want to add to; if it's empty then it means you want to build a system starting from scratch (because you have some reason for not concatenating the selected data with the TASK data, such as wanting to build a 2-model system). it's ok to have SEED=TASK; that means that you intend to concatenate the selected data to the TASK data and train one system that is bigger than if you only had TASK data, but smaller than if you used all of AVAIL. AVAILABLE is the data you are going to pick from. nearly always: AVAIL=UNADAPT.
If i wanted to: "pick sentences from UNADAPT to build a system tailored to TASK, but don't include the sentences from TASK in the new system" then i'd use:
task_distribution_file="representative_sentences.en"      ## data used to define the task
unadapted_distribution_file="all_other_data.en"           ## what the rest of the data looks like
seed_corpus_file=""                                       ## anything already translated from available_corpus
available_corpus_file=$unadapted_distribution_file        ## candidate sentence pool
If i wanted to: "pick some sentences from UNADAPT to add to TASK so that i can build a bigger and better system than if i only used the TASK data by itself" then i'd use:
task_distribution_file="representative_sentences.en"      ## data used to define the task
unadapted_distribution_file="all_other_data.en"           ## what the rest of the data looks like
seed_corpus_file=$task_distribution_file                  ## anything already translated from available_corpus
available_corpus_file=$unadapted_distribution_file        ## candidate sentence pool
Please let me know if this works (or not)! Also, please pull the latest commit from last week. It's just bug fixes, but some of them include clearer error messages and sanity checking. My apologies for not having written a clearer explanation from the beginning. Releasing more documentation is on my to-do list, and I'll bear this in mind. Thanks for your help!
Thank you for the detailed walk-through! I have tried both configurations, adding and not adding the TASK data while picking the new sentences from UNADAPTED. However, I am again ending up with an empty "jaded.all_other_data.en". I'm sure that everything is working correctly on your end, but something is not yet clear to me.
I am also going through your papers again to figure out how the selection script works: https://arxiv.org/pdf/1709.02279.pdf
I will do further experiments with the new pull and will add more on the outcomes.
Thanks!
Please keep me posted!
Hi @amittai -- After running the algorithm using toy files (representative_sentences.en, all_other_data.en), the execution finishes with an empty jaded text file. Any hint on what's going on?
Many Thanks!