Closed EomSooHwan closed 2 years ago
Hey,
The unpickling error comes from the fact that the features extraction failed and therefore you have some empty features archive and nothing to load.
The recipe assumes an SGE-like cluster (i.e. one with the `qsub` command) to parallelize the feature extraction, but apparently your environment doesn't have it. In order to use beer you will definitely need a cluster (we use it for the feature extraction and the training) and also a GPU (for later stages).
Could you tell us more about your computing environment? Do you have access to a distributed cluster? If yes, is it using SGE or something else like SLURM?
I am sorry, I am not familiar with distributed clusters. Can you help me find out whether my server has one? I am working on a remote server shared with my lab members. Also, our server has 8 GPUs with around 12 GB each.
The best is probably to ask your system admin and/or your colleagues about the computing facilities of your lab.
It seems that you don't have an SGE-like environment (no `qsub` command). If you had a Slurm environment you would have the `sbatch` command. If you just have access to a single machine (like a server) then you can either use the Kaldi script https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/parallel/run.pl or, alternatively, the GNU parallel software (`run.pl` is probably easier).
In all cases, since you don't have `qsub`, you will need to create a new directory, say `my_parallel_env`, in https://github.com/beer-asr/beer/tree/master/recipes/aud/utils/parallel/ that contains 2 scripts:

- `parallel.sh`, which launches n parallel jobs (this is used for extracting the features and for training)
- `single.sh`, which launches just one task.

Have a look at the example here: https://github.com/beer-asr/beer/tree/master/recipes/aud/utils/parallel/sge. Once this is done, you can specify `my_parallel_env` as the parallel environment when calling `submit_parallel.sh`.
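For anyone landing here later, here is a minimal sketch of what a local `single.sh` could look like. The `<name> <opts> <cmd> <log-dir>` argument convention and the defaults are illustrative assumptions (mirroring the `parallel.sh` interface shown later in this thread), not beer's actual API:

```shell
#!/bin/bash
# Hypothetical single.sh for a plain server (no scheduler).
# The defaults below are demo placeholders so the sketch runs standalone.
name=${1:-demo}
opts=${2:-}                                  # scheduler options; unused locally
cmd=${3:-"echo hello from single.sh"}
log_dir=${4:-./logs}

mkdir -p "$log_dir"
# Run through eval so pipes/redirections inside $cmd are honored,
# and keep stdout+stderr in a per-task log file.
eval "$cmd" > "$log_dir/${name}.log" 2>&1
```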
Thank you for your detailed advice! It seems my server does not have `slurm-client` installed, so I think I should check the Kaldi script. I will try the methods you suggested and let you know if any problems come up.
Sorry, I am currently having trouble with how to use `run.pl` inside the code.

I have added `my_parallel_env` in `recipes/hshmm/utils/parallel` with `run.pl`, but I am confused about how to write `parallel.sh` and `single.sh` for `run.pl`. The code currently cannot interpret `run.pl` as a command. Also, would the command `run.pl JOB=1:$njobs $log_dir/${name}.JOB.log "$cmd" $split_dir || exit 1` be okay?

Sorry for the inconvenience.
Not sure I understand your issue here. You need to download `run.pl` and write a new `parallel.sh` that calls this file directly, something like: `/my/path/to/run.pl JOB=1:$nbjobs ...`
As for how to write `parallel.sh`: the purpose of this script is to execute a task, say extracting features, on several input files in parallel. The input files are divided with the `split` command, so they all have names of the form `x[0-9]*`. For instance, this is how I treat them with the SGE `qsub` command: https://github.com/beer-asr/beer/blob/d53d2a108761ac2fa3fbe5bfaeaa529bac94d350/recipes/aud/utils/parallel/sge/jobarray.qsub#L26
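For context, the `x[0-9]*` names come from `split`; a quick throwaway demo of the naming convention (the flags here are illustrative, not necessarily the ones the recipe uses):

```shell
# split's default suffixes are alphabetic (xaa, xab, ...);
# -d switches to numeric suffixes, -l sets lines per chunk.
tmp=$(mktemp -d)
seq 1 100 > "$tmp/scp"
split -d -l 25 "$tmp/scp" "$tmp/x"
ls "$tmp" | grep '^x'   # x00 x01 x02 x03
```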
So I think your `parallel.sh` should look like this (not tested):
```shell
# Given as argument to the script.
splitdir=...

# This option is necessary to use the pattern `x*(0)JOB`.
shopt -s extglob

# I use JOBID as job identifier but run.pl uses JOB.
cmd=$(echo $cmd | sed s/JOBID/JOB/g)

cmd="cat $splitdir/x*(0)JOB \| eval $cmd"
run.pl JOB=1:$njobs $log_dir/${name}.JOB.log "$cmd"
```
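As a side note on the `x*(0)JOB` pattern above: with `extglob` enabled, `*(0)` means "any run of zeros", so a single pattern matches a job id with or without split's zero-padding. A quick check with throwaway file names:

```shell
shopt -s extglob
tmp=$(mktemp -d)
touch "$tmp/x5" "$tmp/x0005" "$tmp/x0050"
# x*(0)5 = 'x', then any run of zeros, then '5'; x0050 does not match.
matches=$(ls "$tmp"/x*(0)5)
echo "$matches" | xargs -n1 basename   # x0005 x5
```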
I think `cmd="cat $splitdir/x*(0)JOB \| eval $cmd"` is not working because run.pl reads `x*(0)JOB` as `x*(0)1`, `x*(0)2`, and so on. Is there any way I could indicate `x0001`, `x0002`, ... using JOB?
I have tried various methods, including

```
perl run.pl JOB=1:$njobs $log_dir/${name}.JOB.log "cat $split_dir/x$(printf "%04d" $JOB) | eval $cmd"
```

but it failed to access each of `x0001`, `x0002`, and so on.

Also, even when it can access the split directory, it produces the error message

```
cat /mnt/hdd/workspace/features/alffa/sw/train/split/x0001 | eval beer features extract conf/mfcc.yml - /mnt/hdd/workspace/features/alffa/sw/train/mfcc_tmp: No such file or directory
```

However, I checked that `mfcc_tmp` is created every time, and I also checked `conf/mfcc.yml`, so I am not sure where this error comes from.
> I think `cmd="cat $splitdir/x*(0)JOB \| eval $cmd"` is not working because run.pl reads `x*(0)JOB` as `x*(0)1`, `x*(0)2`, and so on. Is there any way I could indicate `x0001`, `x0002`, ... using JOB?
Yes, I knew this wouldn't work so easily. The `split` command is very annoying with its output file names. I think the easiest solution is for your script to rename all the files:
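A hedged sketch of that renaming idea: strip split's zero-padding so `x0001` becomes `x1` and the names match run.pl's `JOB` substitution directly (the directory and file names below are illustrative stand-ins):

```shell
splitdir=$(mktemp -d)
touch "$splitdir"/x000{1..3} "$splitdir/x0010"   # stand-ins for split output
for f in "$splitdir"/x*; do
    n=${f##*/x}                        # numeric suffix, e.g. 0001
    mv "$f" "$splitdir/x$((10#$n))"    # 10# forces base 10, dropping leading zeros
done
ls "$splitdir"   # x1 x10 x2 x3
```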
> Also, even when it can access the split directory, it produces the error message `cat /mnt/hdd/workspace/features/alffa/sw/train/split/x0001 | eval beer features extract conf/mfcc.yml - /mnt/hdd/workspace/features/alffa/sw/train/mfcc_tmp: No such file or directory`. However, I checked that `mfcc_tmp` is created every time, and I also checked `conf/mfcc.yml`, so I am not sure where this error comes from.
This one I'm not so sure about. Perhaps try providing absolute paths; it is possible that run.pl executes the process from a different working directory. Otherwise, you can check the feature extraction script, https://github.com/beer-asr/beer/blob/master/beer/cli/subcommands/features/extract.py, to see which file cannot be found.
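One way to test the working-directory hypothesis is to rewrite the paths inside each split chunk to absolute ones before launching the jobs. A rough sketch with throwaway files, assuming an `utt path` line format (the real beer chunk format may differ) and that `realpath` is available:

```shell
tmp=$(mktemp -d)
cd "$tmp"
touch a.wav b.wav
printf 'utt1 a.wav\nutt2 b.wav\n' > x01   # toy scp-style chunk
# Rewrite the path column to an absolute path.
while read -r utt path; do
    echo "$utt $(realpath "$path")"
done < x01 > x01.abs
cat x01.abs
```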
Thank you for your advice. I have changed `parallel.sh` to run run.pl once per number of digits (1 to 9, 10 to 99, ...), so I think this problem is partially fixed.

However, the command

```
cat /mnt/hdd/workspace/features/alffa/sw/train/split/x01 | eval beer features extract /mnt/hdd/workspace/beer/recipes/hshmm/conf/mfcc.yml - /mnt/hdd/workspace/features/alffa/sw/train/mfcc_tmp
```

still gives me a `No such file or directory` error message. Moreover, when I run this command in a terminal, `beer features extract` works totally fine (I did change `extract.py` a bit so that it creates the directory for the save path). I am not sure which part is causing this error.
Could you show me the result of:

```
head /mnt/hdd/workspace/features/alffa/sw/train/split/x01
```

And also the content of `parallel.sh` (and related scripts)? My guess is that the paths in `/mnt/hdd/workspace/features/alffa/sw/train/split/x01` are relative to your working directory.
The paths in `/mnt/hdd/workspace/features/alffa/sw/train/split/x01` are absolute paths. Also, my `parallel.sh` is:
```shell
#!/bin/bash

if [ $# -ne 6 ]; then
    echo "$0 <name> <opts> <njobs> <split-dir> <cmd> <log-dir>"
    exit 1
fi

name=$1
opts=$2
njobs=$3
split_dir=$4
cmd=$5
log_dir=$6

shopt -s extglob
cmd=$(echo $cmd | sed s/JOBID/JOB/g)

perl run.pl JOB=1:9 $log_dir/${name}.JOB.log "cat $split_dir/x0JOB | eval $cmd"
perl run.pl JOB=10:32 $log_dir/${name}.JOB.log "cat $split_dir/xJOB | eval $cmd"
```
This looks OK to me... Just for debugging purposes, could you please add `echo $cmd` just before the `perl ...` statements and let me know the output?
Actually, I think I figured out what the problem was. I had to do

```
perl run.pl JOB=1:9 $log_dir/${name}.JOB.log cat "$split_dir/x0JOB" \| eval $cmd
```

because the code was reading the whole quoted command as some kind of directory.

Thank you for helping me a lot with this problem!
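For readers hitting the same thing: the difference is whether the pipeline reaches run.pl as one opaque shell string or as pre-split words. A generic illustration of the quoting effect (this demonstrates shell word splitting only, not run.pl's actual argument handling):

```shell
# Count how many arguments a command receives.
count_args() { echo $#; }

cmd='beer features extract conf.yml'
count_args "cat x01 | eval $cmd"    # 1  (one opaque string)
count_args cat x01 \| eval $cmd     # 8  (separate words, literal |)
```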
While creating the dataset for the alffa database in the hshmm recipe, I encountered an error message

with another exception error message

One possible reason I could think of is that during feature extraction the message said

but I am not sure if this is the case, since the code kept running after that message.

Excuse me if this is a minor issue; I am new to the community.