beer-asr / beer

Bayesian spEEch Recognizer
MIT License

Failed to interpret mfcc.npz file as a pickle #102

Closed. EomSooHwan closed this issue 2 years ago.

EomSooHwan commented 2 years ago

While creating the dataset for the ALFFA database in the hshmm recipe, I encountered this error message:

File "/mnt/hdd/user/anaconda3/envs/beer/lib/python3.7/site-packages/numpy/lib/npyio.py", line 440, in load
     return pickle.load(fid, **pickle_kwargs)
_pickle.UnpicklingError: A load persistent id instruction was encountered

followed by another exception:

OSError: Failed to interpret file '/mnt/hdd/user/workspace/HMM/features/alffa/sw/train/mfcc.npz' as a pickle

One possible cause I can think of: during feature extraction, this message appeared

utils/parallel/sge/parallel.sh: line 21: qsub: command not found
INFO: created archive from 0 feature files

but I am not sure if this is the case since the code kept running after that message.

Apologies if this is a minor issue; I am new to the community.

lucasondel commented 2 years ago

Hey,

The unpickling error comes from the fact that the feature extraction failed, so you have an empty features archive and nothing to load.
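
The failure mode is easy to reproduce: the qsub failure left an empty archive, and np.load on a file with no recognizable NumPy/zip magic bytes falls back to pickle and dies with exactly this error. A minimal sketch of a fail-early check (the helper name load_features is mine, not part of beer):

```python
import os
import numpy as np

def load_features(path):
    """Load an .npz feature archive, failing early with a clear message
    if the archive is empty (the symptom of a failed extraction step)."""
    if os.path.getsize(path) == 0:
        raise RuntimeError(
            f"{path} is empty: the feature extraction step probably failed")
    return np.load(path)
```

Checking the archive size before calling np.load turns the cryptic pickle error into a message that points at the real culprit.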

The recipe assumes an SGE-like cluster (i.e. the qsub command) to parallelize the feature extraction, but apparently your environment doesn't have one. To use beer you will definitely need a cluster (we use it for the feature extraction and the training) and also a GPU (for the later stages).

Could you tell us more about your computing environment? Do you have access to a distributed cluster? If yes, is it using SGE or something else like SLURM?

EomSooHwan commented 2 years ago

I am sorry, I am not familiar with distributed clusters. Can you help me find out whether my server has one? I am working on a remote server shared with my lab members. Also, our server has 8 GPUs, each with around 12 GB of memory.

lucasondel commented 2 years ago

The best option is probably to ask your system admin and/or your colleagues about the computing facilities of your lab.

It seems that you don't have an SGE-like environment (no qsub command). If you have a Slurm environment, you would have the sbatch command. If you just have access to a single machine (like a server), then you can use either the Kaldi script https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/parallel/run.pl or, alternatively, the GNU parallel software (run.pl is probably easier).
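
Probing for the available backend is a one-liner per scheduler; a hedged sketch (the backend variable name is mine, not part of beer):

```shell
#!/bin/bash
# Detect an available parallelization backend by probing for its
# submission command; fall back to local execution (run.pl / GNU parallel).
if command -v qsub >/dev/null 2>&1; then
    backend=sge
elif command -v sbatch >/dev/null 2>&1; then
    backend=slurm
else
    backend=local
fi
echo "detected backend: $backend"
```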

In all cases, since you don't have qsub, you will need to create a new directory, say my_parallel_env, in https://github.com/beer-asr/beer/tree/master/recipes/aud/utils/parallel/ that contains 2 scripts: parallel.sh and single.sh.

Have a look at the example here: https://github.com/beer-asr/beer/tree/master/recipes/aud/utils/parallel/sge. Once this is done, you can specify my_parallel_env as the parallel environment when calling submit_parallel.sh.

EomSooHwan commented 2 years ago

Thank you for your detailed advice! It seems my server does not have slurm-client installed, so I will check the Kaldi script. I will try the methods you have suggested and let you know if any problems come up.

EomSooHwan commented 2 years ago

Sorry, I am currently having trouble figuring out how to use run.pl inside the code.

I have added my_parallel_env in recipes/hshmm/utils/parallel with run.pl, but I am confused about how to write parallel.sh and single.sh for run.pl. The code currently cannot interpret run.pl as a command. Also, would the command run.pl JOB=1:$njobs $log_dir/${name}.JOB.log "$cmd" $split_dir || exit 1 be okay?

Sorry for the inconvenience.

lucasondel commented 2 years ago

I'm not sure I understand your issue here. You need to download run.pl and write a new parallel.sh that calls this file directly, something like:

/my/path/to/run.pl JOB=1:$nbjobs ... 

As for how to write parallel.sh: the purpose of this script is to execute a task, say feature extraction, on several input files in parallel. The input files are divided with the split command, so they all have names of the form x[0-9]*. For instance, this is how I handle them with the SGE qsub command; see this file: https://github.com/beer-asr/beer/blob/d53d2a108761ac2fa3fbe5bfaeaa529bac94d350/recipes/aud/utils/parallel/sge/jobarray.qsub#L26
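
The x[0-9]* naming comes from split itself; with GNU coreutils the numbering can be seen directly (the flags below are my assumption for illustration, not taken from the beer recipe):

```shell
#!/bin/bash
set -e
cd "$(mktemp -d)"
# A stand-in for the recipe's list of feature files:
seq 1 100 > filelist.txt
# Split into 8 line-based chunks with 4-digit numeric suffixes
# starting at 1, producing x0001 ... x0008 (GNU split).
split --numeric-suffixes=1 -a 4 -n l/8 filelist.txt x
ls x*
```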

So I think your parallel.sh should look like this (not tested):

# Given as argument to the script.
splitdir=...

# This option is necessary to use the pattern `x*(0)JOB`.
shopt -s extglob

# I use JOBID as job identifier but run.pl uses JOB. 
cmd=$(echo $cmd | sed s/JOBID/JOB/g)
cmd="cat $splitdir/x*(0)JOB \| eval $cmd"

run.pl JOB=1:$njobs $log_dir/${name}.JOB.log "$cmd"
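
The x*(0)JOB pattern relies on bash's extended globbing: once run.pl substitutes JOB with a number, *(0) matches zero or more literal '0' characters, so x*(0)1 matches x1, x01, x001, and x0001. A quick demonstration with throwaway files:

```shell
#!/bin/bash
cd "$(mktemp -d)"
shopt -s extglob
touch x0001 x01 x1 x11
# *(0) matches zero or more '0' characters, so x*(0)1 matches the
# first three names but not x11 (the second '1' is not a zero).
echo x*(0)1
```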

EomSooHwan commented 2 years ago

I think cmd="cat $splitdir/x*(0)JOB \| eval $cmd" is not working because run.pl reads x*(0)JOB as x*(0)1, x*(0)2, and so on. Is there any way I could indicate x0001, x0002, ... using JOB?

I have tried various methods, including perl run.pl JOB=1:$njobs $log_dir/${name}.JOB.log "cat $split_dir/x$(printf "%04d" $JOB) | eval $cmd", but it failed to access each of x0001, x0002, and so on.

Also, even if it can access the split directory, it produces this error message:

cat /mnt/hdd/workspace/features/alffa/sw/train/split/x0001 | eval beer features extract conf/mfcc.yml - /mnt/hdd/workspace/features/alffa/sw/train/mfcc_tmp: No such file or directory

However, I checked that mfcc_tmp is created every time, and I also checked conf/mfcc.yml, so I am not sure where this error comes from.

lucasondel commented 2 years ago

I think cmd="cat $splitdir/x*(0)JOB \| eval $cmd" is not working because run.pl reads x*(0)JOB as x*(0)1, x*(0)2, and so on. Is there any way I could indicate x0001, x0002, ... using JOB?

Yes, I suspected this wouldn't work so easily. The split command is very annoying with its output file names. I think the easiest solution is for your script to rename all the files:
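
A sketch of what such a rename could look like, stripping the leading zeros so that x0001 lines up with run.pl's JOB=1 (this is my guess at the missing snippet, assuming zero-padded split output and no pre-existing files named x1, x2, ...):

```shell
#!/bin/bash
cd "$(mktemp -d)"
touch x0001 x0002 x0010   # typical zero-padded split output
shopt -s nullglob
for f in x*; do
    n=${f#x}                # drop the 'x' prefix
    mv "$f" "x$((10#$n))"   # 10#$n forces base 10, removing leading zeros
done
ls
```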

Also, even if it can access the split directory, it produces this error message: cat /mnt/hdd/workspace/features/alffa/sw/train/split/x0001 | eval beer features extract conf/mfcc.yml - /mnt/hdd/workspace/features/alffa/sw/train/mfcc_tmp: No such file or directory. However, I checked that mfcc_tmp is created every time, and I also checked conf/mfcc.yml, so I am not sure where this error comes from.

This one I'm not so sure about. Perhaps try providing absolute paths; it is possible that run.pl executes the process from a different working directory. Otherwise, you can check the feature extraction script https://github.com/beer-asr/beer/blob/master/beer/cli/subcommands/features/extract.py to see which file cannot be found.

EomSooHwan commented 2 years ago

Thank you for your advice. I have changed parallel.sh to invoke run.pl once per suffix width (1 to 9, 10 to 99, ...), so I think this problem is partially fixed.

However, the issue is that cat /mnt/hdd/workspace/features/alffa/sw/train/split/x01 | eval beer features extract /mnt/hdd/workspace/beer/recipes/hshmm/conf/mfcc.yml - /mnt/hdd/workspace/features/alffa/sw/train/mfcc_tmp still gives me a No such file or directory error message. Moreover, when I run this command in the terminal, beer features extract works perfectly fine (I did change extract.py a bit so that it creates the directory for the save path). I am not sure which part is causing this error.

lucasondel commented 2 years ago

Could you show me the result of:

head /mnt/hdd/workspace/features/alffa/sw/train/split/x01

And also the content of parallel.sh (and related scripts). My guess is that the paths in /mnt/hdd/workspace/features/alffa/sw/train/split/x01 are relative to your working directory.

EomSooHwan commented 2 years ago

The paths in /mnt/hdd/workspace/features/alffa/sw/train/split/x01 are absolute paths. Also, my parallel.sh is:

#!/bin/bash

if [ $# -ne 6 ]; then
    echo "$0 <name> <opts> <njobs> <split-dir> <cmd> <log-dir>"
    exit 1
fi

name=$1
opts=$2
njobs=$3
split_dir=$4
cmd=$5
log_dir=$6

shopt -s extglob
cmd=$(echo $cmd | sed s/JOBID/JOB/g)

perl run.pl JOB=1:9 $log_dir/${name}.JOB.log "cat $split_dir/x0JOB | eval $cmd"
perl run.pl JOB=10:32 $log_dir/${name}.JOB.log "cat $split_dir/xJOB | eval $cmd"

lucasondel commented 2 years ago

This looks OK to me... Just for debugging purposes, could you please add

echo $cmd

just before the perl ... statements? And let me know the output.

EomSooHwan commented 2 years ago

Actually, I think I figured out what the problem was. I had to do perl run.pl JOB=1:9 $log_dir/${name}.JOB.log cat "$split_dir/x0JOB" \| eval $cmd because run.pl was reading the whole quoted command as some kind of directory.
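
The underlying behavior can be reproduced in plain bash: a whole pipeline passed as one quoted word is looked up as a single command name (producing a "command not found" / "No such file or directory" style error), whereas re-parsing it with eval lets the pipe do its job. A small illustration with a stand-in command:

```shell
#!/bin/bash
cmd='echo hello | tr a-z A-Z'

# As a single quoted word, bash searches for a program literally
# named "echo hello | tr a-z A-Z" and fails:
"$cmd" 2>/dev/null || echo "single word: command not found"

# Re-parsed with eval, the pipe is honored and the command runs:
eval "$cmd"
```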

Thank you for helping me a lot for this problem!