bootphon / abkhazia

ABX and kaldi experiments on speech corpora made easy
https://docs.cognitive-ml.fr/abkhazia
GNU General Public License v3.0

Kaldi tests not passing on a system with many cores #2

Closed. Thomas-Schatz closed this issue 7 years ago.

Thomas-Schatz commented 7 years ago

On Oberon (a CentOS Linux cluster) with an up-to-date install of abkhazia, the command pytest ./test --basetemp=/home/thomas/tmpdir -vv -x fails with:

===================================================================== test session starts =====================================================================
platform linux2 -- Python 2.7.13, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /home/thomas/.conda/envs/abkhazia2017/bin/python
cachedir: .cache
rootdir: /fhgfs/bootphon/scratch/thomas/abkhazia2017/abkhazia, inifile:
collected 38 items

test/test_acoustic.py::test_acoustic_njobs[4] PASSED
test/test_acoustic.py::test_acoustic_njobs[11] PASSED
test/test_acoustic.py::test_monophone_cmvn_good PASSED
test/test_acoustic.py::test_monophone_cmvn_bad PASSED
test/test_align.py::test_align[both-False] FAILED

========================================================================== FAILURES ===========================================================================
___________________________________________________________________ test_align[both-False] ____________________________________________________________________

corpus = <abkhazia.corpus.corpus.Corpus object at 0x2aab145427d0>, features = '/home/thomas/tmpdir/features0', lm_word = '/home/thomas/tmpdir/lm_word0'
am_mono = '/home/thomas/tmpdir/am_mono0', tmpdir = local('/home/thomas/tmpdir/test_align_both_False_0'), level = 'both', post = False

    @pytest.mark.parametrize('level, post', params)
    def test_align(
            corpus, features, lm_word, am_mono, tmpdir, level, post):
        output_dir = str(tmpdir.mkdir('align-phones'))
        flog = os.path.join(output_dir, 'align-phones.log')
        log = utils.logger.get_log(flog)

        aligner = align.Align(corpus, output_dir=output_dir, log=log)
        aligner.feat_dir = features
        aligner.lm_dir = lm_word
        aligner.am_dir = am_mono
        aligner.level = level
        aligner.with_posteriors = post
>       aligner.compute()

test/test_align.py:41:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
abkhazia/abstract_recipe.py:185: in compute
    self.run()
abkhazia/align/align.py:97: in run
    self._align_fmllr()
abkhazia/align/align.py:167: in _align_fmllr
    self._target_dir()))
abkhazia/abstract_recipe.py:102: in _run_command
                # Parent
                if gc_was_enabled:
                    gc.enable()
            finally:
                # be sure the FD is closed no matter what
                os.close(errpipe_write)

            # Wait for exec to fail or succeed; possibly raising exception
            data = _eintr_retry_call(os.read, errpipe_read, 1048576)
            pickle_bits = []
            while data:
                pickle_bits.append(data)
                data = _eintr_retry_call(os.read, errpipe_read, 1048576)
            data = "".join(pickle_bits)
        finally:
            if p2cread is not None and p2cwrite is not None:
                _close_in_parent(p2cread)
            if c2pwrite is not None and c2pread is not None:
                _close_in_parent(c2pwrite)
            if errwrite is not None and errread is not None:
                _close_in_parent(errwrite)

            # be sure the FD is closed no matter what
            os.close(errpipe_read)

        if data != "":
            try:
                _eintr_retry_call(os.waitpid, self.pid, 0)
            except OSError as e:
                if e.errno != errno.ECHILD:
                    raise
            child_exception = pickle.loads(data)
>           raise child_exception
E           OSError: [Errno 2] No such file or directory

/home/thomas/.conda/envs/abkhazia2017/lib/python2.7/subprocess.py:1024: OSError
-------------------------------------------------------------------- Captured stdout setup --------------------------------------------------------------------
training monophone model
-------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------
computing fMLLR alignment lattice
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================================================ 1 failed, 4 passed in 574.34 seconds =============================================================

My abkhazia.conf file:

# This is the abkhazia configuration file. This file is automatically
# generated during installation. Change the values in here to overload
# the default configuration.

[abkhazia]
# The absolute path to the output data directory of abkhazia.
data-directory:

# The directory where abkhazia write temporary data (usually /tmp or
# /dev/shm).
tmp-directory: /tmp

[kaldi]
# The absolute path to the kaldi distribution directory
kaldi-directory: /cm/shared/apps/kaldi

# "queue.pl" uses qsub. The options to it are options to qsub.  If you
# have GridEngine installed, change this to a queue you have access
# to. Otherwise, use "run.pl", which will run jobs locally

# On Oberon use:
train-cmd: queue.pl -q all.q@puck*.cm.cluster
decode-cmd: queue.pl -q all.q@puck*.cm.cluster
highmem-cmd: queue.pl -q all.q@puck*.cm.cluster

# On Eddie use:
# train-cmd: queue.pl -P inf_hcrc_cstr_general
# decode-cmd: queue.pl -P inf_hcrc_cstr_general
# highmem-cmd: queue.pl -P inf_hcrc_cstr_general -pe memory-2G 2

# To run locally use:
#train-cmd: run.pl
#decode-cmd: run.pl
#highmem-cmd: run.pl

[corpus]
# In this section you can specify the default input directory where to
# read raw data for each supported corpus. By doing so, the
# <input-dir> argument of 'abkhazia prepare <corpus>' becomes optional
# for the corpus you have specified directories here.
aic-directory:
buckeye-directory: /fhgfs/bootphon/data/raw_data/BUCKEYE_revised_bootphon
childes-directory:
cid-directory:
csj-directory:
globalphone-directory:
librispeech-directory:
wsj-directory:
xitsonga-directory:

It looks like the command launched from abkhazia/align/align.py:167 (in _align_fmllr) failed with OSError: [Errno 2] No such file or directory, which is what subprocess raises when the program it tries to execute does not exist.

Thomas-Schatz commented 7 years ago

So the problem was that my version of Kaldi was too old and the script used for aligning (steps/align_fmllr_lats.sh) did not exist. Specifying the path to the Kaldi install provided by Mathieu in my abkhazia.conf file solved the problem.
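
For reference, a quick way to check whether a given Kaldi install ships that script (a sketch assuming the standard egs/wsj/s5 layout that abkhazia's recipes rely on; adjust KALDI_PATH to the kaldi-directory from your abkhazia.conf):

KALDI_PATH=/cm/shared/apps/kaldi
ls $KALDI_PATH/egs/wsj/s5/steps/align_fmllr_lats.sh \
    || echo "align_fmllr_lats.sh missing: this Kaldi is too old for abkhazia"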

The question remains of how to allow external users to get the right version of Kaldi.

I also get another failure later in the tests: test/test_decode.py::test_decode_mono FAILED

E           RuntimeError: command "utils/queue.pl -q all.q@puck*.cm.cluster /home/thomas/tmpdir/test_decode_mono0/decode-mono/recipe/graph/mkgraph.log utils/mkgraph.sh --mono  --transition-scale 1.0 --self-loop-scale 0.1 /home/thomas/tmpdir/lm_word0 /home/thomas/tmpdir/am_mono0 /home/thomas/tmpdir/test_decode_mono0/decode-mono/recipe/graph" returned with 1

abkhazia/utils/jobs.py:73: RuntimeError
-------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------
computing full decoding graph
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

From the log in /home/thomas/tmpdir, I get:

# Running on puck2
# Started at Mon Jul 3 20:06:59 CEST 2017
# utils/mkgraph.sh --mono --transition-scale 1.0 --self-loop-scale 0.1 /home/thomas/tmpdir/lm_word0 /home/thomas/tmpdir/am_mono0 /home/thomas/tmpdir/test_decode_mono
0/decode-mono/recipe/graph
fstarcsort: error while loading shared libraries: libfstscript.so.1: cannot open shared object file: No such file or directory
fsttablecompose /home/thomas/tmpdir/lm_word0/L_disambig.fst /home/thomas/tmpdir/lm_word0/G.fst
fstminimizeencoded
fstpushspecial
fstdeterminizestar --use-log=true
# Accounting: time=1 threads=1
# Finished at Mon Jul 3 20:07:00 CEST 2017 with status 1

So it appears that the initial error is fstarcsort: error while loading shared libraries: libfstscript.so.1: cannot open shared object file: No such file or directory.
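
One way to diagnose this kind of failure is to ask the dynamic loader which libraries the binary actually resolves, and to extend LD_LIBRARY_PATH if one is missing. A sketch, assuming the usual kaldi/tools/openfst layout (the exact paths may differ on your install):

KALDI=/cm/shared/apps/kaldi    # the kaldi-directory from abkhazia.conf
ldd $KALDI/tools/openfst/bin/fstarcsort | grep fstscript
# if libfstscript.so.1 shows up as "not found", point the loader at it:
export LD_LIBRARY_PATH=$KALDI/tools/openfst/lib:$LD_LIBRARY_PATH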

I'm currently re-running the tests to see if I can reproduce it. Any idea what happened?

mmmaat commented 7 years ago

Ok, thank you Thomas for reporting this. The issue is that the fstarcsort binary doesn't find the library libfstscript.so. Maybe there's a mess in my personal Kaldi installation?

I suggest you compile your own Kaldi from scratch by following https://abkhazia.readthedocs.io/en/latest/install.html#kaldi

I made a fork of Kaldi for compatibility with abkhazia here: https://github.com/bootphon/kaldi
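
Roughly, and deferring to the up-to-date instructions in the documentation linked above, the build looks like this (sketch):

git clone https://github.com/bootphon/kaldi.git
cd kaldi/tools && make -j4             # OpenFst and the other dependencies
cd ../src && ./configure && make depend && make -j4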

M

Thomas-Schatz commented 7 years ago

I was not able to reproduce the issue when running the test again... Instead it failed when testing neural network training:

(abkhazia2017)[thomas@oberon abkhazia]$ pytest ./test --basetemp=/home/thomas/tmpdir -x -v
======================================================================== test session starts ========================================================================
platform linux2 -- Python 2.7.13, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /home/thomas/.conda/envs/abkhazia2017/bin/python
cachedir: .cache
rootdir: /fhgfs/bootphon/scratch/thomas/abkhazia2017/abkhazia, inifile:
collected 38 items

test/test_acoustic.py::test_acoustic_njobs[4] PASSED
test/test_acoustic.py::test_acoustic_njobs[11] PASSED
test/test_acoustic.py::test_monophone_cmvn_good PASSED
test/test_acoustic.py::test_monophone_cmvn_bad PASSED
test/test_align.py::test_align[both-False] PASSED
test/test_ark.py::test_read_write[text] PASSED
test/test_ark.py::test_read_write[binary] PASSED
test/test_ark.py::test_h5f_name_of_utterance[a] PASSED
test/test_ark.py::test_h5f_name_of_utterance[a-b] PASSED
test/test_ark.py::test_h5f_name_of_utterance[a_b] PASSED
test/test_ark.py::test_h5f_twice PASSED
test/test_corpus.py::test_save_corpus PASSED
test/test_corpus.py::test_empty PASSED
test/test_corpus.py::test_subcorpus PASSED
test/test_corpus.py::test_split PASSED
test/test_corpus.py::test_split_tiny_train PASSED
test/test_corpus.py::test_split_by_speakers PASSED
test/test_corpus.py::test_spk2utt PASSED
test/test_corpus.py::test_phonemize_text PASSED
test/test_decode.py::test_decode_mono PASSED
test/test_decode.py::test_decode_tri PASSED
test/test_decode.py::test_decode_trisa PASSED
test/test_decode.py::test_decode_nnet ERROR

============================================================================== ERRORS ===============================================================================
________________________________________________________________ ERROR at setup of test_decode_nnet _________________________________________________________________

corpus = <abkhazia.corpus.corpus.Corpus object at 0x2aab10531390>, features = '/home/thomas/tmpdir/features0', lm_word = '/home/thomas/tmpdir/lm_word0'
am_trisa = '/home/thomas/tmpdir/am_trisa0', tmpdir_factory = <_pytest.tmpdir.TempdirFactory instance at 0x2aaaed4417a0>

    @pytest.fixture(scope='session')
    def am_nnet(corpus, features, lm_word, am_trisa, tmpdir_factory):
        output_dir = str(tmpdir_factory.mktemp('am_nnet'))
        flog = os.path.join(output_dir, 'am_nnet.log')
        log = utils.logger.get_log(flog)
        am = acoustic.NeuralNetwork(
            corpus, lm_word, features, am_trisa, output_dir, log=log)

        am.options['num-epochs'].value = 2
        am.options['num-epochs-extra'].value = 1
        am.options['num-hidden-layers'].value = 1
        am.options['num-iters-final'].value = 1
        am.options['pnorm-input-dim'].value = 100
        am.options['pnorm-output-dim'].value = 10
        am.options['num-utts-subset'].value = 20
>       am.compute()

test/conftest.py:168:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
abkhazia/abstract_recipe.py:185: in compute
    self.run()
abkhazia/acoustic/neural_network.py:155: in run
    self._train_pnorm_fast()
abkhazia/acoustic/neural_network.py:205: in _train_pnorm_fast
    self._run_am_command(command, target, message)
abkhazia/acoustic/abstract_acoustic_model.py:140: in _run_am_command
    self._run_command(command, verbose=False)
abkhazia/abstract_recipe.py:102: in _run_command
    cwd=self.recipe_dir)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

command = 'steps/nnet2/train_pnorm_fast.sh --cmd "queue.pl -q all.q@puck*.cm.cluster --config /fhgfs/bootphon/scratch/thomas/abk.../data/acoustic /home/thomas/tmpdir/lm_word0 /home/thomas/tmpdir/am_trisa0 /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet'
stdin = None, stdout = <bound method RootLogger.debug of <logging.RootLogger object at 0x2aaaaf4eead0>>, cwd = '/home/thomas/tmpdir/am_nnet0/recipe'
env = {'SSH_ASKPASS': '/usr/libexec/openssh/gnome-ssh-askpass', 'MODULE_VERSION': '3.2.6', 'kaldi_steps': '/home/thomas/tmpd...C_module()': '() {  eval `/cm/local/apps/environment-modules/3.2.6//Modules/$MODULE_VERSION/bin/modulecmd bash $*`\n}'}
returncode = 0

    def run(command, stdin=None, stdout=sys.stdout.write,
            cwd=None, env=os.environ, returncode=0):
        """Run 'command' as a subprocess

        command : string to be executed as a subprocess

        stdout : standard output/error redirection function. By default
            redirect the output to stdout, but you can redirect to a
            logger with stdout=log.debug for exemple. Use
            stdout=open(os.devnull, 'w').write to ignore the command
            output.

        stdin : standard input redirection, can be a file or any readable
            stream.

        cwd : current working directory for executing the command

        env : current environment for executing the command

        returncode : expected return code of the command

        Returns silently if the command returned with `returncode`, else
        raise a RuntimeError

        """
        job = subprocess.Popen(
            shlex.split(command),
            stdin=stdin,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            cwd=cwd, env=env)

        # join the command output to log (from
        # https://stackoverflow.com/questions/35488927)
        def consume_lines(pipe, consume):
            with pipe:
                # NOTE: workaround read-ahead bug
                for line in iter(pipe.readline, b''):
                    consume(line)
                consume('\n')

        threading.Thread(
            target=consume_lines,
            args=[job.stdout, lambda line: stdout(line)]).start()

        job.wait()

        if job.returncode != returncode:
            raise RuntimeError('command "{}" returned with {}'
>                              .format(command, job.returncode))
E           RuntimeError: command "steps/nnet2/train_pnorm_fast.sh --cmd "queue.pl -q all.q@puck*.cm.cluster --config /fhgfs/bootphon/scratch/thomas/abkhazia2017/abkhazia/abkhazia/share/queue.conf" --num-hidden-layers 1 --presoftmax-prior-scale-power -0.25 --num-iters-final 1 --bias-stddev 0.5 --initial-learning-rate 0.04 --randprune 4.0 --target-multiplier 0 --minibatch-size 128 --num-epochs-extra 1 --shuffle-buffer-size 500 --final-learning-rate 0.004 --splice-width 4 --alpha 4.0 --pnorm-output-dim 10 --samples-per-iter 200000 --add-layers-period 2 --num-epochs 2 --p 2 --pnorm-input-dim 100 --mix-up 0 --io-opts "" --egs-opts "--num-utts-subset 20" --num-threads 20 --parallel-opts "--num-threads 20" --combine-num-threads 8 --combine-parallel-opts "--num-threads 8" /home/thomas/tmpdir/am_nnet0/recipe/data/acoustic /home/thomas/tmpdir/lm_word0 /home/thomas/tmpdir/am_trisa0 /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet" returned with 1

abkhazia/utils/jobs.py:73: RuntimeError
----------------------------------------------------------------------- Captured stdout setup -----------------------------------------------------------------------
training neural network
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=============================================================== 22 passed, 1 error in 2201.34 seconds ===============================================================

The log shows that there were a lot of errors during training:

[thomas@oberon am_nnet0]$ more am_nnet.log
2017-07-03 20:49:46,006 - INFO - training neural network
2017-07-03 20:49:46,025 - DEBUG - steps/nnet2/train_pnorm_fast.sh --cmd queue.pl -q all.q@puck*.cm.cluster --config /fhgfs/bootphon/scratch/thom
as/abkhazia2017/abkhazia/abkhazia/share/queue.conf --num-hidden-layers 1 --presoftmax-prior-scale-power -0.25 --num-iters-final 1 --bias-stddev
0.5 --initial-learning-rate 0.04 --randprune 4.0 --target-multiplier 0 --minibatch-size 128 --num-epochs-extra 1 --shuffle-buffer-size 500 --fin
al-learning-rate 0.004 --splice-width 4 --alpha 4.0 --pnorm-output-dim 10 --samples-per-iter 200000 --add-layers-period 2 --num-epochs 2 --p 2 -
-pnorm-input-dim 100 --mix-up 0 --io-opts  --egs-opts --num-utts-subset 20 --num-threads 20 --parallel-opts --num-threads 20 --combine-num-threa
ds 8 --combine-parallel-opts --num-threads 8 /home/thomas/tmpdir/am_nnet0/recipe/data/acoustic /home/thomas/tmpdir/lm_word0 /home/thomas/tmpdir/
am_trisa0 /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet
2017-07-03 20:49:47,105 - DEBUG - steps/nnet2/train_pnorm_fast.sh: calling get_lda.sh
2017-07-03 20:49:47,108 - DEBUG - steps/nnet2/get_lda.sh --transform-dir /home/thomas/tmpdir/am_trisa0 --splice-width 4 --cmd queue.pl -q all.q@
puck*.cm.cluster --config /fhgfs/bootphon/scratch/thomas/abkhazia2017/abkhazia/abkhazia/share/queue.conf /home/thomas/tmpdir/am_nnet0/recipe/dat
a/acoustic /home/thomas/tmpdir/lm_word0 /home/thomas/tmpdir/am_trisa0 /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet
2017-07-03 20:49:47,158 - DEBUG - steps/nnet2/get_lda.sh: feature type is raw
2017-07-03 20:49:47,168 - DEBUG - feat-to-dim 'ark,s,cs:utils/subset_scp.pl --quiet 500 /home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split2
0/1/feats.scp | apply-cmvn  --utt2spk=ark:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split20/1/utt2spk scp:/home/thomas/tmpdir/am_nnet0/r
ecipe/data/acoustic/split20/1/cmvn.scp scp:- ark:- |' -
2017-07-03 20:49:47,176 - DEBUG - apply-cmvn --utt2spk=ark:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split20/1/utt2spk scp:/home/thomas/
tmpdir/am_nnet0/recipe/data/acoustic/split20/1/cmvn.scp scp:- ark:-
2017-07-03 20:49:47,177 - DEBUG - ERROR (apply-cmvn:Write():kaldi-matrix.cc:1143) Failed to write matrix to stream
2017-07-03 20:49:47,178 - DEBUG - WARNING (apply-cmvn:Write():util/kaldi-holder-inl.h:51) Exception caught writing Table object: ERROR (apply-cm
vn:Write():kaldi-matrix.cc:1143) Failed to write matrix to stream
2017-07-03 20:49:47,178 - DEBUG - [stack trace: ]
2017-07-03 20:49:47,178 - DEBUG - kaldi::KaldiGetStackTrace()
2017-07-03 20:49:47,179 - DEBUG - kaldi::KaldiErrorMessage::~KaldiErrorMessage()
2017-07-03 20:49:47,179 - DEBUG - kaldi::MatrixBase<float>::Write(std::ostream&, bool) const
2017-07-03 20:49:47,179 - DEBUG - kaldi::KaldiObjectHolder<kaldi::Matrix<float> >::Write(std::ostream&, bool, kaldi::Matrix<float> const&)
2017-07-03 20:49:47,179 - DEBUG - kaldi::TableWriterArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Write(std::string const&, kal
di::Matrix<float> const&)
2017-07-03 20:49:47,180 - DEBUG - kaldi::TableWriter<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Write(std::string const&, kaldi::Matrix<
float> const&) const
2017-07-03 20:49:47,180 - DEBUG - apply-cmvn(main+0x6fa) [0x465616]
2017-07-03 20:49:47,180 - DEBUG - /lib64/libc.so.6(__libc_start_main+0xfd) [0x30c2a1ed5d]
2017-07-03 20:49:47,180 - DEBUG - apply-cmvn() [0x464e39]
2017-07-03 20:49:47,180 - DEBUG - WARNING (apply-cmvn:Write():util/kaldi-table-inl.h:693) TableWriter: write failure to standard output
2017-07-03 20:49:47,181 - DEBUG - ERROR (apply-cmvn:Write():util/kaldi-table-inl.h:1142) Error in TableWriter::Write
2017-07-03 20:49:47,181 - DEBUG - WARNING (apply-cmvn:Close():util/kaldi-table-inl.h:724) TableWriter: error closing stream: standard output
2017-07-03 20:49:47,181 - DEBUG - ERROR (apply-cmvn:~TableWriter():util/kaldi-table-inl.h:1165) Error closing TableWriter [in destructor].
2017-07-03 20:49:47,181 - DEBUG - sh: line 1:  7113 Done                    utils/subset_scp.pl --quiet 500 /home/thomas/tmpdir/am_nnet0/recipe/
data/acoustic/split20/1/feats.scp
2017-07-03 20:49:47,181 - DEBUG - 7114 Aborted                 | apply-cmvn --utt2spk=ark:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/spli
t20/1/utt2spk scp:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split20/1/cmvn.scp scp:- ark:-
2017-07-03 20:49:47,182 - DEBUG - WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe utils/subset_scp.pl --quiet 500 /home/thomas/tmpdir/am_nnet
0/recipe/data/acoustic/split20/1/feats.scp | apply-cmvn  --utt2spk=ark:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split20/1/utt2spk scp:/
home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split20/1/cmvn.scp scp:- ark:- | had nonzero return status 34304
2017-07-03 20:49:47,189 - DEBUG - feat-to-dim 'ark,s,cs:utils/subset_scp.pl --quiet 500 /home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split2
0/1/feats.scp | apply-cmvn  --utt2spk=ark:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split20/1/utt2spk scp:/home/thomas/tmpdir/am_nnet0/r
ecipe/data/acoustic/split20/1/cmvn.scp scp:- ark:- | splice-feats --left-context=4 --right-context=4 ark:- ark:- |' -
2017-07-03 20:49:47,196 - DEBUG - apply-cmvn --utt2spk=ark:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split20/1/utt2spk scp:/home/thomas/
tmpdir/am_nnet0/recipe/data/acoustic/split20/1/cmvn.scp scp:- ark:-
2017-07-03 20:49:47,246 - DEBUG - splice-feats --left-context=4 --right-context=4 ark:- ark:-
2017-07-03 20:49:47,247 - DEBUG - ERROR (splice-feats:Write():kaldi-matrix.cc:1143) Failed to write matrix to stream
2017-07-03 20:49:47,249 - DEBUG - WARNING (splice-feats:Write():util/kaldi-holder-inl.h:51) Exception caught writing Table object: ERROR (splice
-feats:Write():kaldi-matrix.cc:1143) Failed to write matrix to stream
2017-07-03 20:49:47,249 - DEBUG - [stack trace: ]
2017-07-03 20:49:47,249 - DEBUG - kaldi::KaldiGetStackTrace()
2017-07-03 20:49:47,249 - DEBUG - kaldi::KaldiErrorMessage::~KaldiErrorMessage()
2017-07-03 20:49:47,249 - DEBUG - kaldi::MatrixBase<float>::Write(std::ostream&, bool) const
2017-07-03 20:49:47,250 - DEBUG - kaldi::KaldiObjectHolder<kaldi::Matrix<float> >::Write(std::ostream&, bool, kaldi::Matrix<float> const&)
2017-07-03 20:49:47,250 - DEBUG - kaldi::TableWriterArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Write(std::string const&, kal
di::Matrix<float> const&)
2017-07-03 20:49:47,250 - DEBUG - kaldi::TableWriter<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Write(std::string const&, kaldi::Matrix<
float> const&) const
2017-07-03 20:49:47,250 - DEBUG - splice-feats(main+0x292) [0x454f6e]
2017-07-03 20:49:47,251 - DEBUG - /lib64/libc.so.6(__libc_start_main+0xfd) [0x30c2a1ed5d]
2017-07-03 20:49:47,251 - DEBUG - splice-feats() [0x454bf9]
2017-07-03 20:49:47,251 - DEBUG - WARNING (splice-feats:Write():util/kaldi-table-inl.h:693) TableWriter: write failure to standard output
2017-07-03 20:49:47,251 - DEBUG - ERROR (splice-feats:Write():util/kaldi-table-inl.h:1142) Error in TableWriter::Write
2017-07-03 20:49:47,252 - DEBUG - WARNING (splice-feats:Close():util/kaldi-table-inl.h:724) TableWriter: error closing stream: standard output
2017-07-03 20:49:47,252 - DEBUG - ERROR (splice-feats:~TableWriter():util/kaldi-table-inl.h:1165) Error closing TableWriter [in destructor].
2017-07-03 20:49:47,252 - DEBUG - ERROR (apply-cmvn:Write():kaldi-matrix.cc:1143) Failed to write matrix to stream
2017-07-03 20:49:47,252 - DEBUG - WARNING (apply-cmvn:Write():util/kaldi-holder-inl.h:51) Exception caught writing Table object: ERROR (apply-cm
vn:Write():kaldi-matrix.cc:1143) Failed to write matrix to stream
2017-07-03 20:49:47,252 - DEBUG - [stack trace: ]
2017-07-03 20:49:47,252 - DEBUG - kaldi::KaldiGetStackTrace()
2017-07-03 20:49:47,253 - DEBUG - kaldi::KaldiErrorMessage::~KaldiErrorMessage()
2017-07-03 20:49:47,253 - DEBUG - kaldi::MatrixBase<float>::Write(std::ostream&, bool) const
2017-07-03 20:49:47,253 - DEBUG - kaldi::KaldiObjectHolder<kaldi::Matrix<float> >::Write(std::ostream&, bool, kaldi::Matrix<float> const&)
2017-07-03 20:49:47,253 - DEBUG - kaldi::TableWriterArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Write(std::string const&, kal
di::Matrix<float> const&)
2017-07-03 20:49:47,253 - DEBUG - kaldi::TableWriter<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Write(std::string const&, kaldi::Matrix<
float> const&) const
2017-07-03 20:49:47,254 - DEBUG - apply-cmvn(main+0x6fa) [0x465616]
2017-07-03 20:49:47,254 - DEBUG - /lib64/libc.so.6(__libc_start_main+0xfd) [0x30c2a1ed5d]
2017-07-03 20:49:47,254 - DEBUG - apply-cmvn() [0x464e39]
2017-07-03 20:49:47,254 - DEBUG - WARNING (apply-cmvn:Write():util/kaldi-table-inl.h:693) TableWriter: write failure to standard output
2017-07-03 20:49:47,254 - DEBUG - ERROR (apply-cmvn:Write():util/kaldi-table-inl.h:1142) Error in TableWriter::Write
2017-07-03 20:49:47,255 - DEBUG - WARNING (apply-cmvn:Close():util/kaldi-table-inl.h:724) TableWriter: error closing stream: standard output
2017-07-03 20:49:47,255 - DEBUG - ERROR (apply-cmvn:~TableWriter():util/kaldi-table-inl.h:1165) Error closing TableWriter [in destructor].
2017-07-03 20:49:47,255 - DEBUG - sh: line 1:  7120 Done                    utils/subset_scp.pl --quiet 500 /home/thomas/tmpdir/am_nnet0/recipe/
data/acoustic/split20/1/feats.scp
2017-07-03 20:49:47,255 - DEBUG - 7121 Aborted                 | apply-cmvn --utt2spk=ark:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/spli
t20/1/utt2spk scp:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split20/1/cmvn.scp scp:- ark:-
2017-07-03 20:49:47,255 - DEBUG - 7122 Aborted                 | splice-feats --left-context=4 --right-context=4 ark:- ark:-
2017-07-03 20:49:47,256 - DEBUG - WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe utils/subset_scp.pl --quiet 500 /home/thomas/tmpdir/am_nnet
0/recipe/data/acoustic/split20/1/feats.scp | apply-cmvn  --utt2spk=ark:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split20/1/utt2spk scp:/
home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/split20/1/cmvn.scp scp:- ark:- | splice-feats --left-context=4 --right-context=4 ark:- ark:- |
had nonzero return status 34304
2017-07-03 20:49:47,256 - DEBUG - steps/nnet2/get_lda.sh: Accumulating LDA statistics.
2017-07-03 20:50:00,806 - DEBUG - steps/nnet2/get_lda.sh: Finished estimating LDA
2017-07-03 20:50:00,811 - DEBUG - steps/nnet2/train_pnorm_fast.sh: calling get_egs.sh
2017-07-03 20:50:00,814 - DEBUG - steps/nnet2/get_egs.sh --num-utts-subset 20 --transform-dir /home/thomas/tmpdir/am_trisa0 --splice-width 4 --s
amples-per-iter 200000 --num-jobs-nnet 16 --stage 0 --cmd queue.pl -q all.q@puck*.cm.cluster --config /fhgfs/bootphon/scratch/thomas/abkhazia201
7/abkhazia/abkhazia/share/queue.conf --num-utts-subset 20 --io-opts  /home/thomas/tmpdir/am_nnet0/recipe/data/acoustic /home/thomas/tmpdir/lm_wo
rd0 /home/thomas/tmpdir/am_trisa0 /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet
2017-07-03 20:50:00,888 - DEBUG - steps/nnet2/get_egs.sh: feature type is raw
2017-07-03 20:50:00,889 - DEBUG - steps/nnet2/get_egs.sh: working out number of frames of training data
2017-07-03 20:50:00,905 - DEBUG - utils/data/get_utt2dur.sh: working out /home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/utt2dur from /home/th
omas/tmpdir/am_nnet0/recipe/data/acoustic/segments
2017-07-03 20:50:00,915 - DEBUG - utils/data/get_utt2dur.sh: computed /home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/utt2dur
2017-07-03 20:50:00,994 - DEBUG - feat-to-len scp:/home/thomas/tmpdir/am_nnet0/recipe/data/acoustic/feats.scp ark,t:-
2017-07-03 20:50:01,204 - DEBUG - WARNING (feat-to-len:Write():util/kaldi-holder-inl.h:122) Exception caught writing Table object: Write failure
 in WriteBasicType.
2017-07-03 20:50:01,204 - DEBUG - Write failure in WriteBasicType.WARNING (feat-to-len:Write():util/kaldi-table-inl.h:693) TableWriter: write fa
ilure to standard output
2017-07-03 20:50:01,205 - DEBUG - ERROR (feat-to-len:Write():util/kaldi-table-inl.h:1142) Error in TableWriter::Write
2017-07-03 20:50:01,205 - DEBUG - WARNING (feat-to-len:Close():util/kaldi-table-inl.h:724) TableWriter: error closing stream: standard output
2017-07-03 20:50:01,205 - DEBUG - ERROR (feat-to-len:~TableWriter():util/kaldi-table-inl.h:1165) Error closing TableWriter [in destructor].
2017-07-03 20:50:01,215 - DEBUG - steps/nnet2/get_egs.sh: Every epoch, splitting the data up into 1 iterations,
2017-07-03 20:50:01,215 - DEBUG - steps/nnet2/get_egs.sh: giving samples-per-iteration of 59523 (you requested 200000).
2017-07-03 20:50:09,806 - DEBUG - Getting validation and training subset examples.
2017-07-03 20:50:09,808 - DEBUG - steps/nnet2/get_egs.sh: extracting validation and training-subset alignments.
2017-07-03 20:50:09,835 - DEBUG - copy-int-vector ark:- ark,t:-
2017-07-03 20:50:10,205 - DEBUG - LOG (copy-int-vector:main():copy-int-vector.cc:83) Copied 999 vectors of int32.
2017-07-03 20:50:16,316 - DEBUG - Getting subsets of validation examples for diagnostics and combination.
2017-07-03 20:50:21,069 - DEBUG - Creating training examples
2017-07-03 20:50:21,071 - DEBUG - Generating training examples on disk
2017-07-03 20:50:32,249 - DEBUG - steps/nnet2/get_egs.sh: rearranging examples into parts for different parallel jobs
2017-07-03 20:50:32,249 - DEBUG - steps/nnet2/get_egs.sh: Since iters-per-epoch == 1, just concatenating the data.
2017-07-03 20:50:33,374 - DEBUG - Shuffling the order of training examples
2017-07-03 20:50:33,375 - DEBUG - (in order to avoid stressing the disk, these won't all run at once).
2017-07-03 20:50:41,666 - DEBUG - steps/nnet2/get_egs.sh: Finished preparing training examples
2017-07-03 20:50:41,671 - DEBUG - steps/nnet2/train_pnorm_fast.sh: initializing neural net
2017-07-03 20:50:41,709 - DEBUG - Usage: queue.pl [options] [JOB=1:n] log-file command-line arguments...
2017-07-03 20:50:41,709 - DEBUG - e.g.: queue.pl foo.log echo baz
2017-07-03 20:50:41,710 - DEBUG - (which will echo "baz", with stdout and stderr directed to foo.log)
2017-07-03 20:50:41,710 - DEBUG - or: queue.pl -q all.q@xyz foo.log echo bar | sed s/bar/baz/
2017-07-03 20:50:41,710 - DEBUG - (which is an example of using a pipe; you can provide other escaped bash constructs)
2017-07-03 20:50:41,710 - DEBUG - or: queue.pl -q all.q@qyz JOB=1:10 foo.JOB.log echo JOB
2017-07-03 20:50:41,710 - DEBUG - (which illustrates the mechanism to submit parallel jobs; note, you can use
2017-07-03 20:50:41,711 - DEBUG - another string other than JOB)
2017-07-03 20:50:41,711 - DEBUG - Note: if you pass the "-sync y" option to qsub, this script will take note
2017-07-03 20:50:41,711 - DEBUG - and change its behavior.  Otherwise it uses qstat to work out when the job finished
2017-07-03 20:50:41,711 - DEBUG - Options:
2017-07-03 20:50:41,711 - DEBUG - --config <config-file> (default: conf/queue.conf)
2017-07-03 20:50:41,712 - DEBUG - --mem <mem-requirement> (e.g. --mem 2G, --mem 500M,
2017-07-03 20:50:41,712 - DEBUG - also support K and numbers mean bytes)
2017-07-03 20:50:41,712 - DEBUG - --num-threads <num-threads> (default: 1)
2017-07-03 20:50:41,712 - DEBUG - --max-jobs-run <num-jobs>
2017-07-03 20:50:41,712 - DEBUG - --gpu <0|1> (default: 0)
2017-07-03 20:50:41,832 - DEBUG - nnet-am-init /home/thomas/tmpdir/am_trisa0/tree /home/thomas/tmpdir/lm_word0/topo 'nnet-init /home/thomas/tmpd
ir/am_nnet0/recipe/exp/nnet/nnet.config -|' /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet/0.mdl
2017-07-03 20:50:41,956 - DEBUG - nnet-init /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet/nnet.config -
2017-07-03 20:50:41,960 - DEBUG - LOG (nnet-init:main():nnet-init.cc:71) Initialized raw neural net and wrote it to -
2017-07-03 20:50:41,962 - DEBUG - LOG (nnet-am-init:main():nnet-am-init.cc:103) Initialized neural net and wrote it to /home/thomas/tmpdir/am_nn
et0/recipe/exp/nnet/0.mdl
2017-07-03 20:50:41,963 - DEBUG - Training transition probabilities and setting priors
2017-07-03 20:50:45,119 - DEBUG - prepare vector assignment for FixedScaleComponent before softmax
2017-07-03 20:50:45,120 - DEBUG - (use priors^-0.25 and rescale to average 1)
2017-07-03 20:50:53,293 - DEBUG - queue.pl: 20 / 20 failed, log is in /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet/log/acc_pdf.*.log

Most errors seem related to writing features, starting with the CMVN computations, but they did not seem to stop the program from running.

The logs in /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet/log/acc_pdf.*.log all look like this:

::::::::::::::
/home/thomas/tmpdir/am_nnet0/recipe/exp/nnet/log/acc_pdf.14.log
::::::::::::::
# Running on puck1
# Started at Mon Jul 3 20:50:49 CEST 2017
# ali-to-post "ark:gunzip -c /home/thomas/tmpdir/am_trisa0/ali.14.gz|" ark:- | post-to-tacc --per-pdf=true --binary=false /home/thomas/tmpdir/am
_trisa0/final.mdl ark:- /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet/14.pacc
ali-to-post 'ark:gunzip -c /home/thomas/tmpdir/am_trisa0/ali.14.gz|' ark:-

From posteriors, compute transition-accumulators
The output is a vector of counts/soft-counts, indexed by transition-id)
Note: the model is only read in order to get the size of the vector

Usage: post-to-tacc [options] <model> <post-rspecifier> <accs>
 e.g.: post-to-tacc --binary=false 1.mdl "ark:ali-to-post 1.ali|" 1.tacc

Options:
  --binary                    : Write output in binary mode. (bool, default = true)

Standard options:
  --config                    : Configuration file to read (this option may be repeated) (string, default = "")
  --help                      : Print out usage message (bool, default = false)
  --print-args                : Print the command line arguments (to stderr) (bool, default = true)
  --verbose                   : Verbose level (higher->more logging) (int, default = 0)

Command line was: post-to-tacc --per-pdf=true --binary=false /home/thomas/tmpdir/am_trisa0/final.mdl ark:- /home/thomas/tmpdir/am_nnet0/recipe/e
xp/nnet/14.pacc
ERROR (post-to-tacc:Read():parse-options.cc:375) Invalid option --per-pdf=true
ERROR (post-to-tacc:Read():parse-options.cc:375) Invalid option --per-pdf=true

[stack trace: ]
kaldi::KaldiGetStackTrace()
kaldi::KaldiErrorMessage::~KaldiErrorMessage()
kaldi::ParseOptions::Read(int, char const* const*)
post-to-tacc(main+0x112) [0x4cd52e]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x362c01ed1d]
post-to-tacc() [0x4cd339]

# Accounting: time=0 threads=1
# Finished at Mon Jul 3 20:50:49 CEST 2017 with status 255

Apparently the fatal error was caused by passing the unrecognized --per-pdf=true option to the post-to-tacc Kaldi utility, which again suggests a mismatch between the recipe scripts and the installed Kaldi binaries.
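
A quick way to check whether the post-to-tacc on the PATH knows this option is to grep its usage message (sketch; as the log above shows, Kaldi binaries print their options with --help):

post-to-tacc --help 2>&1 | grep -- --per-pdf \
    || echo "this post-to-tacc does not support --per-pdf"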

I'll try to run the test once again to see if this is reproducible.

mmmaat commented 7 years ago

Hi Thomas, actually I improved the test suite (it now uses only 50 fixed utterances from 4 speakers, instead of 1000 random ones). I also found and fixed a few minor bugs...

On my side, all the tests are now passing, both on my desktop computer and on the cluster (using run.pl or queue.pl).

So if you rerun the tests, they should pass!

Thomas-Schatz commented 7 years ago

I just re-ran the tests on a fresh install on Oberon and got the same error as before for the nnet training test.

Specifically, I ran:

module load python-anaconda
conda create --name abkhazia2017 python=2 anaconda
source activate abkhazia2017
mkdir abkhazia2017
cd abkhazia2017
git clone https://github.com/bootphon/abkhazia.git
cd abkhazia
module load kaldi
KALDI_PATH=/home/mbernard/dev/abkhazia/kaldi ./configure
python setup.py build
pip install h5features --upgrade
python setup.py develop
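
A minimal smoke test after such an install (a sketch, using the paths from this setup) is to check that the package imports and that the configured Kaldi directory is the expected one:

python -c "import abkhazia; print(abkhazia.__file__)"
grep kaldi-directory abkhazia/share/abkhazia.conf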

I edited the abkhazia.conf file as follows:

# This is the abkhazia configuration file. This file is automatically
# generated during installation. Change the values in here to overload
# the default configuration.

[abkhazia]
# The absolute path to the output data directory of abkhazia.
data-directory:

# The directory where abkhazia write temporary data (usually /tmp or
# /dev/shm).
tmp-directory: /tmp

[kaldi]
# The absolute path to the kaldi distribution directory
kaldi-directory: /home/mbernard/dev/abkhazia/kaldi

# "queue.pl" uses qsub. The options to it are options to qsub.  If you
# have GridEngine installed, change this to a queue you have access
# to. Otherwise, use "run.pl", which will run jobs locally

# On Oberon use:
train-cmd: queue.pl -q all.q@puck*.cm.cluster
decode-cmd: queue.pl -q all.q@puck*.cm.cluster
highmem-cmd: queue.pl -q all.q@puck*.cm.cluster

# On Eddie use:
# train-cmd: queue.pl -P inf_hcrc_cstr_general
# decode-cmd: queue.pl -P inf_hcrc_cstr_general
# highmem-cmd: queue.pl -P inf_hcrc_cstr_general -pe memory-2G 2

# To run locally use:
# train-cmd: run.pl
# decode-cmd: run.pl
# highmem-cmd: run.pl

[corpus]
# In this section you can specify the default input directory where to
# read raw data for each supported corpus. By doing so, the
# <input-dir> argument of 'abkhazia prepare <corpus>' becomes optional
# for the corpus you have specified directories here.
aic-directory:
buckeye-directory: /scratch1/data/raw_data/BUCKEYE_revised_bootphon
childes-directory:
cid-directory:
csj-directory:
globalphone-directory:
librispeech-directory:
wsj-directory:
xitsonga-directory:

Then I ran the tests:

screen
module load python-anaconda
source activate abkhazia2017 
pytest ./test --basetemp=/home/thomas/tmpdir -x -v

The pytest output is below.

============================= test session starts ==============================
platform linux2 -- Python 2.7.13, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /home/thomas/.conda/envs/abkhazia2017/bin/python
cachedir: .cache
rootdir: /scratch1/users/thomas/abkhazia2017/abkhazia, inifile:
collected 52 items

test/test_acoustic.py::test_acoustic_njobs[4] PASSED
test/test_acoustic.py::test_monophone_cmvn_good PASSED
test/test_acoustic.py::test_monophone_cmvn_bad PASSED
test/test_align.py::test_align[both-False] PASSED
test/test_ark.py::test_read_write[text] PASSED
test/test_ark.py::test_read_write[binary] PASSED
test/test_ark.py::test_h5f_name_of_utterance[a] PASSED
test/test_ark.py::test_h5f_name_of_utterance[a-b] PASSED
test/test_ark.py::test_h5f_name_of_utterance[a_b] PASSED
test/test_ark.py::test_h5f_twice PASSED
test/test_corpus.py::test_save_corpus[True] PASSED
test/test_corpus.py::test_save_corpus[False] PASSED
test/test_corpus.py::test_empty PASSED
test/test_corpus.py::test_subcorpus PASSED
test/test_corpus.py::test_split PASSED
test/test_corpus.py::test_split_tiny_train PASSED
test/test_corpus.py::test_split_by_speakers PASSED
test/test_corpus.py::test_split_and_save[True] PASSED
test/test_corpus.py::test_split_and_save[False] PASSED
test/test_corpus.py::test_split_less_than_1[True] PASSED
test/test_corpus.py::test_split_less_than_1[False] PASSED
test/test_corpus.py::test_spk2utt PASSED
test/test_corpus.py::test_phonemize_text PASSED
test/test_corpus.py::test_phonemize_corpus PASSED
test/test_decode.py::test_decode_mono[True] PASSED
test/test_decode.py::test_decode_mono[False] PASSED
test/test_decode.py::test_decode_tri[True] PASSED
test/test_decode.py::test_decode_tri[False] PASSED
test/test_decode.py::test_decode_trisa[True] PASSED
test/test_decode.py::test_decode_trisa[False] PASSED
test/test_decode.py::test_decode_nnet[True] ERROR

================================================================= ERRORS ==================================================================
________________________________________________ ERROR at setup of test_decode_nnet[True] _________________________________________________

corpus = <abkhazia.corpus.corpus.Corpus object at 0x2aaab52c4050>, features = '/home/thomas/tmpdir/features0'
am_trisa = '/home/thomas/tmpdir/am_trisa0', tmpdir_factory = <_pytest.tmpdir.TempdirFactory instance at 0x2aaada7fdb00>
lang_args = {'keep_tmp_dirs': True, 'level': 'word', 'position_dependent_phones': False, 'silence_probability': 0.5}

    @pytest.fixture(scope='session')
    def am_nnet(corpus, features, am_trisa, tmpdir_factory, lang_args):
        output_dir = str(tmpdir_factory.mktemp('am_nnet'))
        flog = os.path.join(output_dir, 'am_nnet.log')
        log = utils.logger.get_log(flog)
        am = acoustic.NeuralNetwork(
            corpus, features, am_trisa, output_dir, lang_args, log=log)

        am.options['num-epochs'].value = 2
        am.options['num-epochs-extra'].value = 1
        am.options['num-hidden-layers'].value = 1
        am.options['num-iters-final'].value = 1
        am.options['pnorm-input-dim'].value = 1
        am.options['pnorm-output-dim'].value = 1
        am.options['num-utts-subset'].value = 2
>       am.compute()

test/conftest.py:246:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
abkhazia/abstract_recipe.py:185: in compute
    self.run()
abkhazia/acoustic/neural_network.py:167: in run
    self._train_pnorm_fast()
abkhazia/acoustic/neural_network.py:217: in _train_pnorm_fast
    self._run_am_command(command, target, message)
abkhazia/acoustic/abstract_acoustic_model.py:170: in _run_am_command
    self._run_command(command, verbose=False)
abkhazia/abstract_recipe.py:102: in _run_command
    cwd=self.recipe_dir)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

command = 'steps/nnet2/train_pnorm_fast.sh --cmd "queue.pl -q all.q@puck*.cm.cluster --config /scratch1/users/thomas/abkhazia201.../acoustic /home/thomas/tmpdir/am_nnet0/lang /home/thomas/tmpdir/am_trisa0 /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet'
stdin = None, stdout = <bound method RootLogger.debug of <logging.RootLogger object at 0x2aaab51df7d0>>
cwd = '/home/thomas/tmpdir/am_nnet0/recipe'
env = {'SSH_ASKPASS': '/usr/libexec/openssh/gnome-ssh-askpass', 'MODULE_VERSION': '3.2.6', 'CUDA_ROOT': '/cm/local/apps/cuda...ka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:'}
returncode = 0

    def run(command, stdin=None, stdout=sys.stdout.write,
            cwd=None, env=os.environ, returncode=0):
        """Run 'command' as a subprocess

        command : string to be executed as a subprocess

        stdout : standard output/error redirection function. By default
            redirect the output to stdout, but you can redirect to a
            logger with stdout=log.debug for exemple. Use
            stdout=open(os.devnull, 'w').write to ignore the command
            output.

        stdin : standard input redirection, can be a file or any readable
            stream.

        cwd : current working directory for executing the command

        env : current environment for executing the command

        returncode : expected return code of the command

        Returns silently if the command returned with `returncode`, else
        raise a RuntimeError

        """
        job = subprocess.Popen(
            shlex.split(command),
            stdin=stdin,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            cwd=cwd, env=env)

        # join the command output to log (from
        # https://stackoverflow.com/questions/35488927)
        def consume_lines(pipe, consume):
            with pipe:
                # NOTE: workaround read-ahead bug
                for line in iter(pipe.readline, b''):
                    consume(line)
                consume('\n')

        threading.Thread(
            target=consume_lines,
            args=[job.stdout, lambda line: stdout(line)]).start()

        job.wait()

        if job.returncode != returncode:
            raise RuntimeError('command "{}" returned with {}'
>                              .format(command, job.returncode))
E           RuntimeError: command "steps/nnet2/train_pnorm_fast.sh --cmd "queue.pl -q all.q@puck*.cm.cluster --config /scratch1/users/thomas/abkhazia2017/abkhazia/abkhazia/share/queue.conf" --num-hidden-layers 1 --presoftmax-prior-scale-power -0.25 --num-iters-final 1 --bias-stddev 0.5 --initial-learning-rate 0.04 --randprune 4.0 --target-multiplier 0 --minibatch-size 128 --num-epochs-extra 1 --shuffle-buffer-size 500 --final-learning-rate 0.004 --splice-width 4 --alpha 4.0 --pnorm-output-dim 1 --samples-per-iter 200000 --add-layers-period 2 --num-epochs 2 --p 2 --pnorm-input-dim 1 --mix-up 0 --io-opts "" --egs-opts "--num-utts-subset 2" --num-threads 3 --parallel-opts "--num-threads 3" --combine-num-threads 8 --combine-parallel-opts "--num-threads 8" /home/thomas/tmpdir/am_nnet0/recipe/data/acoustic /home/thomas/tmpdir/am_nnet0/lang /home/thomas/tmpdir/am_trisa0 /home/thomas/tmpdir/am_nnet0/recipe/exp/nnet" returned with 1

abkhazia/utils/jobs.py:73: RuntimeError
---------------------------------------------------------- Captured stdout setup ----------------------------------------------------------
asking 20 cores but reduced to 3
preparing lexicon in /home/thomas/tmpdir/am_nnet0/lang (L.fst)...
running "/home/mbernard/dev/abkhazia/kaldi/egs/wsj/s5/utils/prepare_lang.sh --position-dependent-phones false --sil-prob 0.5 /home/thomas/tmpdir/am_nnet0/lang/recipe/data/local/dict "<unk>" /home/thomas/tmpdir/am_nnet0/lang/local /home/thomas/tmpdir/am_nnet0/lang"
training neural network
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================== 30 passed, 1 error in 763.04 seconds ===================================================

Can somebody reproduce this?

mmmaat commented 7 years ago

Hi Thomas,

I didn't reproduce your bug; for me everything is OK. You are using --basetemp in your home directory, but that partition is almost full. Can you try it in your scratch?
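
That would also explain the "Failed to write matrix to stream" errors above: Kaldi aborts when it cannot write its output. A quick check of the available space before choosing a --basetemp (sketch, with the paths used in this thread):

df -h /home/thomas /scratch1 /tmp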

Thomas-Schatz commented 7 years ago

Setting --basetemp in my scratch worked!
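
For the record, the rerun then looks something like this (the exact scratch path is an assumption, based on the rootdir shown above):

pytest ./test --basetemp=/scratch1/users/thomas/tmpdir -x -v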