KarelVesely84 / kaldi-io-for-python

Python functions for reading kaldi data formats. Useful for rapid prototyping with python.
Apache License 2.0

Writing features as 'ark,scp' by pipeline with 'copy-feats' #45

Open faber6911 opened 4 years ago

faber6911 commented 4 years ago

Hi, I can't get the example you provided for writing the .ark and .scp files at the same time to run correctly (screenshot: error_kaldi_io). If I instead create the .ark file first and then use copy-feats to make a copy of it together with the .scp file, I don't encounter any problems.

jensen199105 commented 3 years ago

I want to ask the same thing. Here are my code snippet and the errors:

import numpy as np
from kaldiio import ReadHelper
import kaldi_io

ark_scp_output = 'ark:| copy-feats ark:- ark,scp:ark:/home/jensen/Document/feats.ark,scp:/home/jensen/Document/feats.scp'

with kaldi_io.open_or_fd(ark_scp_output, 'w') as f:
    dic = {}
    for i in range(10):
        arr = np.random.randn(200, 10)
        dic[str(i)] = arr
    for k,v in dic.items():
        kaldi_io.write_mat(f, v, k)

copy-feats ark:- ark,scp:ark:/home/jensen/Document/feats.ark,scp:/home/jensen/Document/feats.scp
WARNING (copy-feats[5.5.839~1-0c6a]:Open():util/kaldi-table-inl.h:1311) When writing to both archive and script, the script file will generally not be interpreted correctly unless the archive is an actual file: wspecifier = ark,scp:ark:/home/jensen/Document/feats.ark,scp:/home/jensen/Document/feats.scp
WARNING (copy-feats[5.5.839~1-0c6a]:Open():kaldi-io.cc:729) Invalid output filename format ark:/home/jensen/Document/feats.ark
ERROR (copy-feats[5.5.839~1-0c6a]:TableWriter():util/kaldi-table-inl.h:1469) Failed to open table for writing with wspecifier: ark,scp:ark:/home/jensen/Document/feats.ark,scp:/home/jensen/Document/feats.scp: errno (in case it's relevant) is: Success

[ Stack-Trace: ]
copy-feats(kaldi::MessageLogger::LogMessage() const+0x77b) [0x561ab18e1275]
copy-feats(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x561ab1852001]
copy-feats(kaldi::TableWriter<kaldi::KaldiObjectHolder<kaldi::MatrixBase > >::TableWriter(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)+0xee) [0x561ab1861ba6]
copy-feats(main+0x4c9) [0x561ab184ff92]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f82d5ae10b3]
copy-feats(_start+0x2e) [0x561ab184fa0e]

kaldi::KaldiFatalErrorException in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jensen/.local/lib/python3.8/site-packages/kaldi_io/kaldi_io.py", line 97, in cleanup
    raise SubprocessFailed('cmd %s returned %d !' % (cmd,ret))
kaldi_io.kaldi_io.SubprocessFailed: cmd copy-feats ark:- ark,scp:ark:/home/jensen/Document/feats.ark,scp:/home/jensen/Document/feats.scp returned 255 !
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    kaldi_io.write_mat(f, v, k)
  File "/home/jensen/.local/lib/python3.8/site-packages/kaldi_io/kaldi_io.py", line 554, in write_mat
    fd.write(m.tobytes())
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 13, in <module>
    kaldi_io.write_mat(f, v, k)
BrokenPipeError: [Errno 32] Broken pipe
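
The "Invalid output filename format" warning in the log above points at the output specifier: after `ark,scp:` Kaldi expects two plain, comma-separated paths, and here each path carries an extra `ark:` or `scp:` prefix. The following is a minimal sketch of building the piped specifier without those prefixes. The helper name `piped_ark_scp` is hypothetical (it is not part of kaldi_io), and the sketch assumes `copy-feats` is on the PATH when the string is eventually used.

```python
def piped_ark_scp(ark_path, scp_path, tool="copy-feats"):
    """Build a kaldi_io output specifier that pipes features into `tool`.

    Hypothetical helper, not part of kaldi_io. The two paths follow
    'ark,scp:' as a plain comma-separated pair.
    """
    for path in (ark_path, scp_path):
        # A type prefix inside the pair is what the warning above complains about.
        if path.startswith(("ark:", "scp:")):
            raise ValueError("path must not carry a type prefix: %r" % path)
    return "ark:| {} ark:- ark,scp:{},{}".format(tool, ark_path, scp_path)


spec = piped_ark_scp("/home/jensen/Document/feats.ark",
                     "/home/jensen/Document/feats.scp")
print(spec)
```

The resulting string could then be passed to `kaldi_io.open_or_fd` in place of the original `ark_scp_output`.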

faber6911 commented 3 years ago

@jensen199105 I solved it by using WriteHelper from the kaldiio package instead ("from kaldiio import WriteHelper"), with a script like this:

import os
import random
import time

import librosa
import numpy as np
from kaldiio import WriteHelper

# args, compression_method, sample_rate, train_augmentation and add_noise
# are defined elsewhere in the script.
abs_path = args.abs_path
ark_path_train = os.path.join(abs_path, 'data/train/train.ark')
scp_path_train = os.path.join(abs_path, 'data/train/train.scp')
ark_path_test = os.path.join(abs_path, 'data/test/test.ark')
scp_path_test = os.path.join(abs_path, 'data/test/test.scp')

start = time.time()

if not os.path.isfile(ark_path_train):

    # upper bound for the random line index into each musan_<type>.scp file
    noise_choice = {'music':659, 'noise':929, 'speech':425}

    # the context manager closes the ark and scp files when the block ends
    with WriteHelper('ark,scp:{},{}'.format(ark_path_train, scp_path_train), compression_method=compression_method) as writer:
        for count, line in enumerate(open('../data/train/wav.scp')):
            # clean audio path
            utt, path = line.rstrip().split()
            # clean audio file
            clean_audio, _ = librosa.load(path, sr = sample_rate)
            # now for every noise type we augment n times the clean audio file using random noise audio files
            for noise_type in noise_choice:
                for aug in range(train_augmentation):
                    noise_track = np.random.randint(0, noise_choice[noise_type])
                    _, noise_path = open('../data/musan_{}.scp'.format(noise_type)).readlines()[noise_track].rstrip().split()
                    noise_audio, _ = librosa.load(noise_path, sr = sample_rate)
                    noisy_audio = add_noise(clean_audio, noise_audio, snr=random.choice([2.5, 7.5, 12.5, 17.5]))
                    # write ark and associated scp file in train directory
                    writer(utt,np.concatenate((clean_audio.reshape(1, -1), noisy_audio.reshape(1, -1))))

Using this strategy you will be able to create the ark file and the associated scp file.
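
One thing that may be worth checking after running a script like the one above: every augmented copy is written under the same key `utt`, so the resulting scp may list a key several times. The following is a small sketch for spotting that, assuming the usual scp layout of one `<key> <ark path>:<offset>` entry per line. The function names and the sample lines are illustrative and not taken from kaldiio; whether repeated keys are actually a problem depends on how the files are read afterwards.

```python
from collections import Counter


def read_scp_keys(lines):
    """Return the keys of an scp listing, one per non-empty line.

    Each line is assumed to look like '<key> <path>.ark:<offset>'.
    """
    keys = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, _location = line.split(None, 1)
        keys.append(key)
    return keys


def duplicate_keys(keys):
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key in counts if counts[key] > 1]


# Illustrative scp content (made up): 'utt1' was written twice.
example = [
    "utt1 data/train/train.ark:5",
    "utt1 data/train/train.ark:1234",
    "utt2 data/train/train.ark:2468",
]
print(duplicate_keys(read_scp_keys(example)))
```

For a real file, the same functions can be applied to the lines of the scp, for example `read_scp_keys(open(scp_path_train))`. If duplicates turn up, one option is to make each key unique, for instance by appending the noise type and augmentation index to `utt`.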