afrancl / BinauralLocalizationCNN

Code to create networks that localize sounds sources in 3D environments

undefined variables #1

Closed bingo-todd closed 2 years ago

bingo-todd commented 2 years ago

Hi, I am trying to replicate your work, but I found two variables that are not defined in "tf_record_CNN_spherical_gradcheckpoint_valid_pad.py": "subbands_batch" and "T".

afrancl commented 2 years ago

Thanks for your interest in the work and sorry for the trouble with this!

The code should still run as-is but please tell me if you're getting a specific error.

bingo-todd commented 2 years ago

Sorry for the late reply.

The problem with "T" occurs at line 819: batch_conditional2 += [(cond, var) for cond, var in zip(cd2, e_vars, T)]

I have another question: "layer_generator" seems to be a script, but it is not included in this repo.
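If "T" at line 819 is leftover debugging code, one possible repair (an assumption on my part, not confirmed by the author) is simply to drop it from the zip, since the comprehension unpacks exactly two values per element. A minimal sketch with stand-in lists:

```python
# Hypothetical stand-ins for cd2 and e_vars from the script; the real
# objects are TensorFlow conditional ops and variables.
cd2 = ["cond_a", "cond_b"]
e_vars = ["var_a", "var_b"]

# Original line 819 raises NameError because T is undefined:
#   batch_conditional2 += [(cond, var) for cond, var in zip(cd2, e_vars, T)]
# Dropping T keeps the 2-tuple unpacking consistent (assumed fix):
batch_conditional2 = []
batch_conditional2 += [(cond, var) for cond, var in zip(cd2, e_vars)]
print(batch_conditional2)  # -> [('cond_a', 'var_a'), ('cond_b', 'var_b')]
```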

bingo-todd commented 2 years ago

I have managed to run the evaluation-related scripts. But using the network architectures provided in the paper, I can only restore 7 networks. Errors occur when I try to restore networks 3, 4, and 9. For example, when I try to restore net 3, the following error occurs:

Traceback (most recent call last):
  File "/mnt/Disk2/Work_Space/anaconda3/envs/py3.6_tf1.13/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/mnt/Disk2/Work_Space/anaconda3/envs/py3.6_tf1.13/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/mnt/Disk2/Work_Space/anaconda3/envs/py3.6_tf1.13/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [1,32,32,32] rhs shape= [3,32,32,32]
         [[{{node save/Assign_57}}]]

After changing the kernel shape of the second conv layer in net 3 to [3, 32, 32], no error occurs. There seems to be a mismatch between the network architectures provided in the paper and the network weights.
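One way to find all such mismatches at once is to compare the per-layer kernel shapes from the paper's tables against the shapes actually stored in the checkpoint (in TF 1.x, tf.train.list_variables returns (name, shape) pairs for a checkpoint). A small illustrative check with hard-coded stand-in shapes; the layer names and the first conv's shape here are hypothetical, only the [1,...] vs [3,...] mismatch is taken from the error above:

```python
# Illustrative only: compare kernel shapes built from the paper's tables
# against shapes recovered from a checkpoint (hard-coded stand-ins here).
def find_shape_mismatches(paper_shapes, ckpt_shapes):
    """Return names of layers whose shapes differ between the two specs."""
    return [name for name, shape in paper_shapes.items()
            if ckpt_shapes.get(name) != shape]

# Net 3's second conv as reported above: a graph built from the paper
# expects [1, 32, 32, 32], while the checkpoint holds [3, 32, 32, 32].
paper = {"conv1/kernel": [2, 32, 32, 32], "conv2/kernel": [1, 32, 32, 32]}
ckpt = {"conv1/kernel": [2, 32, 32, 32], "conv2/kernel": [3, 32, 32, 32]}
print(find_shape_mismatches(paper, ckpt))  # -> ['conv2/kernel']
```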

Another question: in the paper, a filterbank consisting of 36 bandpass filters is used. However, in the code, waveforms are decomposed into 39 frequency bands, e.g. in tf_record_CNN_spherical_gradcheckpoint_valid_pad.py (lines 396 to 406):

# Do not change parameters below unless altering network #
BKGD_SIZE = [78, 48000]
STIM_SIZE = [78, 89999]
# TONE_SIZE = [78, 59099]
# ITD_TONE_SIZE = [78, 39690]
if zero_padded:
    STIM_SIZE = [78, 48000]

if stacked_channel:
    STIM_SIZE = [39, 48000, 2]
    BKGD_SIZE = [39, 48000, 2]
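Note that 78 in the unstacked shapes appears to be the 39 subbands for each of the two ears concatenated along the band axis (this layout is my assumption, not stated in the thread). A tiny bookkeeping check:

```python
import numpy as np

n_bands, n_samples = 39, 48000  # 39 ERB subbands, 1 s at 48 kHz

# Stacked layout: separate trailing channel axis for the two ears.
stacked = np.zeros([n_bands, n_samples, 2])

# Assumed unstacked layout: both ears concatenated along the band axis.
unstacked = np.concatenate([stacked[:, :, 0], stacked[:, :, 1]], axis=0)
print(unstacked.shape)  # -> (78, 48000)
```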
Helmholz commented 2 years ago

Hi @bingo-todd, we are also trying to replicate the work of the original author; all networks seem OK in our replication, no mismatch found yet. Regarding the input dimension problem, we found the following code in a previous commit: https://github.com/afrancl/BinauralLocalizationCNN/commit/ece918cde5474d79aae153f970ce6aa40861b208

low_lim = 30
hi_lim = 20000
sr = 48000
sample_factor = 1
scale = 0.1
i = 0
pad_factor = None

# invert subbands

n = int(np.floor(erb.freq2erb(hi_lim) - erb.freq2erb(low_lim)) - 1)
sess.run(combined_iter.initializer)
subbands_test,az_label,elev_label = sess.run([combined_iter_dict[0]['train/image'],combined_iter_dict[0]['train/azim'],combined_iter_dict[0]['train/elev']])

filts, hz_cutoffs, freqs=erb.make_erb_cos_filters_nx(subbands_test.shape[2],sr, n,low_lim,hi_lim, sample_factor,pad_factor=pad_factor,full_filter=True)

filts_no_edges = filts[1:-1]
for batch_iter in range(3):
    for stim_iter in range(16):
        subbands_l=subbands_test[stim_iter,:,:,0]
        subbands_r=subbands_test[stim_iter,:,:,1]
        wavs = np.zeros([subbands_test.shape[2],2])
        wavs[:,0] = sb.collapse_subbands(subbands_l,filts_no_edges).astype(np.float32)
        wavs[:,1] = sb.collapse_subbands(subbands_r,filts_no_edges).astype(np.float32)
        max_val = wavs.max()
        rescaled_wav = wavs/max_val*scale
        name = "stim_{}_{}az_{}elev.wav".format(stim_iter+batch_iter*16,int(az_label[stim_iter])*5,int(elev_label[stim_iter])*5)
        name_with_path = newpath+'/'+name
        write(name_with_path,sr,rescaled_wav)
    pdb.set_trace()
    subbands_test,az_label,elev_label = sess.run([combined_iter_dict[0]['train/image'],combined_iter_dict[0]['train/azim'],combined_iter_dict[0]['train/elev']])


Perhaps this could explain where the 39 comes from. Notice that the line n = int(np.floor(erb.freq2erb(hi_lim) - erb.freq2erb(low_lim)) - 1) evaluates to 39 with the parameter settings above. We have tried to invert the cochleagram data back to wav using the above code and ERB band settings, and it seems to work.
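The band count can be reproduced without the library; a sketch assuming erb.freq2erb implements the Glasberg and Moore (1990) ERB-number scale, which is what pycochleagram uses (worth double-checking against the installed version):

```python
import math

def freq2erb(freq_hz):
    # Glasberg & Moore (1990) ERB-number scale (assumed to match
    # pycochleagram's erb.freq2erb).
    return 9.265 * math.log(1 + freq_hz / (24.7 * 9.265))

low_lim, hi_lim = 30, 20000
n = int(math.floor(freq2erb(hi_lim) - freq2erb(low_lim)) - 1)
print(n)  # -> 39
```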

bingo-todd commented 2 years ago

Hi @Helmholz, the mismatch I tried to report is between the descriptions in the paper and the network weights. When I first downloaded the network weights, "config_array.npy" was not included, so I used the network architectures described in the paper and found mismatches. I have since verified that there is a mismatch between the paper and "config_array.npy".

As for the input dimension, you are right: the number of frequency bands is set to 39 in the code, while 36 is used in the paper.

afrancl commented 2 years ago

Hi both,

To bingo-todd's point, there were two convolutional filters in the supplementary table that had typos. We recommend using the config array files to load the weights, as that is what we did ourselves.
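Since config_array.npy is a .npy file holding a Python object rather than a plain numeric array, loading it presumably needs allow_pickle=True. A sketch with a stand-in spec file (the real file's internal layout is an assumption here):

```python
import os
import tempfile

import numpy as np

def load_config_array(path):
    """Load a network spec saved as a NumPy object array (assumed format)."""
    return np.load(path, allow_pickle=True)

# Demo with a stand-in spec, since the real config_array.npy presumably
# holds one entry per layer of the architecture to rebuild before restore.
dummy = np.array([["conv", [2, 32, 32]], ["relu"]], dtype=object)
path = os.path.join(tempfile.gettempdir(), "config_array_demo.npy")
np.save(path, dummy)
spec = load_config_array(path)
print([layer[0] for layer in spec])  # -> ['conv', 'relu']
```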