ESIPFed / gsoc

Project ideas and mentor guidance for ESIP members to participate in Google Summer of Code.
Apache License 2.0

OrcaCNN: Detecting and classifying killer whales from acoustic data #17

Closed. yosoyjay closed this issue 4 years ago.

yosoyjay commented 5 years ago

ESIP Member Organization

Alaska Ocean Observing System (AOOS) and Axiom Data Science

Mentors

Jesse Lopez, Dan Olsen

Project Idea

Build, compare, and analyze neural network models to detect killer whales in passive acoustic data.

Information for students

Just the general information for students. See the ESIP Student Guide.

Abstract

Killer whales, or orcas, inhabit the world's oceans, but details about local populations are difficult to discern due to a lack of data. The recent increase in hydrophone (underwater microphone) deployments has provided researchers with raw data to study killer whale populations. Unfortunately, the tools currently available to automatically identify killer whales still require substantial manual verification and do not provide any detail about the particular pod, or killer whale group, that was detected. This project aims first to aid killer whale researchers by developing a custom Convolutional Neural Network classifier that automatically identifies killer whales in a passive acoustic dataset from Alaska. The second, more ambitious goal is to extend the model to detect the presence of killer whales and identify the particular pod.

Technical stuff

Python: TensorFlow, PyData (numpy, scipy, sklearn, matplotlib, et al.)

Helpful Experience, but not required!

Python, machine learning, acoustics, data analysis and visualization

First steps

  1. Read https://ai.googleblog.com/2018/10/acoustic-detection-of-humpback-whales.html

  2. Read https://github.com/jaimeps/whale-sound-classification

  3. Explore the data that will be used for this project and work through tutorials for training CNNs on audio data (see the sketch after this list)

  4. Project repo
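
For orientation, here is a minimal sketch (not project code) of what "training a CNN on audio data" can look like: clips are converted to log-mel spectrograms with librosa and fed to a small tf.keras CNN. All file names, labels, and parameters below are placeholders to be replaced with the project data.

# Minimal, illustrative sketch: log-mel spectrograms + a small CNN classifier.
# Paths, labels, sample rate, and clip length are placeholder assumptions.
import numpy as np
import librosa
import tensorflow as tf

SR = 16000          # assumed sample rate
CLIP_SECONDS = 3    # assumed fixed clip length
N_MELS = 64

def clip_to_logmel(path):
    """Load a clip, pad/trim to a fixed length, and return a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    y = librosa.util.fix_length(y, size=SR * CLIP_SECONDS)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS)
    return librosa.power_to_db(mel, ref=np.max)

# Hypothetical lists of clip paths and 0/1 labels (orca call vs. background).
paths, labels = ["call_001.wav", "noise_001.wav"], [1, 0]
X = np.array([clip_to_logmel(p) for p in paths])[..., np.newaxis]
y = np.array(labels)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=X.shape[1:]),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=8)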

yosoyjay commented 5 years ago

@sainimohit23 Very nicely done; being able to accurately identify when the calls start and end is critical for the project. How does the method respond if there are multiple calls in very close proximity? Not a great example, but this sample has two back-to-back calls. Does your method identify them as a single call or as two?

Also, please use Markdown to post code and results. It's much easier to read, parse, and search for specific text and results if it's not embedded in an image. Thanks!

yosoyjay commented 5 years ago

@kunakl07 I like where this is heading; it does look like your model has successfully identified calls in the long samples - very good! So a next step in your model building would be to separate and identify all of the calls made in those long samples. Given that, what do you think the next steps would be?

Also, as I mentioned in the comment above, please post your code and results in Markdown. All you have to do is cut and paste the results from the notebook instead of creating and attaching a screenshot. Thanks!

yosoyjay commented 5 years ago

Hi @Ahmed-Moselhy! Thanks for your interest in this project. Please have a look at the First Steps at the top of the issue. The posts from other folks interested in the project may also help and give you a sense of the type of work associated with the project. Good luck!

sainimohit23 commented 5 years ago

@yosoyjay I just created an example with multiple calls. Here is the audio file. Results:

Start Time : 1.5869565217391304
Start Time : 2.8840579710144927
Start Time : 4.804347826086957
Start Time : 6.108695652173913

I have also thought of a pipeline for both detecting and classifying the orca calls. I would like to discuss it with you via e-mail, LinkedIn, or any other means. If you want, I can share (only with you) the code for the work that I have done so far.
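
For reference, here is a rough sketch of one way to estimate call start times (not necessarily the method used above), using librosa's onset detector. The file name is a placeholder, and the detector would need tuning on real data.

# Rough sketch: estimate call start times with librosa onset detection.
# "multi_call_example.wav" is a placeholder file name.
import librosa

y, sr = librosa.load("multi_call_example.wav", sr=None)
onset_frames = librosa.onset.onset_detect(y=y, sr=sr, backtrack=True)
start_times = librosa.frames_to_time(onset_frames, sr=sr)
for t in start_times:
    print("Start Time :", t)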

kunakl07 commented 5 years ago

@yosoyjay Sir, I have identified all the calls separately in a single long sample. The 32-second long sample is divided into fixed-size chunks of 4 seconds and saved. Then our model predicts which type of call each of these smaller chunks contains (weeya, double_weeya, etc.).

After increasing the number of epochs to 8000, increasing the feature dimensions, and adding a few more CNN layers, the model has correctly predicted 19 calls out of 20, even on an augmented long sample that contains noise and mixed sounds.

Sir, the next task for me would be to recognize the particular killer whale pod, and to identify the type of call accurately even when multiple calls from different pods occur back to back within a very short duration of time.

Okay Sir, from next time I will post my code and results in Markdown.
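
For readers following along, here is a minimal sketch of the chunk-and-classify approach described in this comment: split a long recording into fixed 4-second chunks and run a trained classifier on each. The model file, class names, and input shape are placeholder assumptions, not artifacts from this thread.

# Sketch: split a long recording into 4-second chunks and classify each chunk.
# "orca_cnn.h5", the class list, and the input shape are hypothetical; the loaded
# model is assumed to accept a (n_mels, frames, 1) spectrogram.
import numpy as np
import librosa
import tensorflow as tf

SR, CHUNK_SECONDS = 16000, 4
CLASSES = ["noise", "weeya", "double_weeya"]  # hypothetical class list

model = tf.keras.models.load_model("orca_cnn.h5")
y, _ = librosa.load("long_sample_32s.wav", sr=SR, mono=True)

chunk_len = SR * CHUNK_SECONDS
for i in range(0, len(y) - chunk_len + 1, chunk_len):
    chunk = y[i:i + chunk_len]
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=chunk, sr=SR))
    probs = model.predict(mel[np.newaxis, ..., np.newaxis], verbose=0)[0]
    print(f"{i / SR:5.1f}s - {(i + chunk_len) / SR:5.1f}s: {CLASSES[int(np.argmax(probs))]}")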

sainimohit23 commented 5 years ago

Hi @yosoyjay,

I have made a model to classify the data that you have provided. In data preprocessing, I first standardized the sampling rate and channels. Then I created training, test, and validation sets with 150, 30, and 30 samples (negatives included), respectively. All the samples are 1 second long. I used the data to train a simple 4-layer model for 20 epochs, and here are the results.

Train on 150 samples, validate on 30 samples
Epoch 1/20
150/150 [==============================] - 120s 797ms/step - loss: 10.5774 - acc: 0.3133 - val_loss: 6.4932 - val_acc: 0.3333
Epoch 2/20
150/150 [==============================] - 95s 630ms/step - loss: 8.2801 - acc: 0.4867 - val_loss: 6.5730 - val_acc: 0.5667
Epoch 3/20
150/150 [==============================] - 101s 672ms/step - loss: 7.2705 - acc: 0.4867 - val_loss: 2.1130 - val_acc: 0.5667
Epoch 4/20
150/150 [==============================] - 104s 692ms/step - loss: 1.6563 - acc: 0.5000 - val_loss: 0.8586 - val_acc: 0.6667
Epoch 5/20
150/150 [==============================] - 99s 660ms/step - loss: 1.0591 - acc: 0.6133 - val_loss: 0.8566 - val_acc: 0.7000
Epoch 6/20
150/150 [==============================] - 93s 623ms/step - loss: 0.7327 - acc: 0.7600 - val_loss: 0.3979 - val_acc: 0.9000
Epoch 7/20
150/150 [==============================] - 89s 590ms/step - loss: 0.5620 - acc: 0.8200 - val_loss: 0.6417 - val_acc: 0.8000
Epoch 8/20
150/150 [==============================] - 88s 586ms/step - loss: 0.4399 - acc: 0.8867 - val_loss: 0.2644 - val_acc: 0.9667
Epoch 9/20
150/150 [==============================] - 88s 584ms/step - loss: 0.3267 - acc: 0.8933 - val_loss: 0.3650 - val_acc: 0.9000
Epoch 10/20
150/150 [==============================] - 88s 584ms/step - loss: 0.3424 - acc: 0.9067 - val_loss: 0.2888 - val_acc: 0.9667
Epoch 11/20
150/150 [==============================] - 88s 588ms/step - loss: 0.3178 - acc: 0.9267 - val_loss: 0.3092 - val_acc: 0.9333
Epoch 12/20
150/150 [==============================] - 89s 591ms/step - loss: 0.1793 - acc: 0.9467 - val_loss: 0.2859 - val_acc: 0.9667
Epoch 13/20
150/150 [==============================] - 88s 587ms/step - loss: 0.1502 - acc: 0.9467 - val_loss: 0.2382 - val_acc: 0.9667
Epoch 14/20
150/150 [==============================] - 88s 588ms/step - loss: 0.1626 - acc: 0.9400 - val_loss: 0.2561 - val_acc: 0.9333
Epoch 15/20
150/150 [==============================] - 89s 596ms/step - loss: 0.2501 - acc: 0.9400 - val_loss: 0.8454 - val_acc: 0.9333
Epoch 16/20
150/150 [==============================] - 88s 589ms/step - loss: 0.1180 - acc: 0.9600 - val_loss: 0.5235 - val_acc: 0.8667
Epoch 17/20
150/150 [==============================] - 90s 601ms/step - loss: 0.0588 - acc: 0.9800 - val_loss: 0.4427 - val_acc: 0.9000
Epoch 18/20
150/150 [==============================] - 88s 589ms/step - loss: 0.1057 - acc: 0.9533 - val_loss: 0.2725 - val_acc: 0.9667
Epoch 19/20
150/150 [==============================] - 91s 608ms/step - loss: 0.0214 - acc: 1.0000 - val_loss: 0.3444 - val_acc: 0.9333
Epoch 20/20
150/150 [==============================] - 88s 586ms/step - loss: 0.0594 - acc: 0.9800 - val_loss: 0.8254 - val_acc: 0.9000

Test set results:

30/30 [==============================] - 1s 50ms/step
Loss : 0.12645812332630157
Accuracy : 0.9666666388511658

Earlier I prepared a dataset of 2-second clips by overlaying clips shorter than 2 seconds on some backgrounds. As it turned out, the model was having a hard time training on that data due to the difference between the frequencies of the backgrounds and the overlaid audio. So I switched to 1-second audio clips, and I had to drop "double weeya" (3 seconds).

[Spectrogram of one of the artificially generated samples, 2 seconds long.]
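
For reference, here is a minimal sketch of the kind of preprocessing described above (standardizing the sample rate and channel count, then cutting fixed 1-second clips). The paths and the 16 kHz target are assumptions, not the dataset's actual parameters.

# Sketch: standardize sample rate and channels with pydub, then cut 1-second clips.
# The 16 kHz target and all paths are placeholder assumptions.
import os
from pydub import AudioSegment

TARGET_SR, CLIP_MS = 16000, 1000

def standardize_and_split(in_path, out_dir):
    audio = AudioSegment.from_file(in_path)
    audio = audio.set_frame_rate(TARGET_SR).set_channels(1)   # standardize rate + mono
    base = os.path.splitext(os.path.basename(in_path))[0]
    clips = []
    for i, start in enumerate(range(0, len(audio) - CLIP_MS + 1, CLIP_MS)):
        clip = audio[start:start + CLIP_MS]                   # pydub slices in milliseconds
        out_path = os.path.join(out_dir, f"{base}_{i:03d}.wav")
        clip.export(out_path, format="wav")
        clips.append(out_path)
    return clips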

kunakl07 commented 5 years ago

Hi @sainimohit23, my model takes a dataset of 3-second (and longer) audio clips as input, and it predicts the correct class of the calls in the long samples with overlaid background noise pretty well. Therefore, I think you should initially perform template matching and include only the lower range of frequencies in the spectrogram (0-200 Hz), since the whale calls are commonly located in this area of the plot, and then calculate the centroid of the frequency bins.

Next, enhance the contrast of the spectrogram by capping the extreme values, so that you can extract subtle details without losing them.

Then, to reduce false positives from frequencies in the 200-400 Hz range, which is above the frequency range of right whale calls, crop the contrast-enhanced image to keep only the band below 200 Hz. After that, I think you could train on audio samples longer than 3 seconds pretty well!
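
For anyone trying this, here is a rough illustration of the cropping and contrast-capping steps suggested above; the file name, FFT size, and percentile limits are arbitrary choices for the sketch.

# Rough illustration: compute a spectrogram, keep only the 0-200 Hz bins, and cap
# extreme values to enhance contrast. Parameters are arbitrary for this sketch.
import numpy as np
import librosa

y, sr = librosa.load("sample_call.wav", sr=None)
S = np.abs(librosa.stft(y, n_fft=2048))             # magnitude spectrogram
freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
S_low = S[freqs <= 200, :]                          # keep only the 0-200 Hz band
lo, hi = np.percentile(S_low, [5, 95])
S_capped = np.clip(S_low, lo, hi)                   # cap extremes to boost contrast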

sainimohit23 commented 5 years ago

@kunakl07 thanks, I'll try that. But don't you think 8000 epochs is too many for a small dataset like that?

kunakl07 commented 5 years ago

@sainimohit23, I performed data augmentation and increased my dataset substantially. This was done by adding noise and performing shifting and stretching operations. I also mixed in background noise and performed speed tuning and volume tuning. Here's the code, which you can use to perform data augmentation on your audio clips.

Hi @yosoyjay, after combining the sound of a ship's horn (which we can hear near the coast) with a long sample wave, I got the following audio result. We can also perform various augmentations like adding noise and stretching and shifting the sound file. I have given the code below so that @ZER-0-NE and @sainimohit23 can increase their datasets and use it to make their models more accurate.

import librosa
import numpy as np
import matplotlib.pyplot as plt

class AudioAugmentation:
    def read_audio_file(self, file_path):
        # Load at 16 kHz so the 160000-sample cap (10 s) matches write_audio_file's rate.
        input_length = 160000
        data = librosa.core.load(file_path, sr=16000)[0]
        if len(data) > input_length:
            data = data[:input_length]
        else:
            data = np.pad(data, (0, max(0, input_length - len(data))), "constant")
        return data

    def write_audio_file(self, file, data, sample_rate=16000):
        librosa.output.write_wav(file, data, sample_rate)

    def plot_time_series(self, data):
        fig = plt.figure(figsize=(8, 8))
        plt.title('Raw wave ')
        plt.ylabel('Amplitude')
        plt.plot(np.linspace(0, 1, len(data)), data)
        plt.show()

    def add_noise(self, data):
        # Add low-amplitude Gaussian noise to the waveform.
        noise = np.random.randn(len(data))
        data_noise = data + 0.005 * noise
        return data_noise

    def shift(self, data):
        # Roll the waveform by 1600 samples (0.1 s at 16 kHz).
        return np.roll(data, 1600)

    def stretch(self, data, rate=1):
        # Time-stretch, then pad/trim back to the fixed length.
        input_length = 160000
        data = librosa.effects.time_stretch(data, rate)
        if len(data) > input_length:
            data = data[:input_length]
        else:
            data = np.pad(data, (0, max(0, input_length - len(data))), "constant")
        return data

aa = AudioAugmentation()

data = aa.read_audio_file("OrcaCNN-data/data/long_samples/long_sample_01.wav")
aa.plot_time_series(data)

data_noise = aa.add_noise(data)
aa.plot_time_series(data_noise)

data_roll = aa.shift(data)
aa.plot_time_series(data_roll)

data_stretch = aa.stretch(data, 0.8)
aa.plot_time_series(data_stretch)

aa.write_audio_file('OrcaCNN-data/data/long_samples/generated_lsw001.wav', data_noise)
aa.write_audio_file('OrcaCNN-data/data/long_samples/generated_lsw002.wav', data_roll)
aa.write_audio_file('OrcaCNN-data/data/long_samples/generated_lsw003.wav', data_stretch)
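
(A small caveat if you run the snippet above with a recent librosa: librosa.output.write_wav was removed in librosa 0.8, so on newer versions you would write the files with the soundfile package instead, e.g. soundfile.write(file, data, sample_rate).)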

You can also generate graphs to see the difference between the real and augmented sound samples. This is the code for overlaying and merging different sounds.

from pydub import AudioSegment

sound1 = AudioSegment.from_file("OrcaCNN-data/data/long_samples/long_sample_01.wav")
sound2 = AudioSegment.from_file("boathorn.wav")

combined = sound1.overlay(sound2)

combined.export("OrcaCNN-data/data/mixed_001.wav", format='wav')

The graph of the mixed audio would look like this:

[Waveform plot of the mixed audio sample.]

While performing predictions on the mixed sound, the long sample was divided into equal chunks of 7 seconds each, and the model ignored the ship horn and detected only the calls in the mixed sample as positive (you can refer to the code in the previous comment). Here, the chunk containing the horn is predicted negative, while the chunks where we can hear the calls are predicted positive. So even after augmenting and merging in other sounds, this model is able to divide the sample into segments and predict them correctly.

Apart from this, I have also used conditional GANs (cGANs) to increase my dataset, which proved to be pretty useful for augmentation.

ZER-0-NE commented 5 years ago

> Therefore, I think you should initially perform template matching and include only the lower range of frequencies in the spectrogram (0-200 Hz), since the whale calls are commonly located in this area of the plot, and then calculate the centroid of the frequency bins.

Hi @kunakl07, you are right in your thinking, but I guess the range for right whale up-sweeping calls belongs to 100-200 Hz. Interesting of you to use conditional GANs. I wanted to know why you think they would work out well in this case? And regarding your dataset, what is its size after applying data augmentation?

kunakl07 commented 5 years ago

@ZER-0-NE, the reason I've used cGANs is that they squeeze the most out of small data. Here we have two networks: a Generator (which tries to enhance the input noisy spectrogram) and a Discriminator (which tries to distinguish between enhanced spectrograms), classifying real and fake spectrograms and giving feedback to the Generator. That is, if a call is a weeya and it is correctly classified as 'weeya', the GAN will be able to produce more 'weeya' examples by adjusting its parameters. Similarly, spectrograms of the other classes can also be generated. Thus, you too can use GANs to generate more audio calls from a small dataset. We can also change the variability and fidelity. Apart from this, you can vary the inception score to generate various audio clips of a particular class. Here's the code for inception_score that you can use:

import numpy as np
import tensorflow as tf  # note: this snippet uses TensorFlow 1.x APIs (tf.contrib, tf.Session)


def inception_score(
    audio_fps,
    k,
    metagraph_fp,
    ckpt_fp,
    batch_size=100,
    tf_ffmpeg_ext=None,
    fix_length=False):
  use_tf_ffmpeg = tf_ffmpeg_ext is not None
  if not use_tf_ffmpeg:
    from scipy.io.wavfile import read as wavread

  if len(audio_fps) % k != 0:
    raise Exception('audio file ({}) not divisible  ({})'.format(len(audio_fps), k))
  group_size = len(audio_fps) // k

  graph = tf.Graph()
  with graph.as_default():
    saver = tf.train.import_meta_graph(metagraph_fp)

    if use_tf_ffmpeg:
      x_fp = tf.placeholder(tf.string, [])
      x_bin = tf.read_file(x_fp)
      x_samps = tf.contrib.ffmpeg.decode_audio(x_bin, tf_ffmpeg_ext, 16000, 1)[:, 0]
  x = graph.get_tensor_by_name('x:0')
  scores = graph.get_tensor_by_name('scores:0')

  sess = tf.Session(graph=graph)
  saver.restore(sess, ckpt_fp)

  _all_scores = []
  for i in range(0, len(audio_fps), batch_size):
    batch = audio_fps[i:i+batch_size]

    _xs = []
    for audio_fp in batch:
      if use_tf_ffmpeg:
        _x = sess.run(x_samps, {x_fp: audio_fp})
      else:
        fs, _x = wavread(audio_fp)
        if fs != 16000:
          raise Exception('Invalid sample rate ({})'.format(fs))
        if _x.dtype==np.int16:
            _x = _x.astype(np.float32)
            _x /= 32767.

      if _x.ndim != 1:
        raise Exception('Invalid shape ({})'.format(_x.shape))

      if fix_length:
        _x = _x[:16384]
        #_x = _x[-16384:]
        _x = np.pad(_x, (0, 16384 - _x.shape[0]), 'constant')

      if _x.shape[0] != 16384:
        raise Exception('Invalid number of samples ({})'.format(_x.shape[0]))

      _xs.append(_x)

    _all_scores.append(sess.run(scores, {x: _xs}))

  sess.close()

  _all_scores = np.concatenate(_all_scores, axis=0)
  _all_labels = np.argmax(_all_scores, axis=1)

  _inception_scores = []
  for i in range(k):
    _group = _all_scores[i * group_size:(i + 1) * group_size]
    _kl = _group * (np.log(_group) - np.log(np.expand_dims(np.mean(_group, 0), 0)))
    _kl = np.mean(np.sum(_kl, 1))
    _inception_scores.append(np.exp(_kl))

  return np.mean(_inception_scores), np.std(_inception_scores), _all_labels

Here's how you can calculate the features and vary them:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x APIs (tf.placeholder, tf.contrib, tf.Session)
from scipy.io.wavfile import read as wavread
from tqdm import tqdm

# Graph that converts a 16 kHz waveform into log-mel features.
x = tf.placeholder(tf.float32, [None])
x_trim = x[:16384]
x_trim = tf.pad(x_trim, [[0, 16384 - tf.shape(x_trim)[0]]])
X = tf.contrib.signal.stft(x_trim, 2048, 128, pad_end=True)
X_mag = tf.abs(X)
W_mel = tf.contrib.signal.linear_to_mel_weight_matrix(
    num_mel_bins=128,
    num_spectrogram_bins=1025,
    sample_rate=16000,
    lower_edge_hertz=40.,
    upper_edge_hertz=7800.,
)
X_mel = tf.matmul(X_mag, W_mel)
X_lmel = tf.log(X_mel + 1e-6)
X_feat = X_lmel

# wav_fps is assumed to be a list of paths to 16 kHz WAV files.
with tf.Session() as sess:
    _X_feats = []
    for wav_fp in tqdm(wav_fps):
        _, _x = wavread(wav_fp)
        _X_feats.append(sess.run(X_feat, {x: _x}))
    _X_feats = np.array(_X_feats)

Apart from this, I have also performed similarity matching using nearest neighbors and added a noise envelope. If you want the code for any of these, I would be happy to help.
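
For readers who want a concrete starting point, here is a minimal sketch of the conditional-GAN setup described above: a generator that produces a spectrogram patch conditioned on the call class, and a discriminator that judges (spectrogram, class) pairs. The shapes, class count, and layer sizes are illustrative assumptions, the training loop is omitted, and this is not the actual model used in this thread.

# Minimal cGAN sketch for spectrogram patches, conditioned on the call class.
# NUM_CLASSES, LATENT_DIM, and SPEC_SHAPE are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES, LATENT_DIM, SPEC_SHAPE = 4, 100, (64, 64, 1)

def build_generator():
    noise = layers.Input(shape=(LATENT_DIM,))
    label = layers.Input(shape=(1,), dtype="int32")
    lab = layers.Flatten()(layers.Embedding(NUM_CLASSES, LATENT_DIM)(label))
    x = layers.multiply([noise, lab])                  # condition the noise on the class
    x = layers.Dense(8 * 8 * 128, activation="relu")(x)
    x = layers.Reshape((8, 8, 128))(x)
    x = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(1, 4, strides=2, padding="same", activation="tanh")(x)
    return tf.keras.Model([noise, label], out)

def build_discriminator():
    spec = layers.Input(shape=SPEC_SHAPE)
    label = layers.Input(shape=(1,), dtype="int32")
    lab = layers.Embedding(NUM_CLASSES, SPEC_SHAPE[0] * SPEC_SHAPE[1])(label)
    lab = layers.Reshape(SPEC_SHAPE)(layers.Flatten()(lab))
    x = layers.Concatenate()([spec, lab])              # attach the class as an extra channel
    x = layers.Conv2D(32, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    out = layers.Dense(1, activation="sigmoid")(x)     # real vs. generated
    return tf.keras.Model([spec, label], out)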

yosoyjay commented 5 years ago

@kunakl07 @sainimohit23 @ZER-0-NE

Hi all, very nice progress and work thus far. I like that you have all approached this in slightly different ways, and I appreciate the fact that you have all been open to sharing your approaches to the problem.

To provide a bit of additional context for actually applying to the program, here is some detail on what is needed for the project. This should help with the application:

We need:

  1. Data pre-processing infrastructure to get the input data and put it in a standard format including, but not limited to: file length, audio quality attributes, metadata, etc. This will help with the development of the model.
  2. A model that can identify orca calls and which pod they are from and perhaps the specific signal. This may be two models, the first can be used to find where the orca calls are in the files and the second identifies the pod and signal. Design will be dependent on results.
  3. A web app where folks can upload the data and identify if, and where, orca calls were found.

Parts 1 and 2 will constitute a pipeline that should have a REST API. That will allow for both programmatic access and for independent development of the front-end code.

The emphasis is on parts 1) and 2), but making at least a basic but functional front-end for 3) is also a goal.
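
As a point of reference for applicants, here is a minimal sketch of what the REST API mentioned above could look like: a single endpoint that accepts an uploaded audio file, runs a placeholder detection pipeline, and returns detections as JSON. The endpoint name, the detect_calls() helper, and the response format are illustrative assumptions, not the project's actual design.

# Minimal Flask sketch of an upload-and-detect endpoint. detect_calls() is a
# placeholder for the real preprocessing + model pipeline (parts 1 and 2).
import tempfile
from flask import Flask, request, jsonify

app = Flask(__name__)

def detect_calls(wav_path):
    """Placeholder: run the preprocessing + model pipeline and return detections."""
    return [{"start_s": 12.4, "end_s": 14.1, "label": "orca", "confidence": 0.93}]

@app.route("/detections", methods=["POST"])
def detections():
    upload = request.files.get("audio")
    if upload is None:
        return jsonify({"error": "no 'audio' file in request"}), 400
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        upload.save(tmp.name)
        results = detect_calls(tmp.name)
    return jsonify({"filename": upload.filename, "detections": results})

if __name__ == "__main__":
    app.run(debug=True)

A client could then POST a recording with, for example, curl -F "audio=@sample.wav" http://localhost:5000/detections and parse the JSON response, which keeps the front-end fully decoupled from the model code.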

ZER-0-NE commented 5 years ago

@kunakl07 It's an impressive idea indeed to use a cGAN for improving the data collection. I believe that as long as mode collapse does not set in for the generator (and I'm assuming it didn't here), it would be a useful addition to use adversarial networks.

yosoyjay commented 5 years ago

Hi all, I've started filling out the OrcaCNN repo where the work for the project will take place.

I'll work on fleshing out the idea of the project there by opening issues and such. This may be helpful to guide applications.

kunakl07 commented 5 years ago

That's great sir!

llucifer97 commented 5 years ago

Hi, I am Ayush Raj, a student at Birla Institute of Technology, Mesra. I am very much interested in this project. Can someone please help me get started? I have gone through the blogs given above and cloned the dataset repository. What am I supposed to do next?

kunakl07 commented 5 years ago

Hi @llucifer97, initially you need to detect orca calls in your audio files. You can do this by building a model, giving it the training dataset provided by @yosoyjay as input, and successfully detecting the calls in the long samples, if present. Once you get an idea of what the problem looks like and how to work with spectrograms to detect calls, you can perform the steps above given by @yosoyjay.

yosoyjay commented 5 years ago

@llucifer97 Hi Ayush, yeah as @kunakl07 mentioned, the way to get started is to go through the first steps outlined at the top of the issue to understand what the work entails at a high-level. I'd also suggest having a look at what other interested folks have done thus far - lots of interesting work!

esip-lab commented 5 years ago

Hi all - a friendly reminder that there is ONE WEEK LEFT to submit your proposals for this project! Best of luck and we're excited to see what is submitted!

sainimohit23 commented 5 years ago

Congratulations Abhishek

hdsingh commented 5 years ago

Congrats @ZER-0-NE !!

ZER-0-NE commented 5 years ago

Thank you @hdsingh @sainimohit23