
BAA Project

009 - Personal Alexa-like Speech Service :loudspeaker:

Implement an Alexa-like speech service in Python using Natural Language Processing

The project is part of the Business Analytics A class within the M.Sc. Business Analytics program at HSD University of Applied Sciences, Düsseldorf, supervised by Mr. Zeutschler.


Table of contents

  1. Business Understanding

    1.1 Determine Business Objectives

    1.2 Assess Situation

    1.3 Determine Data Mining Goals

  2. Methodological Approach
  3. Findings and Achievements

    3.1 Problems

    3.2 Achievements

  4. Summary
  5. Potential Future Developments

1) Business Understanding 🧠

Use Case

Example: program a Speech Recognition

To start off easy, we wrote down the following short use case:

Target :memo:

Our goal is to develop a voice control system that responds to the command "Hello Hal". After a user starts the speech assistant, it should start listening and process the spoken words. Furthermore, our speech assistant Hal should be able to give a spoken answer. It should contain the features mentioned below and act on them. Moreover, we want to create a complete GitHub repository with our code and a detailed description of our project. The project started in April 2021 and will be finished in August 2021. It was carried out as part of our master's degree "Business Analytics".

Features

We want to implement the features listed below in our speech assistant:

Next, we designed a detailed process which shows how we imagine the process to work:


Flowchart.pdf

Process Flow Chart

Below we designed a flow chart showing how our speech recognition works in three steps:

(Flow chart: our speech recognition process in three steps)

The flow chart already shows the three challenges we had:

CRISP-DM :arrows_counterclockwise:

We chose the CRISP-DM methodology (Cross-Industry Standard Process for Data Mining) as the approach for our project. CRISP-DM ensures a structured approach and can be used for any project related to data science.

The following illustration shows the typical phases of the model. It is not a linear process; moving back and forth between different phases is required again and again. The arrows in the illustration show the most important and frequent dependencies between phases, and the outer circle symbolizes the cyclic nature of data mining itself.

(Illustration: the CRISP-DM cycle)

1. Business understanding

In the first step, we defined our goals, requirements and potential success factors for speech recognition. We considered what must be possible in such an application so that it works well for the user and we achieve a satisfactory result in the end.

2. Data understanding

We thought about the data we would use and realised that the core of speech recognition is the audio files created during input. We also looked at how audio files work technically and which additional libraries are needed to work with them.

3. Data Preparation

In this step we made sure that the input data is usable for our tool. A speech input is converted into a text which can then be processed. Again, it was necessary to find out which libraries are needed for this.

4. Modelling

Different recognizers, divided into different tasks, can be used to process the input data. For this we used object-oriented programming.

5. Evaluation

In this phase we compared the current state of our work with the initially defined demands and goals. We evaluated how successful the project was, what was achieved and what was not and figured out future developments (chapter 5).

6. Deployment

In the last step we planned the deployment.


Specify Business Understanding (CRISP-DM)

After briefly explaining the CRISP-DM model in general, we want to go into more detail regarding our project:

1.1 Determine Business Objectives

Background

→ The basic steps of how speech recognition technology works are as follows:

Business Objectives

Business Success Criteria

1.2 Assess Situation

Inventory of Resources, Requirements, Assumptions and Constraints → Here are the different models used to build a speech recognition system:

Risks and Contingencies

Terminology

Costs and Benefits

→ Word error rate has its limitations, though. The measured data is affected by factors like:
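For reference, the standard definition of word error rate is:

WER = (S + D + I) / N

where S is the number of substituted words, D the number of deleted words, I the number of inserted words, and N the total number of words in the reference transcript.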

1.3 Determine Data Mining Goals

Data Mining Goals

Data Mining Success Criteria

Project Organization

Since we are a team of three students working together remotely, a structured project organization was very important for us. We worked with a Kanban board, which is shown in the illustration below. This simplified our communication a lot.

(Illustration: our Kanban board)

We structured the Kanban Board in four phases:

How Speech Recognition Works – An Overview

To get a complete overview of the business problem, it was necessary for us to understand how the speech recognition process itself works.

To show the speech recognition process in a simple way, we added the figure below.

(Figure: the speech recognition process. Source: https://towardsdatascience.com/speech-recognition-in-python-the-complete-beginners-guide-de1dd7f00726)

Now we want to give a detailed description of the process:

The first component of speech recognition is speech. Speech must be converted from physical sound to an electrical signal with a microphone, and then to digital data with an analog-to-digital converter. Once the speech is digitized, several models can be used to transcribe the audio to text. Most modern speech recognition systems rely on the Hidden Markov Model (HMM). This approach works on the assumption that a speech signal, when viewed on a short enough timescale, can be reasonably approximated as a stationary process, that is, a process in which statistical properties do not change over time.

One can imagine that this whole process may be computationally expensive. In many modern speech recognition systems, neural networks are used to simplify the speech signal using techniques for feature transformation and dimensionality reduction before HMM recognition. Voice activity detectors (VADs) are also used to reduce an audio signal to only the portions that are likely to contain speech. This prevents the recognizer from wasting time analyzing unnecessary parts of the signal.

While programming, we don't have to worry about that speech recognition process ourselves. There are several speech services/packages available to help us with it. We will explain the packages we used later on. (Source: https://realpython.com/python-speech-recognition/)

Some of the factors that make programming speech recognition more difficult are:

2) Methodological approach :file_folder:

In this chapter we want to give an overview of our methodological approach. The first step is to install all necessary libraries.

NumPy (Numerical Python) is an open source Python library used in most scientific and technical fields. It is the standard for working with numerical data in Python. It is used to perform mathematical operations on arrays, such as trigonometric, algebraic and statistical routines. The library contains a lot of mathematical, algebraic and transformation functions.
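As a small sketch of the kind of array math NumPy gives us (the signal here is synthetic, just for illustration):

```python
import numpy as np

# Synthetic one-second "audio" signal: a 440 Hz sine wave sampled at 16 kHz.
sample_rate = 16000
t = np.linspace(0, 1, sample_rate, endpoint=False)
signal = 0.5 * np.sin(2 * np.pi * 440 * t)

# Typical array operations on audio data: peak normalization and RMS energy.
normalized = signal / np.max(np.abs(signal))
rms = np.sqrt(np.mean(normalized ** 2))
print(f"RMS energy of the normalized signal: {rms:.3f}")
```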

Google has a speech recognition API. This API converts spoken words (microphone input) into written text (a Python string), called Speech to Text. You can simply speak into the microphone and the Google API will translate it into written text. Such a program takes the audio from your microphone, sends it to the Speech API, and returns a Python string.

The audio is recorded using the speech recognition module. Next, we send the recorded speech to the Google Speech Recognition API and return the output, as sketched below.
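A minimal sketch of this flow with the speech_recognition package (assuming a working microphone; by default recognize_google uses a built-in test API key):

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

# Capture audio from the default microphone.
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
    print("Say something...")
    audio = recognizer.listen(source)

# Send the recording to the Google Speech Recognition API and print the result.
try:
    print("You said:", recognizer.recognize_google(audio, language="en-US"))
except sr.UnknownValueError:
    print("Sorry, I could not understand the audio.")
except sr.RequestError as error:
    print(f"API request failed: {error}")
```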

PyAudio is a Python binding for PortAudio, a cross-platform audio input and output library. This basically means that we can use PyAudio to record and play audio on all platforms and operating systems, such as Windows, Mac and Linux. We added the illustration below to make the process clearer.

(Illustration: recording and playing audio with PyAudio)
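A minimal recording sketch with PyAudio (parameters and the output filename are our own choices):

```python
import wave
import pyaudio

# Recording parameters: 16 kHz, mono, 16-bit samples, 3 seconds.
RATE, CHANNELS, SECONDS, CHUNK = 16000, 1, 3, 1024

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=CHANNELS,
                 rate=RATE, input=True, frames_per_buffer=CHUNK)

print("Recording...")
frames = [stream.read(CHUNK) for _ in range(RATE // CHUNK * SECONDS)]
print("Done.")

stream.stop_stream()
stream.close()

# Save the raw frames as a WAV file.
with wave.open("recording.wav", "wb") as wf:
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(pa.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))

pa.terminate()
```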

spaCy is a huge library with many functions, and it is very important for our speech assistant to "understand" our spoken words. Below we list a few of the functions as an overview. It is an open source library for Natural Language Processing (NLP) in Python. Natural Language Processing captures natural language and texts and processes them with the help of algorithms and other rules.


The goal of Natural Language Processing is to make language and texts understandable for computers in order to operate or control them by speech. To extract meaning from speech or texts, it is necessary to understand not only individual words, but also entire sentences, contexts or topics.

Natural Language Processing starts with tokenisation. In this step, the text is divided into tokens. Tokens are words, spaces or punctuation marks. There are models with their own tokenisation rules for each language.

Part-of-speech (POS) tagging assigns grammatical properties such as verb, adjective, noun or adverb to the words.

Another step is lemmatisation. Here, the individual words are traced back to their basic forms.

With the help of Named Entity Recognition, it is possible to assign persons, places, times or other objects like company names to the recognized entities. Dependency parsing assigns syntax dependencies to the identified and tagged tokens. Word vectors are used to describe and recognize relationships between words.
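A small sketch of these steps with spaCy (assuming the small English model en_core_web_sm has been downloaded):

```python
import spacy

# Load the small English model (installed via: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello Hal, please open the Google homepage.")

# Tokenisation, POS tagging, lemmatisation and dependency parsing per token.
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)

# Named Entity Recognition: persons, places, organisations, ...
for ent in doc.ents:
    print(ent.text, ent.label_)
```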


pyttsx3 is a text-to-speech conversion library for Python. It works offline and is compatible with Python 2 and 3. In our case it is used to make the computer talk to us.
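A minimal text-to-speech sketch with pyttsx3 (the spoken sentence is just an example):

```python
import pyttsx3

# Initialise the offline text-to-speech engine.
engine = pyttsx3.init()
engine.setProperty("rate", 150)  # speaking speed in words per minute

# Queue a sentence and speak it.
engine.say("Hello, I am Hal, your personal speech assistant.")
engine.runAndWait()
```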

Defined classes

In the following, we describe which classes are implemented and what functions they contain. We wanted to create an appropriate GitHub repository with our code and a detailed description.

Class: Main (file main.py)

We start our speech assistant Hal in our main file.

Class: Hal (file hal.py)

Hal is our speech assistant :older_man:

Process of classes

The illustration below shows how the classes are connected with each other:


Process of the classes.pdf
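A simplified sketch of how the two classes could work together (method names and details are illustrative assumptions, not our exact implementation):

```python
# hal.py
import speech_recognition as sr
import pyttsx3

class Hal:
    """Our speech assistant: listens, understands and answers."""

    def __init__(self):
        self.recognizer = sr.Recognizer()
        self.engine = pyttsx3.init()

    def listen(self):
        # Record from the microphone and convert the speech to text.
        with sr.Microphone() as source:
            audio = self.recognizer.listen(source)
        return self.recognizer.recognize_google(audio)

    def speak(self, text):
        # Give a spoken answer.
        self.engine.say(text)
        self.engine.runAndWait()


# main.py
# from hal import Hal

if __name__ == "__main__":
    hal = Hal()
    hal.speak("Hello, I am Hal.")
    command = hal.listen()
    hal.speak("You said: " + command)
```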



3) Findings and achievements :construction:

In this chapter we want to give a detailed description of our approach, work, findings and concrete achievements.

:warning: First we will start with problems we had during our project and how we solved these problems :warning:

1. Problems

1.1 Installation

There was a problem installing the package "PyAudio". We found two different solutions, for Windows and for Mac, to solve the error that appeared.

1.) For Windows, type the following command into the PyCharm console:
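One widely cited fix (shown here as an assumption; the exact commands used in the project may have differed) is to install PyAudio through pipwin:

```
pip install pipwin
pipwin install pyaudio
```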

After the pip command is installed, the installation of PyAudio should work as well.

2.) For Mac, type the following command into the PyCharm console:
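One common solution (an assumption; it requires Homebrew) is to install the PortAudio dependency first:

```
brew install portaudio
pip install pyaudio
```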

or
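Pip can also be pointed at the PortAudio headers directly (a sketch; the paths assume a default Homebrew installation):

```
pip install --global-option='build_ext' --global-option='-I/usr/local/include' --global-option='-L/usr/local/lib' pyaudio
```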

3.) There was a problem installing the package "spaCy". We found a solution for Mac and Windows. Type the following code in your terminal:
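The usual installation (a sketch; the model name en_core_web_sm is our assumption) is:

```
pip install -U spacy
python -m spacy download en_core_web_sm
```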

1.2 Manual push

Sometimes we had trouble pushing our code to GitHub. This is how we solved the problem:

1.) Type in your terminal:
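A typical manual sequence (a sketch; the commit message and branch name are placeholders):

```
git add .
git commit -m "describe your changes"
git push origin main
```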


1.3 Common errors

2. Achievements

In this part we want to give a detailed description of how the speech recognition is used and what features we implemented.

This was the speech recognition process itself. Now we want to point out our biggest successes during the project a bit more.

4) Summary :construction:

All in all, it can be said that the task of developing a speech recognition service in an object-oriented way was successfully completed. Due to a lack of time, we were not able to implement all the features we wanted to add to the speech recognition. We will talk about that in chapter 5, "Potential future developments". Furthermore, we had a few challenges to face. The hardest ones for us were recording voices and making our speech assistant speak. Our personal success in this project was gaining a much better understanding of object-oriented programming and collecting valuable experience for our jobs. We got more used to developer tools such as GitHub and to working with a Kanban board. When we first started the project, none of us had any knowledge about object-oriented programming or Git. So even though we were not able to implement all the features we wanted to, the project was a personal success for us.

5) Potential future developments :rocket:

We started to interpret easy commands like simple questions (e.g. "How are you, Hal?"). But as already mentioned, we could not implement all the features we set as a target in chapter 1. Now we will briefly summarize the missing features:

spaCy

As we already explained, spaCy offers a wide range of possibilities for NLP, and we used just a few of them in our project:

Implement further libraries

Moreover, speech recognition assistants are a very important topic in general. We did some research on how far speech recognition technology has come and found the following cases very interesting. That's why we also want to share them with you:

Voice-Tech (Healthcare): AI-powered chatbots and virtual assistants played a vital role in the fight against COVID-19. Chatbots can help screen and triage patients. Voice and conversational AI have made health services more accessible to everyone who was unable to leave their home during COVID-19 restrictions. Now that patients have a taste of what is possible with voice and healthcare, behaviors are not likely to go back to pre-pandemic norms.

Voice Cloning: Machine learning technology and advances in GPU power help to create a custom voice and make speech more emotional, which makes the computer-generated voice indistinguishable from a real one. You simply use recorded speech, and a voice conversion technology then transforms your voice into another.