Real-time classification - main.py

dimstudio commented 4 years ago

In main.py, single intervals (e.g. chest compressions or strokes) will flow in one at a time via a TCP connection. This means that we will get smaller data files. We need to make sure the transformation applied to this data (rescaling, resampling, min-max normalization) is exactly the same as the one in model_training.py, since the processed intervals will be classified with the learned models.

dimstudio commented 4 years ago

Can we save the scaler and make sure to use that also for the future instances of chest compression?

HansBambel commented 4 years ago

Can we save the scaler and make sure to use that also for the future instances of chest compression?

Good idea. I just added saving the scaler after fitting it. When needed it can be loaded again with scaler = joblib.load('models/scaler.pkl')
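
The scaler has to be persisted because a single incoming interval cannot supply its own minimum and maximum; those values have to come from the training data. The sketch below illustrates the idea with a plain-Python stand-in for the fitted scaler, storing the per-feature minima and maxima as JSON instead of using joblib. The function names and the file name are made up for illustration and are not part of the repository.

```python
import json
import os
import tempfile


def fit_min_max(rows):
    """Compute per-feature min and max over the training rows."""
    columns = list(zip(*rows))
    return {"min": [min(c) for c in columns], "max": [max(c) for c in columns]}


def transform(rows, params):
    """Rescale rows to [0, 1] using the stored training statistics."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, lo, hi in zip(row, params["min"], params["max"]):
            span = hi - lo
            # A constant training feature has zero span; map it to 0.0.
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Training time: fit on the whole dataset and store the parameters.
training_rows = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
params_path = os.path.join(tempfile.mkdtemp(), "scaler_params.json")
with open(params_path, "w") as fh:
    json.dump(fit_min_max(training_rows), fh)

# Inference time: load the stored parameters and apply them to one sample.
with open(params_path) as fh:
    loaded_params = json.load(fh)
online_sample = [[2.5, 25.0]]
scaled_sample = transform(online_sample, loaded_params)
```

Fitting a fresh scaler on the single online sample would instead map every feature to a constant, which is why the training-time parameters are reused.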

dimstudio commented 4 years ago

Thanks, I also added the scaler to the TensorFlow implementation. I edited it like this to avoid overwriting:

dataset = folder.split("/")[1]
joblib.dump(scaler, "models/scaler_" + dataset + ".pkl")

HansBambel commented 4 years ago

Added an example in #13

This does not work properly yet, because I don't exactly know how it works with the TCP.

dimstudio commented 4 years ago

Thanks that's cool. I am not able to simulate the TCP connection yet.

You should consider the INPUT data as one session file (zip file) having only ONE sample, so 1 chest compression.

The expected output that has to be returned is a dictionary with target Classes and the classification result

{
    "classRate": 2,
    "classDepth": 0,
    "classRelease": 1
}

Some questions:

Is it only available for PyTorch or is there one for TensorFlow?
Are there big differences between PyTorch and TensorFlow in terms of performance? Which one seems to work best for you now?

HansBambel commented 4 years ago

I am only working on the PyTorch version, because PyTorch's documentation and support are way better than TensorFlow's. I also prefer the coding style of PyTorch since it is more Pythonic (TensorFlow 2.0 is trying to copy PyTorch by now, but it still is worse).

Performance-wise they are basically the same. PyTorch is used more in research right now and is getting increasingly popular, which helps in troubleshooting. The PyTorch developers are also quite active on the forums, which helps in finding solutions to your (mostly quite specific) problems.

HansBambel commented 4 years ago

You should consider the INPUT data as one session file (zip file) having only ONE sample, so 1 chest compression.

As far as I understand I do that.

The expected output that has to be returned is a dictionary with target Classes and the classification result { "classRate": 2, "classDepth": 0, "classRelease": 1 }

Alright, but then it is less general. We could add a rounding option after getting the model's prediction.
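
A minimal sketch of what such a rounding option could look like, assuming the model returns one numeric score per target class. The function name and the round_output flag are hypothetical; the class names are taken from the expected output quoted above.

```python
def to_result_dict(predictions, target_classes, round_output=True):
    """Pair each target class name with its prediction.

    predictions: one numeric model output per target class.
    round_output: if True, return integer class labels; otherwise keep
    the raw scores so the caller can inspect them.
    """
    if len(predictions) != len(target_classes):
        raise ValueError("expected one prediction per target class")
    if round_output:
        return {name: int(round(p)) for name, p in zip(target_classes, predictions)}
    return {name: float(p) for name, p in zip(target_classes, predictions)}


target_classes = ["classRate", "classDepth", "classRelease"]
raw_scores = [1.8, 0.2, 0.9]  # made-up scores for illustration
result = to_result_dict(raw_scores, target_classes)
unrounded = to_result_dict(raw_scores, target_classes, round_output=False)
```

Passing the class names in as a parameter keeps the function independent of any particular dataset.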

dimstudio commented 4 years ago

I am only working on the PyTorch version, because PyTorch's documentation and support are way better than TensorFlow's. I also prefer the coding style of PyTorch since it is more Pythonic (TensorFlow 2.0 is trying to copy PyTorch by now, but it still is worse).

Performance-wise they are basically the same. PyTorch is used more in research right now and is getting increasingly popular, which helps in troubleshooting. The PyTorch developers are also quite active on the forums, which helps in finding solutions to your (mostly quite specific) problems.

Alright, but I was asking what performance you get on the TableTennis and the CPR_experiment datasets with PyTorch and TF. From a first look, TF got higher prediction accuracy on these datasets.

dimstudio commented 4 years ago

Alright, but then it is less general. We could add a rounding option after getting the model's prediction.

Why less general? Because it writes the target_classes or because it gives answers without confidence?

dimstudio commented 4 years ago

Added an example in #13

This does not work properly yet, because I don't exactly know how it works with the TCP.

In any case, if I manage, I'll try this afternoon to see whether it works with PyTorch!

HansBambel commented 4 years ago

Alright, but I was asking what performance you get on the TableTennis and the CPR_experiment datasets with PyTorch and TF. From a first look, TF got higher prediction accuracy on these datasets.

Aah, that's what you meant. Sorry, I misunderstood.

I got the same performances as you posted in #9

Why less general?

Because we would need to hardcode the class names in the dictionary.

In any case, if I manage, I'll try this afternoon to see whether it works with PyTorch!

Do that! It's really straightforward!

dimstudio commented 4 years ago

Because we would need to hardcode the class names in the dictionary.

But those are already hardcoded in target_classes = ["classRate", "classDepth", "classRelease"]

HansBambel commented 4 years ago

You are right! Fixed it

dimstudio commented 4 years ago

I am rewriting main.py, which implements the TCP server. We need to change the online_classification function accordingly and check which parameters we need.

The function in main.py works more or less like this

def handle_client_connection(client_socket, port):
    request = client_socket.recv(10000000)
    # the chest compression arrives encoded as JSON (needs `import json`)
    json_string = json.loads(request.decode('ascii'))
    # online_classification still has to be changed to take the right parameters
    result_dict = online_classification(json_string)
    # the server replies with the results, serialised back to JSON
    client_socket.send(json.dumps(result_dict).encode())
    client_socket.close()
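
A single recv call returns only what has arrived so far, so a large request may come back truncated, and a dictionary has to be serialised before it can be sent. Below is a minimal sketch of one way to handle both, assuming the client closes its sending side once the request is complete. That framing, the changed signature and the stub classifier are assumptions for illustration, not the protocol actually used by main.py.

```python
import json
import socket


def receive_all(sock, chunk_size=4096):
    """Read from the socket until the peer closes its sending side."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def handle_client_connection(client_socket, classify):
    """Decode one JSON request, classify it and reply with JSON."""
    try:
        sample = json.loads(receive_all(client_socket).decode("utf-8"))
        reply = json.dumps(classify(sample)).encode("utf-8")
        client_socket.sendall(reply)
    finally:
        client_socket.close()


def stub_classify(sample):
    """Placeholder for online_classification; always returns fixed labels."""
    return {"classRate": 2, "classDepth": 0, "classRelease": 1}


# Local round trip over a connected socket pair instead of a real server.
client, server_side = socket.socketpair()
client.sendall(json.dumps({"Frames": []}).encode("utf-8"))
client.shutdown(socket.SHUT_WR)  # signals "request complete" to the handler
handle_client_connection(server_side, stub_classify)
response = json.loads(receive_all(client).decode("utf-8"))
client.close()
```

If the client keeps the connection open, a length prefix or a delimiter would be needed instead to mark the end of a request.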

Please note:

In this case the input data will be in JSON format. I also wrote a json_to_df function:

import re
import pandas as pd
from pandas import json_normalize

# load the json parsed data
# (to_exclude, the list of attribute names to drop, is expected to be defined elsewhere)
def json_to_df(data):
    df = pd.concat([pd.DataFrame(data),
                    json_normalize(data['Frames'])],
                   axis=1).drop(columns='Frames')
    df.columns = df.columns.str.replace("_", "")
    if not df.empty:
        df['frameStamp'] = pd.to_timedelta(df['frameStamp'])  # + start_script
        df.columns = df.columns.str.replace("frameAttributes", df["ApplicationName"].all())
        df = df.set_index('frameStamp').iloc[:, 2:]
        df = df[~df.index.duplicated(keep='first')]
        df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
        df = df.select_dtypes(include=['float64', 'int64'])
        df = df.loc[:, (df.sum(axis=0) != 0)]
        # KINECT fix (raw strings so that \d is not treated as a string escape)
        df.rename(columns=lambda x: re.sub(r'KinectReader.\d', 'KinectReader.', x), inplace=True)
        df.rename(columns=lambda x: re.sub(r'Kinect.\d', 'Kinect.', x), inplace=True)
        # Exclude irrelevant attributes
        for el in to_exclude:
            df = df[[col for col in df.columns if el not in col]]
        # back-fill missing values after converting everything to numeric
        df = df.apply(pd.to_numeric).bfill()
    else:
        print('Empty data frame. Did you wear Myo?')
    return df

dimstudio commented 4 years ago

Thanks, I also added the scaler to the TensorFlow implementation. I edited it like this to avoid overwriting:

dataset = folder.split("/")[1]
joblib.dump(scaler, "models/scaler_" + dataset + ".pkl")

Can you please do the same in model_training_pytorch.py?

HansBambel commented 4 years ago

I prefer to add _scaler to the model path instead of putting it at the front. This keeps it more general (for example when the path to a model is in a subfolder) and keeps the scaler next to the model it belongs to. When you put scaler in front, all the scalers will be grouped together.
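
As an illustration of this naming scheme, here is a small sketch that derives the scaler path from a model path given without a file extension. The helper name and the exact "_scaler.pkl" suffix are assumptions, not necessarily what the repository uses.

```python
from pathlib import PurePosixPath


def scaler_path_for(model_path):
    """Return the scaler path that sits next to the given model path."""
    path = PurePosixPath(model_path)
    return str(path.with_name(path.name + "_scaler.pkl"))


flat = scaler_path_for("models/lstm")
nested = scaler_path_for("models/cpr/lstm")
```

Because only the last path component is changed, the scaler stays in the same folder as its model, however deeply that folder is nested.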

dimstudio commented 4 years ago

Makes sense

dimstudio commented 4 years ago

Hey @HansBambel, I need to make the online classification work by the end of this week. Do you think you can take a look at how to change the online_classification function so that it can take one learning sample as input? Thank you.

HansBambel commented 4 years ago

@dimstudio I'll look into it!

HansBambel commented 4 years ago

Done in PR #16

Online classification now takes a path to the trained model and an input sample. This way may be very slow, because PyTorch is started for every sample. Maybe the loop should be done in online_classification() once the model is loaded.
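
One possible way around the repeated start-up cost is to cache the loaded model, so that only the first sample pays for loading. This is a sketch under assumptions: load_model_from_disk is a hypothetical stand-in for the real loading step, and the returned "model" is a dummy function.

```python
from functools import lru_cache

load_calls = []  # records how often the expensive load actually runs


def load_model_from_disk(path_to_model):
    """Hypothetical stand-in for the expensive model-loading step."""
    load_calls.append(path_to_model)
    return lambda sample: {"n_frames": len(sample)}


@lru_cache(maxsize=None)
def get_model(path_to_model):
    """Load each model at most once per process."""
    return load_model_from_disk(path_to_model)


def online_classification(path_to_model, input_sample):
    """Classify one sample, reusing the cached model."""
    model = get_model(path_to_model)
    return model(input_sample)


results = [online_classification("models/lstm", [[0.0]] * n) for n in (3, 5)]
```

The function signature stays the same, so the TCP handler can keep calling it once per incoming sample.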

I was assuming that the TCP-server gives me a correct input. I do not know how you do this though...

Another thing I have seen in main.py is that the function tensor_transform is there. Is this the same as in data_helper.py? It looks smaller.

dimstudio commented 4 years ago

Done in PR #16

Terrific thanks!

Another thing I have seen in main.py is that the function tensor_transform is there. Is this the same as in data_helper.py? It looks smaller.

It should be the same transformation, yes. It looks smaller because it operates on just one interval, so it does not need to cut the data into intervals and do the preprocessing for the entire dataset.

Btw, I am testing it right now and I need to adapt it a bit. I'll keep you posted!

dimstudio commented 4 years ago

I have a problem with loading the model in the online_classification function in main.py

torch.load gives me a dictionary with a state_dict value. I have to initialize the model variable to something; this was my attempt, but I cannot initialize MyLSTM without arguments.

    model = model_training_pytorch.MyLSTM()
    loaded = torch.load(f'{path_to_model}.pt')
    model = model.load_state_dict(loaded['state_dict'])
    model.eval()

HansBambel commented 4 years ago

This is what I have for loading there:

loaded = torch.load(f'{path_to_model}.pt')
model = loaded['model']
model.load_state_dict(loaded['state_dict'])
model.eval()

When you train with model_training_pytorch.py, it saves the model together with the classes and the parameters (at least it should), so that you don't need to worry about that when loading again: torch.save(dict(model=model, state_dict=model.state_dict()), f'{save_model_to}.pt')

dimstudio commented 4 years ago

I solved that issue; I had to retrain with the latest code. I am going to upload an example of an online sample to classify.

dimstudio commented 4 years ago

You can have a look at main.py which works with example_request.txt. Did I fail to convert it into a tensor?

HansBambel commented 4 years ago

The batch variable in process_data() is a dataframe and it contains a lot of NaNs. I think there is something wrong.

The model expects a three-dimensional tensor, the first dimension being the batch size, which in our case is 1. That makes the input shape something like 1x79x52. (At training time it is something like 64x17x52.)
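
Both concerns, the NaNs and the missing batch dimension, can be checked before the sample reaches the model. The sketch below uses nested lists and made-up helper names purely for illustration; the 17x52 interval size is taken from the shapes reported later in this thread.

```python
import math


def shape_of(batch):
    """Return (batch_size, time_steps, features) for a nested list."""
    return (len(batch), len(batch[0]), len(batch[0][0]))


def contains_nan(batch):
    """True if any value in the 3-D nested list is NaN."""
    return any(math.isnan(v) for interval in batch for row in interval for v in row)


def check_batch(batch, n_features):
    """Raise ValueError if the batch is not a clean 1 x T x n_features input."""
    batch_size, _, features = shape_of(batch)
    if batch_size != 1:
        raise ValueError(f"expected a batch of 1, got {batch_size}")
    if features != n_features:
        raise ValueError(f"expected {n_features} features, got {features}")
    if contains_nan(batch):
        raise ValueError("batch contains NaN values")


interval = [[0.5] * 52 for _ in range(17)]  # one interval: 17 steps x 52 features
batch = [interval]                          # add the batch dimension
check_batch(batch, n_features=52)
batch_shape = shape_of(batch)
```

With NumPy arrays the same checks would typically rely on the array's shape attribute and np.isnan.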

dimstudio commented 4 years ago

What about now? I have a dimension of 17x52.

HansBambel commented 4 years ago

np.stack didn't work. I replaced it (in #17) with expand_dims. But there is still the issue with the NaNs.

dimstudio commented 4 years ago

Are you sure? I do not have any empty value in the batch

Before resampling: (143, 52)
Shape of the interval is (17, 52)
Shape of the batch is (1, 17, 52)
Batch is containing nulls? False
Traceback (most recent call last):
  File "C:/Users/Daniele-WIN10/Documents/GitHub/SharpFlow/main.py", line 184, in <module>
    exampleData()
  File "C:/Users/Daniele-WIN10/Documents/GitHub/SharpFlow/main.py", line 135, in exampleData
    return_dict = process_data()
  File "C:/Users/Daniele-WIN10/Documents/GitHub/SharpFlow/main.py", line 159, in process_data
    result = online_classification("models/lstm",batch)
  File "C:/Users/Daniele-WIN10/Documents/GitHub/SharpFlow/main.py", line 171, in online_classification
    scaled_data = scaler.transform(input_sample)
  File "C:\Users\Daniele-WIN10\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 387, in transform
    force_all_finite="allow-nan")
  File "C:\Users\Daniele-WIN10\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 539, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.

Process finished with exit code 1

To me, it looks like the problem is more in how the scaler is called.

HansBambel commented 4 years ago

Yes, that is true. I now put expand_dims after the scaling and force the data to be a tensor (in #18), but there is still an error from the NaNs.

So now it is only complaining about the actual input values.
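
The ordering matters because the scaler works on 2-D data (time steps x features), while the model wants the extra batch dimension. Below is a plain-Python sketch of that order, using a made-up stand-in for scaler.transform and invented training statistics.

```python
def min_max_transform(rows, mins, maxs):
    """Stand-in for scaler.transform: accepts only 2-D input (rows x features)."""
    if rows and isinstance(rows[0][0], (list, tuple)):
        raise ValueError("expected a 2-D array, got 3-D input")
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


mins, maxs = [0.0, 0.0], [10.0, 4.0]  # assumed training statistics
interval = [[5.0, 1.0], [10.0, 2.0]]  # 2 time steps x 2 features

scaled = min_max_transform(interval, mins, maxs)  # scale while still 2-D
batch = [scaled]                                  # then add the batch dimension

try:
    min_max_transform([interval], mins, maxs)     # wrong order: 3-D input
    wrong_order_failed = False
except ValueError:
    wrong_order_failed = True
```

Scaling first and expanding afterwards avoids the "Found array with dim 3" error shown in the traceback above.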

dimstudio commented 4 years ago

Nice! It's working. Thanks @HansBambel!

Before resampling: (143, 52)
Shape of the interval is (17, 52)
Shape of the batch is (1, 17, 52)

{'classRelease': 1, 'classDepth': 1, 'classRate': 1, 'armsLocked': 1, 'bodyWeight': 1}

Process finished with exit code 0

dimstudio / SharpFlow

Real-time classification - main.py #5