AIModelShare / aimodelshare

https://www.modelshare.org/
MIT License

Harvard Harmful Brain Activity Classification Contest - importing incompatibilities. #235

Open RTSRLLC opened 1 year ago

RTSRLLC commented 1 year ago

Background

I have pre-existing models trained on both public and private patient EEGs, unrelated to the Harvard Harmful Brain Activity Classification Contest sets. My intention is to directly make predictions on the Harvard sets—unseen so far—after performing necessary preprocessing steps on the data.

Upon reviewing the Harvard data (please correct me if I'm mistaken), I've noted that there are four batches containing raw EEG data. My focus lies solely on the 50-second segments. I am equipped to preprocess this data without needing additional training.

My understanding—apologies if incorrect—is that the .pickle files are the prediction sets. These files are in a 16 x 2000 format, and I presume they need transposing for prediction. However, the transposed feature count (16) is fewer than what the 50-second segments contain (20 features, including the EEG).
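For concreteness, this is roughly what I mean by transposing; the file name is a placeholder and I'm assuming each .pickle holds a single 16 x 2000 NumPy array:

import pickle

import numpy as np

# placeholder path; assuming each .pickle file holds one 16 x 2000 array
with open("example_prediction_segment.pickle", "rb") as f:
    segment = np.asarray(pickle.load(f))

print(segment.shape)   # expected: (16, 2000)
segment_t = segment.T  # (2000, 16): rows = time samples, columns = the 16 bipolar channels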

Current Workflow and Challenges

As a professional transitioning from pure math to Python programming and Data Science, I've found understanding ModelShareAI and the accompanying data quite challenging. My experience with Deep Learning primarily involves TensorFlow.

The ideal situation is to preprocess the Harvard prediction sets to match the input format of my saved models, enabling me to make predictions for the contest. If my models are robust, they should be capable of working with any data, aligning with the contest's objective.

Technical Issues

My research suggested that I could install aimodelshare via conda without conflicts with the packages used for my model training; however, I've encountered an issue, detailed below:

I've faced difficulties when trying to conda install aimodelshare from an existing environment or when creating a new conda environment. An UnsatisfiableError occurs in the terminal, suggesting that the package requires both a Mac OS and a Windows OS—a specification conflict for my Mac.

The actual terminal message is:

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

The following specifications were found to be incompatible with your system:

Your installed version is: 13.5

Note that strict channel priority may have removed packages required for satisfiability.

Here are the packages installed when I create a base conda environment:

(base) jshensley@MacBook-Pro conda_envs % conda activate ModelShareAI
(/Users/jshensley/Desktop/conda_envs/ModelShareAI) jshensley@MacBook-Pro conda_envs % conda list
# packages in environment at /Users/jshensley/Desktop/conda_envs/ModelShareAI:

# Name                    Version          Build               Channel
bzip2                     1.0.8            h3422bc3_4          conda-forge
ca-certificates           2023.5.7         hf0a4a13_0          conda-forge
libffi                    3.4.2            h3422bc3_5          conda-forge
libsqlite                 3.42.0           hb31c410_0          conda-forge
libzlib                   1.2.13           h53f4e23_5          conda-forge
ncurses                   6.4              h7ea286d_0          conda-forge
openssl                   3.1.1            h53f4e23_1          conda-forge
pip                       23.1.2           pyhd8ed1ab_0        conda-forge
python                    3.10.12          h01493a6_0_cpython  conda-forge
readline                  8.2              h92ec313_1          conda-forge
setuptools                68.0.0           pyhd8ed1ab_0        conda-forge
tk                        8.6.12           he1e0b03_0          conda-forge
tzdata                    2023c            h71feb2d_0          conda-forge
wheel                     0.40.0           pyhd8ed1ab_0        conda-forge
xz                        5.2.6            h57fd34a_0          conda-forge

After that, I attempt to install aimodelshare:

conda install aimodelshare

Unfortunately, this results in the following:

Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

The following specifications were found to be incompatible with your system:

Your installed version is: 13.5

Note that strict channel priority may have removed packages required for satisfiability.

Inquiries

  1. How can I access the aimodelshare requirements to ensure compatibility with my current conda packages and/or macOS M1? (See the sketch after this list.)
  2. Is there a direct method to preprocess the data and upload predictions to the contest site using my pre-existing models, to enhance efficiency?
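Regarding question 1: is something like the sketch below a reasonable way to check the declared requirements without installing? The PyPI JSON endpoint and field names are just my guess at the cleanest way to do this.

import json
from urllib.request import urlopen

# read aimodelshare's declared dependencies from the PyPI metadata
with urlopen("https://pypi.org/pypi/aimodelshare/json") as resp:
    meta = json.load(resp)

print("Requires-Python:", meta["info"]["requires_python"])
for requirement in meta["info"]["requires_dist"] or []:
    print(requirement)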

I would appreciate any clarification on these processes. Warmly, Scott

mikedparrott commented 1 year ago

Hi and thanks for taking the time to troubleshoot the Conda installation of aimodelshare for the competition. Your best bet is to manually submit predictions rather than using our library in Python.

You can navigate to the competition page and competition sub-tab at modelshare.ai and then click the red "submit predictions" button. You will see the following popup box and you can use it to download a csv file with example predictions. Simply replace these example predictions with the predictions from your model and upload them to generate a score on the leaderboard.

[Screenshot: "submit predictions" popup, Screen Shot 2023-07-07 at 2 20 50 PM]
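For instance, if your model's predictions are sitting in a Python list or array, filling in the downloaded file could look something like the sketch below (the file and column names are only illustrative; keep whatever headers the template uses):

import pandas as pd

# read the example-predictions file downloaded from the competition page
template = pd.read_csv("example_predictions.csv")  # placeholder file name

# suppose my_predictions holds one predicted class label per template row
my_predictions = ["Seizure"] * len(template)  # illustrative values only

# the column name below is a placeholder -- use the header from the template
template["prediction"] = my_predictions
template.to_csv("my_predictions.csv", index=False)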

Unfortunately we will not be able to troubleshoot any installation issues until after the competition (and this Conda issue seems like a particularly knotty problem).

RTSRLLC commented 1 year ago

Thank you for your response, Mr. Parrott. Very kind of you; I know that you are very busy. I am having trouble understanding the 'votes' target column: self._infos['majority'] = np.argmax(self._infos.values, axis=1). I'm more or less self-taught, so I'm sorry if these questions seem trivial, and I would love to submit predictions to this contest. I work in PyCharm and TensorFlow. I converted the Columbia class code provided on Model Share from PyTorch to TensorFlow, and I will include that code block at the bottom if you or one of your students would like to review it.
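To check that I'm reading that line correctly, here is a tiny example of what I think it does; the vote counts are made up:

import numpy as np
import pandas as pd

CLASSES = ['Other', 'Seizure', 'LPD', 'GPD', 'LRDA', 'GRDA']

# made-up vote counts: one row per 50-second segment, one column per class
votes = pd.DataFrame([[1, 7, 0, 2, 0, 0],
                      [5, 0, 0, 0, 1, 1],
                      [0, 0, 3, 3, 0, 4]])

# 'majority' is the column index of the class with the most votes in each row
votes['majority'] = np.argmax(votes.values, axis=1)
print([CLASSES[i] for i in votes['majority']])  # ['Seizure', 'Other', 'GRDA']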

These lines of code:

if __name__ == '__main__':

    # Define your data paths
    training_data = '/Users/jshensley/Desktop/Harvard_contest_DONTDELETE/Harvard_data_sets/Harvard_training/batch01'
    valid_data = '/Users/jshensley/Desktop/Harvard_contest_DONTDELETE/Harvard_data_sets/Havard_validation/batch03'
    test_data = '/Users/jshensley/Desktop/Harvard_contest_DONTDELETE/Harvard_data_sets/Harvard_test/batch04'

    # Create the dataset objects for each data split
    # def __init__(self, variant_tensor, split, split_ratio=None, seed=42, debug=True, data_dir=None):
    train_dataset = ColumbiaData('train', data_dir=training_data).as_dataset()
    val_dataset = ColumbiaData('val', data_dir=valid_data).as_dataset()
    test_dataset = ColumbiaData('test', data_dir=test_data).as_dataset()

    # Apply batch and prefetch for efficient loading
    batch_size = 32  # You can change the batch size as needed
    train_dataset = train_dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
    val_dataset = val_dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
    test_dataset = test_dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)

    train_subjects = train_dataset.reduce(0, lambda x, _: x + 1)
    val_subjects = val_dataset.reduce(0, lambda x, _: x + 1)
    test_subjects = test_dataset.reduce(0, lambda x, _: x + 1)

    print(f"number of train subjects: {train_subjects.numpy()}")
    print(f"number of val subjects: {val_subjects.numpy()}")
    print(f"number of test subjects: {test_subjects.numpy()}")

    for sample in train_dataset:
        print("=" * 80)
        print("=" * 80)
        print(f"sample is:\n{type(sample)}")
        data = sample['data']
        target = sample['target']
        index = sample['index']
        print(f"Sample index: {index[:3]}\n")
        print(f"Data shape: {data.shape} Data type: {type(data.numpy())}\n")
        print(f"Target shape: {target.shape}")
        print("=" * 80)

produce this:

[Screenshot: console output, Image 7-11-23 at 15 30]

The target column size is equal to the batch size. If I put print statements in the Columbia Class where the votes are processed:

print("Reading labels...")
_all = [_read_labels(_path) for _path in tqdm.tqdm(_all)]
self._infos = pd.DataFrame([_['votes'].astype(int) for _ in _all])
print("Creating y_train...")
self._infos['majority'] = np.argmax(self._infos.values, axis=1)
print(self._infos.head(), self._infos.shape, sep='\n')
for _col in ['subject_ID', 'key', 'path']:
    self._infos[_col] = [_[_col] for _ in _all]
    print(self._infos.head(), self._infos.shape, sep='\n')

[Screenshot: console output, Image 7-11-23 at 15 45]

I'm sure I'm missing something obvious; I really want to train my models on this data, and I'm running out of time. Do you or any of your students/assistants have any insight that could help me get these target columns attached to the X_dfs? I'm having no problems getting the X_trains, only the y_trains.

Thank you for your time. Here is the complete code should you or anyone like to review it. Scott

import glob
import os

import mat73
import numpy as np
import pandas as pd
import persist_to_disk as ptd
import scipy.stats
import tqdm
import tensorflow as tf
from abc import ABC, abstractmethod
from scipy.signal import butter, filtfilt, iirnotch

training_data = '/Users/jshensley/Desktop/Harvard_contest_DONTDELETE/Harvard_data_sets/Harvard_training'

valid_data = '/Users/jshensley/Desktop/Harvard_contest_DONTDELETE/Harvard_data_sets/Havard_validation'

test_data = '/Users/jshensley/Desktop/Harvard_contest_DONTDELETE/Harvard_data_sets/Harvard_test'

SAMPLING_RATE = 200
CLASSES = ['Other', 'Seizure', 'LPD', 'GPD', 'LRDA', 'GRDA']

TRAIN = 'train'
VALID = 'val'
TEST = 'test'
#                        0      1     2     3     4     5     6     7     8     9     10    11     12    13    14    15    16    17    18
og_cols_from_og_data = ['Fp1', 'F7', 'T3', 'T5', 'O1', 'F3', 'C3', 'P3', 'Fz', 'Cz', 'Pz', 'Fp2', 'F8', 'T4', 'T6', 'O2', 'F4', 'C4', 'P4']
numz_n_cols = list(enumerate(og_cols_from_og_data))

# Harvard_cols below

Harvard_cols = ['Fp1-O1', 'O1-F3', 'F3-C3', 'C3-P3', 'Fp2-O2', 'O2-F4', 'F4-C4', 'C4-P4', 'Fp1-F7', 'F7-T3', 'T3-T5', 'T5-P3', 'Fp2-F8', 'F8-T4', 'T4-T6', 'T6-P4']

def channel_transform(X):
    # to bi-polar signals
    temp = np.zeros_like(X)
    temp[0] = X[0] - X[4]  # Fp1 - O1
    temp[1] = X[4] - X[5]  # O1 - F3
    temp[2] = X[5] - X[6]  # F3 - C3
    temp[3] = X[6] - X[7]  # C3 - P3
    temp[4] = X[11] - X[15]  # Fp2 - O2
    temp[5] = X[15] - X[16]  # O2 - F4
    temp[6] = X[16] - X[17]  # F4 - C4
    temp[7] = X[17] - X[18]  # C4 - P4
    temp[8] = X[0] - X[1]  # Fp1 - F7
    temp[9] = X[1] - X[2]  # F7 - T3
    temp[10] = X[2] - X[3]  # T3 - T5
    temp[11] = X[3] - X[7]  # T5 - P3
    temp[12] = X[11] - X[12]  # Fp2 - F8
    temp[13] = X[12] - X[13]  # F8 - T4
    temp[14] = X[13] - X[14]  # T4 - T6
    temp[15] = X[14] - X[18]  # T6 - P4
    return temp[:16].astype("float64")

def denoise_channel(ts, bandpass, notch_freq, signal_freq):
    """ bandpass: (low, high) """
    nyquist_freq = 0.5 * signal_freq
    filter_order = 2

    low = bandpass[0] / nyquist_freq
    high = bandpass[1] / nyquist_freq
    b, a = butter(filter_order, [low, high], btype="band")
    ts_out = filtfilt(b, a, ts)

    quality_factor = 30.0
    b_notch, a_notch = iirnotch(notch_freq, quality_factor, signal_freq)
    ts_out = filtfilt(b_notch, a_notch, ts_out)

    return np.array(ts_out)

def _read_and_transform_x(data_path, sampling_rate=SAMPLING_RATE):
    """This function reads the data from the mat file and transforms it.
    You could customize your own transforms instead of using this sequence.
    """
    x = mat73.loadmat(data_path)['data_50sec']
    # Step 1: Take the middle 10 seconds out of 50 seconds
    x = x[:, sampling_rate * (20): sampling_rate * (30)]
    # Step 2: transform to bi-polar signals
    x = channel_transform(x)
    # print(f"channel_transform X.shape line 82: {x.shape}")
    # Step 3: perform bandpass and notch filter
    # x = denoise_channel(x, [0.5, 40.0], 60.0, sampling_rate)
    return x

def _read_labels(data_path):
    data_dict = mat73.loadmat(data_path)
    print(f"Line 87 data_dict.keys(): {data_dict.keys()}")
    return {'subject_ID': data_dict['subject_ID'], 'votes': data_dict['votes'],
            'path': data_path, 'key': os.path.basename(data_path).split('.')[0]}

def get_split_indices(seed, split_ratio, n, names=None):
    """Compute the split indices for a given seed and split ratio."""
    if names is None:
        names = [TRAIN, VALID, TEST]
    assert len(split_ratio) in {2, 3}
    perm = np.random.RandomState(seed).permutation(n)
    split_ratio = np.asarray(split_ratio).cumsum() / sum(split_ratio)
    cuts = [int(_s * n) for _s in split_ratio]
    return {
        names[i]: perm[cuts[i - 1]:cuts[i]] if i > 0 else perm[:cuts[0]]
        for i in range(len(split_ratio))
    }

def create_tf_dataset(X, y):
    """ X shape must be """
    return tf.data.Dataset.from_tensor_slices((X, y))

class ColumbiaData(tf.data.Dataset):
    DATASET = 'IIIC'
    CLASSES = CLASSES
    LABEL_MAP = {_n: _i for _i, _n in enumerate(CLASSES)}

    def __init__(self, variant_tensor, split=None, split_ratio=None, seed=42, debug=True, data_dir=None):
        if split_ratio is None:
            split_ratio = [0.6, 0.2, 0.2]
        super(ColumbiaData, self).__init__(variant_tensor)  # Pass the variant_tensor argument to super()

        _all = glob.glob(f"{data_dir}/*.mat")
        if debug:
            _all = [f for i, f in enumerate(_all) if i % 50 == 0]
        print("Reading labels...")
        _all = [_read_labels(_path) for _path in tqdm.tqdm(_all)]
        self._infos = pd.DataFrame([_['votes'].astype(int) for _ in _all])
        print("Creating y_train...")
        self._infos['majority'] = np.argmax(self._infos.values, axis=1)
        print(self._infos.head(), self._infos.shape, sep='\n')
        for _col in ['subject_ID', 'key', 'path']:
            self._infos[_col] = [_[_col] for _ in _all]
            print(self._infos.head(), self._infos.shape, sep='\n')

        # Create and pick the corresponding split
        if split is not None:
            print("Splitting Patients...")
            PID_COL = 'subject_ID'
            pids = sorted(self._infos[PID_COL].unique())
            pid_indices = get_split_indices(seed, split_ratio, len(pids))[split]
            pids = [pids[i] for i in pid_indices]
            self._infos = self._infos[self._infos[PID_COL].isin(pids)]
        print("Reading signals...")
        self.X = {row['key']: _read_and_transform_x(row['path']) for idx, row in
                  tqdm.tqdm(self._infos.iterrows(), total=len(self._infos))}

        self.majority_only = True

    def _normalized(self, x, norm=2.5):
        # This is sample-wise rescale.
        # Recording-wise normalization might improve results.
        lb = np.percentile(x, norm)
        ub = np.percentile(x, 100 - norm)
        x = x / np.clip(ub - lb, 1e-3, None)
        return x

    def _generator(self):
        for idx in range(len(self._infos)):
            record = self._infos.iloc[idx]
            key = record['key']
            X = self._normalized(self.X[key])
            target = record['majority']
            if not self.majority_only:
                V = record.reindex(columns=range(len(self.CLASSES))).values.astype(float)
                entropy = scipy.stats.entropy(V)
                target = np.asarray([target, entropy] + list(V))
            yield {'data': X, 'target': np.expand_dims(target, axis=0), 'index': key}

    def as_dataset(self):
        # Construct a data pipeline using tf.data.Dataset
        return tf.data.Dataset.from_generator(
            self._generator,
            output_signature={
                'data': tf.TensorSpec(shape=(16, 2000), dtype=tf.float64),
                'target': tf.TensorSpec(shape=(None,), dtype=tf.float64),
                'index': tf.TensorSpec(shape=(), dtype=tf.string),
            }
        )

    def _inputs(self):
        return []

    @property
    def element_spec(self):
        return {
            'data': tf.TensorSpec(shape=(16, 2000), dtype=tf.float32),
            'target': tf.TensorSpec(shape=(None,), dtype=tf.float32),
            'index': tf.TensorSpec(shape=(), dtype=tf.string),
        }

if __name__ == '__main__':

    # Define your data paths
    training_data = '/Users/jshensley/Desktop/Harvard_contest_DONTDELETE/Harvard_data_sets/Harvard_training/batch01'
    valid_data = '/Users/jshensley/Desktop/Harvard_contest_DONTDELETE/Harvard_data_sets/Havard_validation/batch03'
    test_data = '/Users/jshensley/Desktop/Harvard_contest_DONTDELETE/Harvard_data_sets/Harvard_test/batch04'

    # Create the dataset objects for each data split
    # def __init__(self, variant_tensor, split, split_ratio=None, seed=42, debug=True, data_dir=None):
    train_dataset = ColumbiaData('train', data_dir=training_data).as_dataset()
    val_dataset = ColumbiaData('val', data_dir=valid_data).as_dataset()
    test_dataset = ColumbiaData('test', data_dir=test_data).as_dataset()

    # Apply batch and prefetch for efficient loading
    batch_size = 32  # You can change the batch size as needed
    train_dataset = train_dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
    val_dataset = val_dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
    test_dataset = test_dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)

    train_subjects = train_dataset.reduce(0, lambda x, _: x + 1)
    val_subjects = val_dataset.reduce(0, lambda x, _: x + 1)
    test_subjects = test_dataset.reduce(0, lambda x, _: x + 1)

    print(f"number of train subjects: {train_subjects.numpy()}")
    print(f"number of val subjects: {val_subjects.numpy()}")
    print(f"number of test subjects: {test_subjects.numpy()}")

    for sample in train_dataset:
        print("=" * 80)
        print("=" * 80)
        print(f"sample is:\n{type(sample)}")
        data = sample['data']
        target = sample['target']
        index = sample['index']
        print(f"Sample index: {index[:3]}\n")
        print(f"Data shape: {data.shape} Data type: {type(data.numpy())}\n")
        print(f"Target shape: {target.shape}")
        print("=" * 80)

    X_train_df_list = []
    # Define column names
    Harvard_cols = ['Fp1-O1', 'O1-F3', 'F3-C3', 'C3-P3', 'Fp2-O2', 'O2-F4', 'F4-C4', 'C4-P4', 'Fp1-F7', 'F7-T3',
                    'T3-T5', 'T5-P3', 'Fp2-F8', 'F8-T4', 'T4-T6', 'T6-P4']
    for sample in train_dataset:
        # Reshape the data and create a DataFrame for each sample
        data_i = tf.reshape(sample['data'], (-1, 16)).numpy()
        # Convert the target data to a numpy array
        target_i = tf.reshape(sample['target'], (-1,)).numpy()
        # Create a DataFrame with the data
        df_i = pd.DataFrame(data_i, columns=Harvard_cols)
        df_i['Target'] = target_i  # Add the target column
        X_train_df_list.append(df_i)