Eliorkalfon / single_cell_pb

Deep learning models for the Kaggle Open Problems – Single-Cell Perturbations competition

A quick question about prediction #9

Open HelloWorldLTY opened 5 months ago

HelloWorldLTY commented 5 months ago

Hi, thanks for your great work. After running your training step, I tried to reproduce the prediction process:

{'n_components_list': [18211], 'd_models_list': [128], 'batch_size': 32, 'data_file': 'de_train.parquet', 'id_map_file': 'id_map.csv', 'device': 'cuda', 'seed': None, 'models_dir': 'trained_models'}
      id  A1BG  A1BG-AS1  A2M  A2M-AS1  A2MP1  A4GALT  AAAS  AACS  AAGAB  AAK1  AAMDC  ...  ZSWIM8  ZSWIM9  ZUP1  ZW10  ZWILCH  ZWINT  ZXDA  ZXDB  ZXDC  ZYG11B  ZYX  ZZEF1
0      0   0.0       0.0  0.0      0.0    0.0     0.0   0.0   0.0    0.0   0.0    0.0  ...     0.0     0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0     0.0  0.0    0.0
1      1   0.0       0.0  0.0      0.0    0.0     0.0   0.0   0.0    0.0   0.0    0.0  ...     0.0     0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0     0.0  0.0    0.0
2      2   0.0       0.0  0.0      0.0    0.0     0.0   0.0   0.0    0.0   0.0    0.0  ...     0.0     0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0     0.0  0.0    0.0
3      3   0.0       0.0  0.0      0.0    0.0     0.0   0.0   0.0    0.0   0.0    0.0  ...     0.0     0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0     0.0  0.0    0.0
4      4   0.0       0.0  0.0      0.0    0.0     0.0   0.0   0.0    0.0   0.0    0.0  ...     0.0     0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0     0.0  0.0    0.0
..   ...   ...       ...  ...      ...    ...     ...   ...   ...    ...   ...    ...  ...     ...     ...   ...   ...     ...    ...   ...   ...   ...     ...  ...    ...
250  250   0.0       0.0  0.0      0.0    0.0     0.0   0.0   0.0    0.0   0.0    0.0  ...     0.0     0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0     0.0  0.0    0.0
251  251   0.0       0.0  0.0      0.0    0.0     0.0   0.0   0.0    0.0   0.0    0.0  ...     0.0     0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0     0.0  0.0    0.0
252  252   0.0       0.0  0.0      0.0    0.0     0.0   0.0   0.0    0.0   0.0    0.0  ...     0.0     0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0     0.0  0.0    0.0
253  253   0.0       0.0  0.0      0.0    0.0     0.0   0.0   0.0    0.0   0.0    0.0  ...     0.0     0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0     0.0  0.0    0.0
254  254   0.0       0.0  0.0      0.0    0.0     0.0   0.0   0.0    0.0   0.0    0.0  ...     0.0     0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0     0.0  0.0    0.0

[255 rows x 18212 columns]
(1, 31, 18211)
Traceback (most recent call last):
  File "predict.py", line 87, in <module>
    main()
  File "predict.py", line 83, in main
    predict_test(unseen_data, transformer_models, n_components_list, d_models_list, batch_size, device=device)
  File "/gpfs/gibbs/project/zhao/tl688/conda_envs/openproblem/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "predict.py", line 33, in predict_test
    submission_df.insert(0, 'id', range(255))
  File "/gpfs/gibbs/project/zhao/tl688/conda_envs/openproblem/lib/python3.8/site-packages/pandas/core/frame.py", line 4776, in insert
    value = self._sanitize_column(value)
  File "/gpfs/gibbs/project/zhao/tl688/conda_envs/openproblem/lib/python3.8/site-packages/pandas/core/frame.py", line 4870, in _sanitize_column
    com.require_length_match(value, self.index)
  File "/gpfs/gibbs/project/zhao/tl688/conda_envs/openproblem/lib/python3.8/site-packages/pandas/core/common.py", line 576, in require_length_match
    raise ValueError(
ValueError: Length of values (255) does not match length of index (31)

However, I ran into the error shown above. It seems that the output combined_emb has 31 rows, while the target sample submission has 255. Is there anything wrong?
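For reference, the ValueError itself is just pandas refusing a column whose length differs from the frame's index; it can be reproduced minimally (31 prediction rows, a 255-element id column — the column names here are stand-ins):

```python
import numpy as np
import pandas as pd

# A DataFrame built from only 31 prediction rows cannot accept
# an 'id' column of length 255 — pandas raises ValueError.
df = pd.DataFrame(np.zeros((31, 3)), columns=["A1BG", "A1BG-AS1", "A2M"])
try:
    df.insert(0, "id", range(255))
except ValueError as e:
    print(e)  # Length of values (255) does not match length of index (31)
```

So the real question is why combined_outputs ends up with 31 rows instead of 255 before the insert is attempted.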

My config file looks like:

n_components_list: # targets dimension list
  - 18211
d_models_list:
  - 128
batch_size: 32
data_file: 'de_train.parquet'
id_map_file: 'id_map.csv'
device: cuda
seed: null
models_dir: 'trained_models'

Thanks a lot.

HelloWorldLTY commented 5 months ago

Hi, I modified the test code to the following:

import pandas as pd
import torch

@torch.no_grad()
def predict_test(data, models, n_components_list, d_list, batch_size, device='cuda'):
    num_samples = len(data)
    for n_components in n_components_list:
        for d_model in d_list:
            combined_outputs = []
            label_reducer, scaler, transformer_model = models[f'{n_components},{d_model}']
            transformer_model.eval()
            # Iterate over the full test set in batches (the batch loop
            # variable must not shadow the outer loop variables)
            for start in range(0, num_samples, batch_size):
                batch_unseen_data = data[start:start + batch_size]
                transformed_data = transformer_model(batch_unseen_data)
                if scaler:
                    # Map predictions back from the reduced space to gene space
                    transformed_data = torch.tensor(scaler.inverse_transform(
                        label_reducer.inverse_transform(transformed_data.cpu().detach().numpy()))).to(device)
                combined_outputs.append(transformed_data)

            # Stack the per-batch outputs into one (num_samples, n_genes) matrix
            combined_outputs = torch.vstack(combined_outputs)
            sample_submission = pd.read_csv("./sample_submission.csv")
            sample_columns = sample_submission.columns[1:]  # drop the 'id' column
            submission_df = pd.DataFrame(combined_outputs.cpu().detach().numpy(), columns=sample_columns)
            submission_df.insert(0, 'id', range(len(submission_df)))
            submission_df.to_csv(f"result_{n_components}_{d_model}.csv", index=False)

Then I get a matrix with shape 255 × 18211. Is it correct? Thanks.
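The batching logic can be sanity-checked without the real model: stacking per-batch outputs should recover all 255 test rows. Below is a sketch with random tensors standing in for the model and a small gene count (the real run has 18211 genes):

```python
import torch

# Stand-in for the real test set: 255 samples, 4 "genes"
num_samples, batch_size, n_genes = 255, 32, 4
data = torch.randn(num_samples, n_genes)

outputs = []
for start in range(0, num_samples, batch_size):
    batch = data[start:start + batch_size]  # last batch has 31 rows
    outputs.append(batch)                   # stand-in for transformer_model(batch)

# vstack recombines 7 full batches of 32 plus a final batch of 31
combined = torch.vstack(outputs)
print(combined.shape)  # torch.Size([255, 4])
```

Note that the final batch (rows 224–254) has only 31 rows, which matches the length-31 index in the original traceback — consistent with the original loop writing out only the last batch instead of the stacked result.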