Alcoholrithm / TabularS3L

A PyTorch Lightning-based library for self- and semi-supervised learning on tabular data.
MIT License

Out of Index Error #9

Closed jeetsh4h closed 6 months ago

jeetsh4h commented 6 months ago
metric = "mean_squared_error"
input_dim = X_train.shape[1]
hidden_dim = 128
output_dim = 1

encoder_depth = 3
n_head = 2
u_label = -1

batch_size = 32

config = SwitchTabConfig( 
    task="regression", loss_fn="CrossEntropyLoss",
    metric=metric, metric_hparams={},
    input_dim=input_dim, hidden_dim=hidden_dim,
    output_dim=output_dim, encoder_depth=encoder_depth,
    n_head=n_head, u_label=u_label
)

pl_switchtab = SwitchTabLightning(config)

### First Phase Learning
train_ds = SwitchTabDataset(
    X=X_train, unlabeled_data=X_unlabeled, 
    Y=y_train.values, config=config, 
    continuous_cols=continuous_cols, 
    category_cols=category_cols, is_regression=True
)
valid_ds = SwitchTabDataset(
    X=X_val, config=config, 
    Y=y_val.values, continuous_cols=continuous_cols, 
    category_cols=category_cols, is_regression=True
)

datamodule = TS3LDataModule(
    train_ds, valid_ds, batch_size, 
    train_sampler='weighted',
    train_collate_fn=SwitchTabFirstPhaseCollateFN(), 
    valid_collate_fn=SwitchTabFirstPhaseCollateFN()
)

trainer = Trainer(
    accelerator='cpu',
    max_epochs=20,
    num_sanity_val_steps=2,
)

trainer.fit(pl_switchtab, datamodule)

The error that I have received is as follows:

IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/ts3l/utils/switchtab_utils/data_utils.py", line 105, in __getitem__
    return self.getitem(idx)
  File "/usr/local/lib/python3.10/dist-packages/ts3l/utils/switchtab_utils/data_utils.py", line 116, in first_phase_get_item
    x_1 = self.data[idx]
IndexError: index 4754 is out of bounds for dimension 0 with size 3341

I am using the abalone dataset from UCI to test the SwitchTab model. It would be really useful if you could release the code snippets used to generate the README results that compare the models with each other. Better yet, it would help to set up a docs website where we can contribute documentation for the APIs, especially for the classes in utils.

Alcoholrithm commented 6 months ago

Hi @jeetsh4h,

Thank you for reporting the issue. The error you encountered is due to the train sampler setting. For a regression task, the train sampler should be set to random instead of weighted.

Additionally, there is another potential issue: the loss function is set to CrossEntropyLoss, which is not appropriate for regression tasks. You should use a regression-specific loss function such as 'MSELoss'.
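
Both mismatches can be caught before training with a small sanity check on the settings. The sketch below is illustrative only: `check_regression_settings` and `CLASSIFICATION_LOSSES` are hypothetical names, not part of TabularS3L, and the check encodes nothing beyond the two points above (use the `random` sampler and a regression loss such as `MSELoss`).

```python
# Hypothetical helper (not part of TabularS3L): flags the two mismatches
# discussed above when the task is regression.
CLASSIFICATION_LOSSES = {"CrossEntropyLoss", "NLLLoss", "BCELoss", "BCEWithLogitsLoss"}


def check_regression_settings(task, loss_fn, train_sampler):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if task != "regression":
        return problems
    if loss_fn in CLASSIFICATION_LOSSES:
        problems.append(f"loss_fn={loss_fn!r} is a classification loss; use e.g. 'MSELoss'")
    if train_sampler == "weighted":
        problems.append("train_sampler='weighted' is class-based; use 'random'")
    return problems


# Settings taken from the original report
for problem in check_regression_settings("regression", "CrossEntropyLoss", "weighted"):
    print(problem)
```

With the reported settings the helper flags both arguments; with `loss_fn="MSELoss"` and `train_sampler="random"` it returns an empty list.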

Also, all the code used to generate the results in the README file is available in the benchmark folder. Please clone the repository and check it out for more details.

Regarding your request for documentation, I understand the need for more comprehensive documentation. Currently, I'm busy and unable to work on it right away, but I will prioritize this when I have time.

If you have any further questions, feel free to reach out.

jeetsh4h commented 6 months ago

Thank you so much! I encountered another index-out-of-range error:

IndexError                                Traceback (most recent call last)
in <cell line: 8>()
      6     category_cols=category_cols
      7 )
----> 8 valid_ds = SwitchTabDataset(
      9     X=X_val, config=config,
     10     Y=y_val.values, continuous_cols=continuous_cols,

1 frames
/usr/local/lib/python3.10/dist-packages/ts3l/utils/switchtab_utils/data_utils.py in <listcomp>(.0)
     74
     75             class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
---> 76             self.weights = [class_weights[self.label[i]] for i in range(int(num_samples))]
     77         else:
     78             self.weights = [1.0 for _ in range(len(X))]

IndexError: list index out of range

I think this happens because SwitchTabDataset assumes that the class labels are consecutive integers starting from 0. When I pre-processed the wine dataset on my end, the class labels ended up being 1, 2, 3, so the label 3 indexes past the end of the three-element class_weights list and the error shows up every time.
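
The failure can be reproduced outside the library with a minimal sketch of the weighting pattern visible in the traceback. `per_sample_weights` is a hypothetical stand-in, and the way `class_counts` is built here (one count per distinct label, ordered by label value) is an assumption, since the traceback does not show that part of the library code.

```python
from collections import Counter


def per_sample_weights(labels):
    """Mimic the pattern in the traceback: one weight per class,
    looked up by using the label itself as a list index."""
    num_samples = len(labels)
    counts = Counter(labels)
    # Assumption: one count per distinct label, ordered by label value
    class_counts = [counts[c] for c in sorted(counts)]
    class_weights = [num_samples / class_counts[i] for i in range(len(class_counts))]
    return [class_weights[label] for label in labels]


# Labels 0..2: every label is a valid index into the 3-element weight list
print(per_sample_weights([0, 2, 0, 1, 1, 1]))

# Labels 1..3: label 3 points past the end of the 3-element weight list
try:
    per_sample_weights([1, 3, 1, 2, 2, 2])
except IndexError as err:
    print("IndexError:", err)
```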

The code that triggers the error:

# y_val looks like this:
# [1, 3, 1, 2, 2, 2, 1, 3, 3, 1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 1,
#      3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 3, 1, 3]

valid_ds = SwitchTabDataset(
    X=X_val, config=config, 
    Y=y_val.values, continuous_cols=continuous_cols, 
    category_cols=category_cols
)

Alcoholrithm commented 6 months ago

In the PyTorch community, class labels are typically expected to start from 0. For instance, CrossEntropyLoss assumes that the range of given labels is [0, C) where C is the number of classes. Given this standard, I don't think there should be an issue.
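
Labels that do not already lie in [0, C) can be remapped before building the dataset. The following is a minimal sketch in plain Python; `remap_labels` is a hypothetical helper, not part of TabularS3L or PyTorch.

```python
def remap_labels(labels):
    """Map arbitrary class labels to 0..C-1, preserving their sorted order."""
    classes = sorted(set(labels))
    to_index = {c: i for i, c in enumerate(classes)}
    return [to_index[c] for c in labels], to_index


# One-based labels like the ones in the report
y_val = [1, 3, 1, 2, 2, 2, 1, 3, 3, 1]
y_zero_based, mapping = remap_labels(y_val)
print(mapping)       # which original label maps to which index
print(y_zero_based)  # labels now lie in [0, C)
```

In practice the mapping should be built once from the training labels and reused for the validation and test splits, so that every split shares the same encoding. scikit-learn's `LabelEncoder` provides the same behavior for array inputs.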

jeetsh4h commented 6 months ago

I wasn't aware of the PyTorch convention. Thank you for your help!