Issue of Loading Data - Githubissues

georgian-io / Multimodal-Toolkit

Multimodal model for text and tabular data with HuggingFace transformers as building block for text data

Apache License 2.0

587 stars, 84 forks

Hi there,

When I use load_data_from_folder from multimodal_transformers.data to load train/val/test CSV files that I split myself, the returned train_dataset/val_dataset/test_dataset have strange lengths that bear no relation to the row counts of the original CSV files. For example, train_df.shape = (105195, 25), while train_dataset.cat_feats.shape = (131495, 38).

To split the dataset, I tried both train_test_split and np.split, and both lead to the same loading problem.

However, if I follow the exact same code as in the notebook to split the dataset, load_data_from_folder works fine. In addition, if I modify one column, for example by mapping its numeric values to letters ([0,2,0...] to [A,B,A...]), the data again fails to load correctly.
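
One way to narrow this down is to check whether the CSV files themselves round-trip with the expected row counts before they reach load_data_from_folder. The sketch below is only an illustration under several assumptions: the helper names split_and_save and check_round_trip are hypothetical, the toy DataFrame stands in for the real data, plain iloc slicing stands in for train_test_split / np.split, the files are named train.csv, val.csv and test.csv, and they are written with pandas' default index column and read back with index_col=0. Whether the loader expects exactly that layout is not confirmed here.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_and_save(df, folder, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle df, cut it into train/val/test, and write one CSV per split.

    Hypothetical helper: file names and the default index column are assumptions.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = int(train_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    splits = {
        "train": shuffled.iloc[:n_train],
        "val": shuffled.iloc[n_train:n_train + n_val],
        "test": shuffled.iloc[n_train + n_val:],
    }
    for name, part in splits.items():
        # Default to_csv writes the index as the first, unnamed column.
        part.to_csv(folder / f"{name}.csv")
    return splits


def check_round_trip(splits, folder):
    """Re-read each CSV with pandas and pair the in-memory and on-disk shapes."""
    report = {}
    for name, part in splits.items():
        reread = pd.read_csv(Path(folder) / f"{name}.csv", index_col=0)
        report[name] = (part.shape, reread.shape)
    return report


# Toy data; one text cell contains a comma and a newline on purpose.
toy = pd.DataFrame({
    "text": ["plain review"] * 9 + ["multi-line,\nwith a comma"],
    "category": ["A", "B"] * 5,
    "label": [0, 1] * 5,
})

with tempfile.TemporaryDirectory() as tmp:
    splits = split_and_save(toy, tmp)
    report = check_round_trip(splits, tmp)

for name, (expected, actual) in report.items():
    print(name, "in memory:", expected, "re-read:", actual)
```

If the re-read shapes already differ from the in-memory ones, the problem is in how the files are written or read, not in the toolkit; if they match, the difference is more likely introduced inside the loader.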

Does anyone have any suggestions?

Issue of Loading Data #60