georgian-io / Multimodal-Toolkit

Multimodal model for text and tabular data with HuggingFace transformers as building block for text data
https://multimodal-toolkit.readthedocs.io
Apache License 2.0
587 stars 84 forks

Issue of Loading Data #60

Closed hytting closed 9 months ago

hytting commented 9 months ago

Hi there,

When I use load_data_from_folder from multimodal_transformers.data to load train/val/test CSV files that I split myself, the returned train_dataset/val_dataset/test_dataset have unexpected lengths that do not match the original CSV files. For example, train_df.shape = (105195, 25), while train_dataset.cat_feats.shape = (131495, 38).

To split the dataset, I tried both train_test_split and np.split, and both led to the same loading problem.

However, if I follow the exact splitting code from the notebook, load_data_from_folder works correctly. Also, if I modify one column, for example by mapping numbers to letters ([0,2,0...] to [A,B,A...]), the data again fails to load correctly.

Does anyone have any suggestions?

hytting commented 9 months ago

Hi, I have found the problem. When I ran pip install multimodal-transformers, version 0.11a0 was somehow installed instead of the latest one. Version 0.11a0 has a bug in load_data.py that is fixed in the newest version, which uses train_df=data_df.iloc[:len_train]. (The old version used df.loc[train_df.index].)
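
A minimal sketch of why label-based selection can return too many rows. It assumes the loader concatenates the three splits into one frame before slicing them back apart, and that each CSV is read back with a fresh 0..n-1 index, so labels repeat across splits. The toy frames and the column name x are hypothetical; only the .loc and .iloc expressions mirror the ones quoted above.

```python
import pandas as pd

# Hypothetical toy splits. Each frame gets its own 0..n-1 index,
# as it would after being read back from a separate CSV file.
train_df = pd.DataFrame({"x": range(6)})
val_df = pd.DataFrame({"x": range(6, 8)})
test_df = pd.DataFrame({"x": range(8, 10)})

# Concatenating without resetting the index leaves duplicate labels:
# labels 0 and 1 now each appear three times.
data_df = pd.concat([train_df, val_df, test_df])

# Label-based selection returns every row whose label matches,
# so rows from val and test leak into the "train" slice.
by_label = data_df.loc[train_df.index]

# Position-based selection takes exactly the first len(train_df) rows.
len_train = len(train_df)
by_position = data_df.iloc[:len_train]

print(len(train_df), len(by_label), len(by_position))
```

Here the label-based slice has 10 rows instead of 6, while the positional slice has the expected 6. The same effect would explain a train set growing from 105195 to 131495 rows if the val and test files together held 26300 rows, though the sizes of those files are not given above.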

So I manually changed the py file and it works now.
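
Before patching anything by hand, it may help to confirm which version pip actually installed. A small sketch using only the standard library; the helper name installed_version is hypothetical, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version, or None if the package is not installed.
print(installed_version("multimodal-transformers"))
```

If this prints the affected pre-release, upgrading the package is likely a cleaner fix than editing the installed file.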