I am trying to create a multimodal model using the MELD dataset. After a large number of tries, using either the provided features or features extracted by me (openSMILE and wav2vec for audio, simple textual approaches for text), I always get poor results, really far from the ones described in the paper. Today I decided to load the audio and text features obtained by the bcLSTM model, concatenate them just as shown in the baseline file, and use them as input to a new bcLSTM. Basically, I copy-pasted what the authors did, and the results are still bad. Has anyone faced this problem, or are the provided features actually that poor? I also tried applying the same methods and models to the features I processed myself, and the results are still bad. I am doing my master's thesis on multimodal emotion recognition and this dataset is not helping at all.
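For reference, this is roughly the fusion step I mean (a minimal sketch, not the actual baseline code; the feature dimensions below are illustrative placeholders, not the real MELD feature sizes): load the precomputed utterance-level text and audio features and concatenate them along the feature axis before feeding them to the bcLSTM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the loaded features; in practice these would come
# from the pickle files shipped with the baseline repo.
# Shapes are (num_utterances, feature_dim) and are illustrative only.
text_feats = rng.standard_normal((1039, 600)).astype(np.float32)
audio_feats = rng.standard_normal((1039, 300)).astype(np.float32)

# Early fusion by concatenation, as in the baseline script.
fused = np.concatenate([text_feats, audio_feats], axis=-1)
print(fused.shape)
```

The fused array (here 900-dimensional per utterance) is then what goes into the sequential model.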