clarinsi / parlaspeech

Code for bootstrapping ASR datasets from parliamentary recordings and transcripts
Apache License 2.0

ID values are not unique #1

Open ir2718 opened 4 months ago

ir2718 commented 4 months ago

Hi,

I'm interested in using the ParlaSpeech dataset to fine-tune a transformer model. Before fine-tuning, I would like to bin the audio lengths for stratified sampling so that I can split the data into train, validation, and test sets. While doing this, I noticed that the values in the id column of the dataset are not unique.

import numpy as np
from datasets import load_dataset
data = load_dataset("classla/ParlaSpeech-HR", columns=["id"], split="train")
ids = np.array(data["id"])
print(len(ids), len(set(ids)))

# 867581 867573

Can you explain the reason for this?
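To see which values repeat, rather than only how many, something like the following could be used. This is a minimal sketch: `duplicated_values` is a hypothetical helper, and `toy_ids` is a made-up stand-in for `data["id"]`, not actual IDs from the dataset.

```python
from collections import Counter


def duplicated_values(values):
    """Return {value: count} for every value that occurs more than once."""
    counts = Counter(values)
    return {value: count for value, count in counts.items() if count > 1}


# Toy stand-in for data["id"]; the real IDs are not reproduced here.
toy_ids = ["a", "b", "a", "c", "b", "a"]
dups = duplicated_values(toy_ids)
print(dups)

# Number of surplus rows, i.e. len(ids) - len(set(ids)).
surplus = sum(count - 1 for count in dups.values())
print(surplus)
```

On the real data, the surplus should equal the difference between the two numbers printed above.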

nljubesi commented 4 months ago

Hi, the IDs refer to the textual part of the material. In the 8 cases you mention, the same sentence is mapped to a different part of the audio data. It is hard to say which of the two mappings is correct, and it is quite possible that, in the fascinating world of parliamentary debates, a series of sentences (we map speeches, not independent sentences) was pronounced twice.

We have not produced an official train:dev:test split yet, so we would be interested in what you come up with. In the previous iteration of the dataset we did produce a test set, stratified by gender, whose speakers occur in neither the training nor the development data. I think this is the way to go here as well. https://aclanthology.org/2022.parlaclarin-1.16.pdf

We will be happy to propagate your split to the dataset itself if it proves to be well thought through. You will get a mention as well! Exciting times, I know.

Would also not mind hearing more about your application / research.

ir2718 commented 4 months ago

I don't think I understand what you mean by IDs referring to the textual part of the data, as there are a lot more duplicates in the text columns.

len(dataset["train"]["text"])
# 867581
len(set(dataset["train"]["text"]))
# 801561
len(dataset["train"]["text_normalised"])
# 867581
len(set(dataset["train"]["text_normalised"]))
# 801558

Regarding the dataset split, my first idea was to stratify by the (previously binned) audio length. Now that you mention it, though, it does make sense to add gender and held-out speakers for the test set. I'll try concatenating the gender and bin values, and I'll try to think of a way to exclude certain speakers.
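A rough sketch of that idea, using only the standard library. Everything here is an assumption for illustration: the field names `speaker`, `gender`, and `duration`, the bin edges, the split fractions, and the helper names `length_bin` and `split_indices` are all hypothetical, and the rows are toy data rather than ParlaSpeech records. Whole speakers are first held out for the test set, sampled separately per gender. The remaining rows are then stratified into train and dev by the combined (gender, length bin) key.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def length_bin(duration, edges=(5.0, 10.0, 15.0)):
    """Map a duration in seconds to a bin index (the edges are assumed values)."""
    return bisect_right(edges, duration)


def split_indices(rows, test_speaker_frac=0.1, dev_frac=0.1, seed=0):
    """Return (train, dev, test) lists of row indices."""
    rng = random.Random(seed)

    # 1) Hold out whole speakers for the test set, sampled per gender.
    speakers_by_gender = defaultdict(set)
    for row in rows:
        speakers_by_gender[row["gender"]].add(row["speaker"])
    test_speakers = set()
    for gender in sorted(speakers_by_gender):
        speakers = sorted(speakers_by_gender[gender])
        k = max(1, round(len(speakers) * test_speaker_frac))
        test_speakers.update(rng.sample(speakers, k))

    # 2) Group the remaining rows by the combined (gender, length bin) key.
    test, strata = [], defaultdict(list)
    for i, row in enumerate(rows):
        if row["speaker"] in test_speakers:
            test.append(i)
        else:
            strata[(row["gender"], length_bin(row["duration"]))].append(i)

    # 3) Within each stratum, shuffle and split off a dev portion.
    train, dev = [], []
    for key in sorted(strata):
        indices = strata[key]
        rng.shuffle(indices)
        n_dev = round(len(indices) * dev_frac)
        dev.extend(indices[:n_dev])
        train.extend(indices[n_dev:])
    return train, dev, test


# Toy data: 20 speakers with alternating gender and 30 utterances each.
data_rng = random.Random(1)
rows = [
    {"speaker": f"spk{s}", "gender": "F" if s % 2 == 0 else "M",
     "duration": data_rng.uniform(1.0, 20.0)}
    for s in range(20) for _ in range(30)
]
train, dev, test = split_indices(rows)
print(len(train), len(dev), len(test))
```

Because `k` is at least 1, a gender with a single speaker would end up entirely in the test set, so small groups may need separate handling.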