facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

S2S aligned metadata "extension" is a subset of prior metadata release? #467

Open arlofaria-cartesia opened 3 months ago

arlofaria-cartesia commented 3 months ago

The metadata files in docs/m4t/seamless_align_README.md come in several dated revisions. From what I've checked (enA-ptA and enA-esA, at least), the "extension" from Nov 30 appears to be a pure subset of the earlier metadata published on Sep 25. Could someone double-check whether that's the case, and whether a different extension dataset was meant to be published instead?

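To verify, compare the number of unique lines in one file alone against the number of unique lines in both files combined. The counts match, so the second file contributes no lines that the first one lacks: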

> zcat seamless.dataset.metadata.public.enA-ptA.withduration.tsv.gz | sort -u | wc -l
5257334
> zcat seamless.dataset.metadata.public.enA-ptA.withduration.tsv.gz seamless.dataset.metadata.public.enA-ptA.tsv.gz | sort -u | wc -l
5257334
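
A more direct check than comparing counts is to list the lines that appear only in the second file. The sketch below does that with `comm`. Note that `count_new_lines` is a hypothetical helper name of my own, not something from the repo, and it assumes the two files can be compared line for line (same columns and formatting). A result of 0 means the second file is a subset of the first.

```shell
#!/bin/sh
# Hypothetical helper (not part of seamless_communication):
#   count_new_lines BASE.gz EXT.gz
# Prints the number of distinct lines in EXT.gz that do not occur in BASE.gz.
# Assumes both files are gzip-compressed text with directly comparable lines.
count_new_lines() {
    base_sorted=$(mktemp) || return 1
    ext_sorted=$(mktemp) || return 1

    # LC_ALL=C keeps sort and comm on the same byte-wise collation order.
    gzip -dc "$1" | LC_ALL=C sort -u > "$base_sorted"
    gzip -dc "$2" | LC_ALL=C sort -u > "$ext_sorted"

    # comm -13 drops lines unique to the first input and lines common to both,
    # leaving only the lines that occur in the second input alone.
    LC_ALL=C comm -13 "$base_sorted" "$ext_sorted" | wc -l | tr -d ' '

    rm -f "$base_sorted" "$ext_sorted"
}

# Example call (file names taken from the commands above):
# count_new_lines seamless.dataset.metadata.public.enA-ptA.withduration.tsv.gz \
#                 seamless.dataset.metadata.public.enA-ptA.tsv.gz
```

Swapping the two arguments gives the count in the other direction, which would show how many lines the first file has that the second does not.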