# 🎓M3AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset (ACL 2024)
[**📄Paper**](https://arxiv.org/abs/2403.14168) | [**🏠️Homepage**](https://jack-zc8.github.io/M3AV-dataset-page/) | [**📥Download**](./download/) | [**💎Demo**](./demo/) | [**🤖Benchmarks**](./benchmarks/)
## Overview

An overview of our 🎓M3AV dataset:
- The first component is slides annotated with simple and complex blocks, which are merged according to a set of rules.
- The second component is speech containing special vocabulary, spoken and written forms, and word-level timestamps.
- The third component is the paper corresponding to the video. The asterisk (*) denotes that only computer science videos have corresponding papers.
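As a concrete illustration of the word-level timestamps in the speech component, the sketch below joins word entries into a transcript and computes the segment duration. The schema (`word`/`start`/`end` keys) is purely hypothetical and does not necessarily match the official M3AV annotation format:

```python
# Hypothetical word-level annotation entries (illustrative schema only,
# NOT the official M3AV format): each word carries start/end times in seconds.
sample_speech = [
    {"word": "convolutional", "start": 0.00, "end": 0.82},
    {"word": "neural", "start": 0.82, "end": 1.10},
    {"word": "networks", "start": 1.10, "end": 1.63},
]

def words_to_text(words):
    """Join word-level entries into a transcript string and total duration."""
    text = " ".join(w["word"] for w in words)
    duration = words[-1]["end"] - words[0]["start"] if words else 0.0
    return text, duration

text, duration = words_to_text(sample_speech)
print(text)      # -> convolutional neural networks
print(duration)  # -> 1.63
```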
We download academic lectures from a variety of fields, ranging from Human-Computer Interaction and Biomedical Sciences to Mathematics, as shown in the table above.
## News
- [2024-08] 🤖All Benchmarks have been released!
- [2024-06] 🤖Benchmarks of LLaMA-2 and GPT-4 have been released!
- [2024-05] 🎉Our work has been accepted by ACL 2024 main conference!
- [2024-04] 🔥v1.0 has been released! We have further refined all speech data: the training set adopts text-normalised Whisper results, while the development and test sets employ a manual combination of Whisper and Microsoft STT results.
## Details

The `demo` folder contains a sample for demonstration.