emcf / thepipe

Extract markdown and images from URLs, PDFs, docs, slides, and more, ready for multimodal LLMs. ⚡
https://thepi.pe
MIT License

Video frame + transcript extraction #7

Status: Closed (emcf closed 2 months ago)

emcf commented 2 months ago

Looking to support extraction from mp4, mov, webm, and avi files, as well as YouTube videos, for use with a vision-language model (not a video model).

Video and audio are not standard inputs for commercial multimodal models today. Because of this, I am looking to extract frames as images and transcribe the audio to text.
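
One way the frame + transcript idea could be sketched is below. This is not thepipe's actual implementation or API. It assumes a speech-to-text step has already produced timestamped segments, and it only plans which frame to pair with each piece of text (here, the frame at the midpoint of each segment, which is an arbitrary choice). The names `Segment`, `Chunk`, and `plan_chunks` are hypothetical. Decoding the video and grabbing the image at each timestamp is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A transcribed span of audio (hypothetical shape of a speech-to-text result)."""
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class Chunk:
    """One image + text pair to hand to a vision-language model."""
    frame_time: float  # timestamp (seconds) of the frame to extract
    text: str          # transcript text spoken around that frame


def plan_chunks(segments: List[Segment]) -> List[Chunk]:
    """Pair each non-empty transcript segment with the frame at its midpoint."""
    chunks = []
    for seg in segments:
        text = seg.text.strip()
        # Skip silent segments and malformed timestamps.
        if not text or seg.end < seg.start:
            continue
        chunks.append(Chunk(frame_time=(seg.start + seg.end) / 2, text=text))
    return chunks


if __name__ == "__main__":
    demo = [Segment(0.0, 4.0, " hello "), Segment(4.0, 6.0, ""), Segment(6.0, 10.0, "world")]
    for chunk in plan_chunks(demo):
        print(chunk)
```

Each resulting chunk could then be rendered as a still image followed by its text, which keeps the input within what an image + text model already accepts.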