`WhisperTranscriber` to add filename to document metadata

TuanaCelik commented 1 year ago

It would be great if we provided an option to add the filename to the metadata of the documents that the `WhisperTranscriber` creates. Currently there's no good way of doing this. It would really help when building RAG pipelines that query videos and need to reference the source video in the response.

TuanaCelik commented 12 months ago

Additional finding with @anakin87: even if we pass the meta through an indexing pipeline, as shown below, it gets ignored. I think this might be because the root node (Whisper) ignores the meta.

The indexing pipeline:

whisper = WhisperTranscriber(api_key=api_key)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=whisper, name="Whisper", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Whisper"])
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])

videos = ["https://www.youtube.com/watch?v=h5id4erwD4s", "https://www.youtube.com/watch?v=iFUeV3aYynI"]

# for video in videos:
file_path1 = youtube2audio("https://www.youtube.com/watch?v=h5id4erwD4s")
file_path2 = youtube2audio("https://www.youtube.com/watch?v=iFUeV3aYynI")
doc1 = {'file_path': file_path1, "url": "https://www.youtube.com/watch?v=h5id4erwD4s"}
doc2 = {'file_path': file_path2, "url": "https://www.youtube.com/watch?v=iFUeV3aYynI"}

indexing_pipeline.run(file_paths=[doc1['file_path'], doc2['file_path']], meta=[{"url": doc['url']} for doc in [doc1, doc2]])
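
A side note on the shape of `meta` in the call above: a per-file list needs one dict per file, so the comprehension's `for` clause has to sit outside the braces. A minimal standalone sketch in plain Python (no Haystack calls; the file paths are placeholders, the URLs are the two videos from this thread):

```python
# Placeholder file paths (assumed for illustration); URLs are from this thread.
docs = [
    {"file_path": "audio1.mp3", "url": "https://www.youtube.com/watch?v=h5id4erwD4s"},
    {"file_path": "audio2.mp3", "url": "https://www.youtube.com/watch?v=iFUeV3aYynI"},
]

# Dict comprehension wrapped in a list: the "url" key is overwritten on each
# iteration, so the list holds a single dict containing only the last URL.
single_dict = [{"url": doc["url"] for doc in docs}]

# List comprehension of dicts: one meta dict per file, in the same order.
per_file_meta = [{"url": doc["url"]} for doc in docs]

file_paths = [doc["file_path"] for doc in docs]
```

This only concerns how the argument is built. Whether the node then honors `meta` is the separate problem described in this issue.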

anakin87 commented 12 months ago

As Tuana said, meta is ignored.

See, for example, the run method: https://github.com/deepset-ai/haystack/blob/a5b815690ed7343882603a675c621ffc4c129c9b/haystack/nodes/audio/whisper_transcriber.py#L176-L186
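
Until the node forwards `meta`, one possible workaround is to transcribe first and merge the filename and any per-file metadata into the resulting documents afterwards. The sketch below is only an illustration of that merge step, under stated assumptions: `Doc` is a hypothetical stand-in rather than Haystack's `Document`, `attach_meta` is a made-up helper, the `name` key for the filename is an assumed convention, and the transcriber is assumed to yield exactly one document per input file, in order.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a transcribed document (not Haystack's Document)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def attach_meta(
    docs: List[Doc],
    file_paths: List[str],
    meta: Optional[List[Dict[str, Any]]] = None,
) -> List[Doc]:
    """Add the source filename and optional per-file meta to each document.

    Assumes docs[i] was produced from file_paths[i].
    """
    if len(docs) != len(file_paths):
        raise ValueError("expected one document per file path")
    if meta is not None and len(meta) != len(file_paths):
        raise ValueError("expected one meta dict per file path")

    for i, (doc, path) in enumerate(zip(docs, file_paths)):
        # "name" as the filename key is an assumption made for this sketch.
        doc.meta["name"] = Path(path).name
        if meta is not None:
            doc.meta.update(meta[i])
    return docs


# Usage with dummy transcripts and placeholder paths.
docs = attach_meta(
    [Doc("first transcript"), Doc("second transcript")],
    ["/tmp/talk_a.mp3", "/tmp/talk_b.mp3"],
    meta=[
        {"url": "https://www.youtube.com/watch?v=h5id4erwD4s"},
        {"url": "https://www.youtube.com/watch?v=iFUeV3aYynI"},
    ],
)
```

With real documents, the same loop would run on the transcriber's output before passing the documents on to the preprocessor. A fix inside the node itself would make this extra step unnecessary.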

deepset-ai/haystack #5716