bitcointranscripts / tstbtc

This CLI app transcribes audio and video for submission to the bitcointranscripts repo
MIT License

Consider temporary dir for metadata #125

Closed: carlaKC closed this issue 3 months ago

carlaKC commented 3 months ago

If the metadata directory isn't configured, tstbtc uses `/metadata` as its default. This fails on macOS with `Error while transcribing: [Errno 30] Read-only file system: '/metadata'`.

I haven't had much use for this data while using the tool myself, so perhaps this can just be a temporary dir?

Full logs

```
2024-08-12 10:57:45,861 [ERROR] (deepgram) Error writing JSON file for Lightning Specification Meeting: [Errno 30] Read-only file system: '/metadata'
2024-08-12 10:57:45,863 [ERROR] Error with the transcription: (deepgram) Error while transcribing: [Errno 30] Read-only file system: '/metadata'
2024-08-12 10:57:45,863 [INFO] Exited with error, not cleaning up temp files: /var/folders/5v/n5_hx3dd22x5cfv1c2h7hbk80000gn/T/tmp8d1290e_
Traceback (most recent call last):
  File "/Users/carla/Work/bitcointranscripts/tstbtc/venv/lib/python3.11/site-packages/app/services/deepgram.py", line 683, in transcribe
    transcript.transcription_service_output_file = self.write_to_json_file(
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/carla/Work/bitcointranscripts/tstbtc/venv/lib/python3.11/site-packages/app/services/deepgram.py", line 69, in write_to_json_file
    transcription_service_output_file = self.data_writer.write_json(
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/carla/Work/bitcointranscripts/tstbtc/venv/lib/python3.11/site-packages/app/data_writer.py", line 33, in write_json
    output_file = self.construct_file_path(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/carla/Work/bitcointranscripts/tstbtc/venv/lib/python3.11/site-packages/app/data_writer.py", line 45, in construct_file_path
    os.makedirs(target_file_path, exist_ok=True)
  File "", line 215, in makedirs
  File "", line 215, in makedirs
  File "", line 225, in makedirs
OSError: [Errno 30] Read-only file system: '/metadata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/carla/Work/bitcointranscripts/tstbtc/venv/lib/python3.11/site-packages/app/transcription.py", line 334, in start
    transcript = self.service.transcribe(transcript)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/carla/Work/bitcointranscripts/tstbtc/venv/lib/python3.11/site-packages/app/services/deepgram.py", line 694, in transcribe
    raise Exception(f"(deepgram) Error while transcribing: {e}")
Exception: (deepgram) Error while transcribing: [Errno 30] Read-only file system: '/metadata'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/carla/Work/bitcointranscripts/tstbtc/venv/lib/python3.11/site-packages/transcriber.py", line 288, in transcribe
    transcription.start()
  File "/Users/carla/Work/bitcointranscripts/tstbtc/venv/lib/python3.11/site-packages/app/transcription.py", line 342, in start
    raise Exception(f"Error with the transcription: {e}") from e
Exception: Error with the transcription: (deepgram) Error while transcribing: [Errno 30] Read-only file system: '/metadata'
```

kouloumos commented 3 months ago

Thanks for pointing that out. It seems there was a mistake with the path configuration; it was intended to be relative, but an absolute path was used instead. I've addressed this issue with https://github.com/bitcointranscripts/tstbtc/commit/1845a6ed5e96a05f00445b0391f13dc988f69536
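
For readers unfamiliar with the distinction, here is a minimal sketch of how an absolute default differs from a relative one when joined onto a base directory. It is an illustration only, not the contents of the linked commit; `resolve_dir`, the two constants, and the example base path are hypothetical names chosen for this sketch.

```python
from pathlib import PurePosixPath

# Hypothetical defaults for illustration; not tstbtc's actual settings.
ABSOLUTE_DEFAULT = "/metadata"  # leading slash: anchored at the filesystem root
RELATIVE_DEFAULT = "metadata"   # no leading slash: resolved against a base dir


def resolve_dir(configured: str, base: PurePosixPath) -> PurePosixPath:
    """Join `configured` onto `base`.

    With pathlib, joining an absolute path discards everything before it,
    so an absolute `configured` value ignores `base` entirely.
    """
    return base / configured


base = PurePosixPath("/home/user/project")  # assumed working directory
print(resolve_dir(ABSOLUTE_DEFAULT, base))  # /metadata
print(resolve_dir(RELATIVE_DEFAULT, base))  # /home/user/project/metadata
```

The first result points at the filesystem root, which is read-only on recent macOS versions, while the second stays inside the working directory.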

To clarify, the metadata directory contains raw transcription model outputs (from Whisper or Deepgram). This data is important for recreating the final markdown output if there are errors during post-processing or if you need to reprocess the data in the future.

Initially, the design didn’t include a temporary directory because the goal is to develop the bitcointranscripts-metadata repository into an archive for these data. Exposing the metadata to users was the first step, with plans to automate commits to this repository in the future.

I understand that in your case, the original AI-generated transcripts are not publicly exposed. I can add a configuration option for a temporary directory to better suit your needs.
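
A rough sketch of what such an option might look like, assuming a hypothetical helper `resolve_metadata_dir` with a hypothetical `use_temp` flag (neither exists in tstbtc as far as this thread shows). It prefers an explicitly configured directory, falls back to a fresh temporary directory when asked, and otherwise uses a `metadata` directory relative to the current working directory.

```python
import os
import tempfile
from typing import Optional


def resolve_metadata_dir(configured: Optional[str] = None,
                         use_temp: bool = False) -> str:
    """Return a writable directory for raw transcription outputs.

    Hypothetical helper for illustration; names and behavior are assumptions.
    """
    if configured:
        # An explicit setting always wins.
        path = configured
    elif use_temp:
        # mkdtemp creates a new, uniquely named directory and returns its path.
        return tempfile.mkdtemp(prefix="tstbtc-metadata-")
    else:
        # Relative default: lives under the current working directory.
        path = os.path.join(os.getcwd(), "metadata")
    os.makedirs(path, exist_ok=True)
    return path
```

Note that a directory created by `tempfile.mkdtemp` is not removed automatically, so the caller would be responsible for cleaning it up.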

carlaKC commented 3 months ago

> I understand that in your case, the original AI-generated transcripts are not publicly exposed. I can add a configuration option for a temporary directory to better suit your needs.

No worries! I don't specifically need the metadata in a temporary dir - that's just the usual place I throw data I don't need. The fix to just use a relative path works just fine for my use case :)