V3ntus / quint

State of the art audio summarization model.
MIT License

Post Chunk Analysis Tool for Sanity Checking the Success of the Output Files #15

Open turnkit opened 1 year ago

turnkit commented 1 year ago

Write a post-chunk utility that logs the file name of each chunked file (i.e. each file with paragraph breaks added) along with the stats below.

The stats will be sorted manually in Google Sheets and will serve as manual "sanity checks" of the output files.

  1. the character count of the shortest line in the file (a number)
  2. the sentence count of the paragraph (a line) with the fewest sentences (a number)
  3. the average (mean) number of sentences per paragraph (a number)
  4. the sentence count of the paragraph with the most sentences (a number)
  5. the character count of the longest line in the file (a number)

These stats should be comma-delimited so that the file imports into a spreadsheet for analysis and for potential metadata import into the hosting system.

This might be as straightforward as doing a count of periods and question marks in each line, and then looking for the min, avg, and max.
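A minimal sketch of that counting approach, in Python. The function name `chunk_stats` is hypothetical, and two things are assumptions rather than anything decided in this issue: blank lines are skipped instead of being counted as zero-length paragraphs, and every `.` or `?` counts as one sentence ending (so abbreviations and ellipses will overcount).

```python
import re
from statistics import mean

# Assumption: a sentence ends at a period or a question mark, as suggested above.
SENTENCE_END = re.compile(r"[.?]")


def chunk_stats(text):
    """Return (min_line_char_cnt, min_sent_cnt, mean_sent_cnt,
    max_sent_cnt, max_line_char_cnt) for one chunked file's text."""
    # Assumption: blank lines only separate paragraphs, so they are ignored.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return (0, 0, 0.0, 0, 0)

    char_counts = [len(ln) for ln in lines]
    sent_counts = [len(SENTENCE_END.findall(ln)) for ln in lines]

    return (
        min(char_counts),
        min(sent_counts),
        round(mean(sent_counts), 2),
        max(sent_counts),
        max(char_counts),
    )


sample = "One. Two?\n\nThree. Four. Five.\n"
print(chunk_stats(sample))
```

A paragraph with a sentence count of 0 (no terminal punctuation at all) would show up as the minimum, which is probably useful for the sanity check rather than a problem.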

Example output (post_chunk_analysis.csv):

TITLE: POST-CHUNK ANALYSIS UTILITY OUTPUT:
filename, min_line_char_cnt, min_sent_cnt, mean_sent_cnt, max_sent_cnt, max_line_char_cnt
SID0004.mp3.txt_out.txt, 62, 2, 4.72, 25, 522
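Writing rows in that layout could look roughly like the sketch below, using Python's standard `csv` module. `write_report` and `HEADER` are hypothetical names; the column names are taken from the example above. Note that `csv.writer` emits `,` without the trailing space shown in the example, and this sketch leaves out the `TITLE:` line, assuming a plain header row imports more cleanly into a spreadsheet.

```python
import csv
import io

# Column names copied from the example output in this issue.
HEADER = [
    "filename",
    "min_line_char_cnt",
    "min_sent_cnt",
    "mean_sent_cnt",
    "max_sent_cnt",
    "max_line_char_cnt",
]


def write_report(rows, out):
    """Write the header plus one row per analysed file to a text stream."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


# Demo with an in-memory buffer; a real run would open
# post_chunk_analysis.csv with open(..., "w", newline="") instead.
buf = io.StringIO()
write_report([("SID0004.mp3.txt_out.txt", 62, 2, 4.72, 25, 522)], buf)
print(buf.getvalue())
```

Using the `csv` module rather than joining strings by hand means a file name containing a comma gets quoted automatically.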

possible useful references: