Open turnkit opened 1 year ago
Write a post-chunk utility that logs the file name of each chunked (paragraph-added) file, along with the stats below.
The stats will be sorted manually in Google Sheets and will serve as manual "sanity checks" of the output files.
The stats should be comma-delimited so that the file imports into a spreadsheet for analysis and, potentially, for metadata import into the hosting system.
This might be as straightforward as counting the periods and question marks in each line, and then taking the min, avg, and max.
e.g. output: post_chunk_analysis.csv:
TITLE: POST-CHUNK ANALYSIS UTILITY OUTPUT:
filename, min_line_char_cnt, min_sent_cnt, mean_sent_cnt, max_sent_cnt, max_line_char_cnt
SID0004.mp3.txt_out.txt, 62, 2, 4.72, 25, 522
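A minimal sketch of how this could look in Python, assuming each non-blank line of a chunked file is one paragraph and that a "sentence" is approximated by counting periods and question marks, as suggested above. The function names (`line_stats`, `write_report`), the `*_out.txt` glob pattern, and the default output name are illustrative assumptions, not part of any existing code.

```python
import csv
import sys
from pathlib import Path
from statistics import mean

# Column names taken from the example output in this issue.
HEADER = ["filename", "min_line_char_cnt", "min_sent_cnt",
          "mean_sent_cnt", "max_sent_cnt", "max_line_char_cnt"]


def line_stats(text):
    """Return the five stats for one file's text, or None if it has no lines.

    Assumption: every non-blank line is one paragraph, and the number of
    sentences in it is the number of '.' and '?' characters it contains.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    char_counts = [len(ln) for ln in lines]
    sent_counts = [ln.count(".") + ln.count("?") for ln in lines]
    return [min(char_counts), min(sent_counts),
            round(mean(sent_counts), 2),
            max(sent_counts), max(char_counts)]


def write_report(folder, out_csv="post_chunk_analysis.csv", pattern="*_out.txt"):
    """Write one CSV row per chunked file found in `folder` (hypothetical layout)."""
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(HEADER)
        for path in sorted(Path(folder).glob(pattern)):
            stats = line_stats(path.read_text(encoding="utf-8"))
            if stats is not None:
                writer.writerow([path.name, *stats])


# Run as: python post_chunk_analysis.py <folder_with_chunked_files>
if __name__ == "__main__" and len(sys.argv) > 1:
    write_report(sys.argv[1])
```

Counting raw punctuation is crude: abbreviations, decimals, and ellipses inflate the sentence count, and sentences ending in "!" are missed. For a sanity check that only needs to flag unusually short or long paragraphs, that may be acceptable.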