anidl / multi-downloader-nx

Downloader for Crunchyroll, Hidive, AnimeOnegai, and AnimationDigitalNetwork with CLI and GUI
MIT License

[Feedback]: Enhancement ideas for multi-audio Crunchyroll (or any video) #599

Open Burve opened 5 months ago

Burve commented 5 months ago

Type

Both

Suggestion

I found this project recently because of the DRM drama. Since it works while yt-dlp does not, I have noticed a few issues that I had already fixed in my custom app. One of the issues I noticed right away is the lack of proper multi-audio support.

I have only tried a few shows so far, and I noticed that sometimes both languages (Eng and Jpn, in my case) were downloaded, but only the first language was kept, because the two videos are not the same length. My assumption (without checking the code) is that if both videos are the same length they are merged, but if the lengths differ, the merge silently fails.

In reality, CR videos of different languages often differ in length (except for some of the latest ones). The reason is that each language has a different logo at the start of the video, and the rest matches. Sometimes the same happens at the end, but mostly it is at the beginning.
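
For illustration, here is a minimal sketch (not from the project) of how one could confirm such a length mismatch between two language variants, assuming ffprobe is on the PATH and using hypothetical file names:

```
# Sketch: compare durations of two language variants with ffprobe.
# Assumes ffprobe is on PATH; the file names are hypothetical.
import subprocess

def duration(path: str) -> float:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

jpn = duration("episode_jpn.mp4")  # hypothetical file
eng = duration("episode_eng.mp4")  # hypothetical file
print(f"difference: {abs(jpn - eng):.3f}s")  # often the length of the extra logo
```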

I have some code (in Python, not TypeScript) that can combine around 95% of CR videos automagically, and 99% with a little manual tweaking (there are always some with extra clips in the middle of the video, and I never figured those out).

If this is of interest (and I would like to help bring this app up to the level my personal yt-dlp app had before DRM), I am willing to share the general logic or even some Python code.

Please let me know if that might be interesting.

P.S. I have also noticed another big issue with duplicate subtitles in the same language, but I will create a separate ticket about that.

Jaynator495 commented 5 months ago

You can actually use --syncTiming, and for most videos it will sync the audio timing to the video (assuming I remembered to implement it for MPDs). I would like to hear how your Python script works and its general ideas, though, as it may help improve the syncTiming flag.

Alternatively, if you want to keep all the videos, that's also a flag (--keepAllVideos). You can also choose to just not download the extra videos with the --dlVideoOnce flag. I definitely recommend checking over the available flags in the documentation: https://github.com/anidl/multi-downloader-nx/blob/master/docs/DOCUMENTATION.md

Burve commented 5 months ago

It is good to know about those commands. I started with the GUI, but I should check the CLI as well.

Short version of the audio sync:

Because the audio comes in different languages, I decided that matching by audio would be hard, so I did it all by video. My assumption is: as long as I can match the video frames, the audio will be fine.

With that in mind, using ffmpeg, I process the beginning and the end of the video (usually 30, 60, or 120 seconds; longer might be more precise but is a little more work). For that duration I extract keyframes, which come with their actual frame numbers, and the time can be calculated from the frame number. Then I look for a frame set (2 or 3 frames) with the same content in both files. For example, if the Japanese video starts with the opening song while the English video starts with a company logo and then the opening song, you will eventually find a pair of frames from the opening song with the same content in both. For the comparison I used an image hash. Once the pair is found in each video, I calculate the difference between them (usually the length of the extra logo, etc.). After this, I repeat the same procedure at the end of the video.
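
The full extraction code is posted below; as a minimal standalone sketch of just the matching primitive (assuming Pillow and the imagehash package, with hypothetical frame file names), two frames count as "the same content" when the Hamming distance between their perceptual hashes is small:

```
# Sketch of the matching primitive: two frames are "the same content"
# when the Hamming distance between their perceptual hashes is small.
# Assumes Pillow and the imagehash package; frame files are hypothetical.
from PIL import Image
import imagehash

def frame_time(frame_number: int, fps: float) -> float:
    # A frame number converts back to a timestamp via the frame rate.
    return frame_number / fps

h1 = imagehash.phash(Image.open("frame_jpn_0042.png"))  # hypothetical frame
h2 = imagehash.phash(Image.open("frame_eng_0101.png"))  # hypothetical frame
if h1 - h2 < 2:  # subtracting ImageHash objects gives the Hamming distance
    print("same content; use the timestamp difference as the offset")
```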

The next step is to compare the offsets (the difference between the frame pairs at the beginning and at the end). In the ideal case, the difference between the start and end offsets is below 0.05 s, which is perfect. Sometimes it is around 0.8 s, and that is still good. In rare cases it is > 1 s, which is not good and is an ideal candidate for a manual check. Sometimes something goes wrong and no match is found, or the difference is something like 20 s, and then nothing will work. For these bad cases, I had an option in the code to create a giant JPG of the keyframes with their timestamps so I could check manually, with the ability to override the offset.
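
The sanity check itself is tiny; a sketch of the logic just described, with the thresholds quoted above (the function name is mine, not from the code):

```
# Sketch: accept the start offset only when the end of the video agrees.
def offset_is_reliable(start_offset: float, end_offset: float,
                       tolerance: float = 0.05) -> bool:
    # < 0.05 s is perfect, ~0.8 s is still usable, > 1 s needs a manual check
    return abs(abs(start_offset) - abs(end_offset)) < tolerance
```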

After all this, just use the calculated offset for each language's audio and subtitles. (There is still a subtitle issue I have not mentioned: a Japanese video can come with English subs, while the English audio also comes with English subs [those subs cover signs and other on-screen text, not dialogue].)
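
The posted code computes the offsets but does not show applying them; one possible way (a sketch, not necessarily how multi-downloader-nx muxes) is ffmpeg's -itsoffset, which shifts the timestamps of the input that follows it:

```
# Sketch: apply a computed offset to an audio track and its subtitles while
# muxing. -itsoffset shifts the input that follows it. All file names are
# hypothetical, and the base video is assumed to carry the Japanese audio.
import subprocess

def mux_with_offset(video: str, audio: str, subs: str,
                    offset: float, out: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video,                              # base video + Japanese audio
        "-itsoffset", str(offset), "-i", audio,   # shifted English audio
        "-itsoffset", str(offset), "-i", subs,    # shifted English signs subs
        "-map", "0:v", "-map", "0:a", "-map", "1:a", "-map", "2:s",
        "-c", "copy", out,
    ], check=True)

mux_with_offset("ep_jpn.mp4", "ep_eng.m4a", "ep_eng_signs.ass", 1.25, "ep.mkv")
```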

Technical version: I used Claude 3 Opus to convert the Python code to TypeScript. The Python code is also included.

Getting frame data code

```
import * as ffmpeg from 'fluent-ffmpeg';
import * as fs from 'fs';
import * as path from 'path';
import * as imagehash from 'image-hash';
import { Image } from 'image-js';

interface StartFrame {
  time: number;
  frame: number;
  hash: string;
}

class VideoProcessor {
  private video_file_path: string;
  private video_height: number;
  private video_width: number;
  private fps: number;
  private duration: number;
  private check_seconds: number;
  private use_third: boolean;
  private debug: boolean;
  private temp: string;
  private language: { code3: string };
  private start_frames: StartFrame[];
  private end_frames: StartFrame[];

  constructor(/* ... */) {
    // Initialize the properties in the constructor
    // ...
  }

  public async processing_frames(): Promise<void> {
    console.log('Processing Frames');
    const process_time = this.check_seconds;
    const offset_time = Math.round(this.duration * 10) / 10 - 120 - process_time;

    const process = async (duration: number, offset: number, prefix: string): Promise<StartFrame[]> => {
      // Start time for measuring processing duration
      const start = Date.now();

      // Extract scene-change frames from the video using ffmpeg
      const { stdout, stderr } = await new Promise<{ stdout: Buffer; stderr: string }>((resolve, reject) => {
        ffmpeg(this.video_file_path)
          .input(this.video_file_path)
          .seekInput(offset)
          .videoFilters([
            { filter: 'select', options: 'gt(scene\\,0.1)' },
            { filter: 'showinfo' }
          ])
          .output('pipe:')
          .outputOptions([
            '-t', duration.toString(),
            '-fps_mode', 'vfr',
            '-frame_pts', '1',
            '-f', 'rawvideo',
            '-pix_fmt', 'rgb24'
          ])
          .on('error', (err, stdout, stderr) => {
            reject(err);
          })
          .on('end', (stdout, stderr) => {
            resolve({ stdout, stderr });
          })
          .run();
      });

      // End time for measuring processing duration
      const end = Date.now();
      if (this.debug) {
        console.log(`Time to extract frame data: ${end - start}ms`);
      }

      // Decode the error output from ffmpeg (showinfo logs to stderr)
      const err = stderr.toString('utf-8');

      // Initialize lists to store timeframes, start frames, and raw frame data
      const timeframes: number[] = [];
      const start_frames: StartFrame[] = [];
      const start_video: Uint8Array[] = [];

      // Split the raw video bytes into one chunk per frame
      const temp_start_video = new Uint8Array(stdout).reduce((resultArray: number[][], item, index) => {
        const chunkIndex = Math.floor(index / (this.video_width * this.video_height * 3));
        if (!resultArray[chunkIndex]) {
          resultArray[chunkIndex] = [];
        }
        resultArray[chunkIndex].push(item);
        return resultArray;
      }, [] as number[][]);

      // If using the middle third of the frames, crop the frames accordingly
      if (this.use_third) {
        for (const frame of temp_start_video) {
          const im_pil = await Image.load(frame);
          const crop_x1 = 0;
          const crop_x2 = im_pil.width;
          const crop_y1 = Math.floor(im_pil.height * 0.333);
          const crop_y2 = Math.floor(im_pil.height * 0.666);
          const third_image = im_pil.crop({
            x: crop_x1,
            y: crop_y1,
            width: crop_x2 - crop_x1,
            height: crop_y2 - crop_y1
          });
          start_video.push(await third_image.toUint8Array());
        }
      } else {
        start_video.push(...temp_start_video.map((frame) => new Uint8Array(frame)));
      }

      // Extract timeframes from the ffmpeg error output
      const regex = /pts_time(.*?)duration/g;
      let match;
      while ((match = regex.exec(err)) !== null) {
        const timeStr = match[1].trim();
        const time = parseFloat(timeStr);
        timeframes.push(time);
      }

      // Create a list of start frames with their time, frame number, and image hash
      for (let i = 0; i < start_video.length; i++) {
        const time = timeframes[i] + offset;
        const frame = Math.floor(time * this.fps);
        const hash = await imagehash.hash(await Image.load(start_video[i]), 8, 'binary');
        start_frames.push({ time, frame, hash });
      }

      // If in debug mode, create a collage of the start frames and save it
      if (this.debug) {
        const details = path.join(this.temp, `v_${this.language.code3}_${prefix}.png`);
        if (fs.existsSync(details)) {
          fs.unlinkSync(details);
        }
        const base_image = await Image.load(start_video[0]);
        const new_height = 320;
        const new_width = Math.floor((new_height / base_image.height) * base_image.width);
        const horizontal_images = 5;
        const vertical_images = Math.ceil(start_video.length / horizontal_images);
        const collage = new Image(new_width * horizontal_images, new_height * vertical_images);
        let image_id = 0;

        // Iterate over the start frames and paste them onto the collage
        for (let y = 0; y < vertical_images; y++) {
          for (let x = 0; x < horizontal_images; x++) {
            if (image_id < start_frames.length) {
              const frame_image = await Image.load(start_video[image_id]);
              const resized_image = frame_image.resize({ width: new_width, height: new_height });
              collage.drawImage(resized_image, { x: x * new_width, y: y * new_height });
              collage.drawText(`${start_frames[image_id].time} (${start_frames[image_id].frame})`, {
                x: x * new_width,
                y: y * new_height,
                color: [255, 0, 0]
              });
              image_id++;
            }
          }
        }
        await collage.save(details);
      }

      // Return the list of start frames
      return start_frames;
    };

    // Process the start frames
    this.start_frames = await process(process_time, 0, 'start');
    // Process the end frames
    this.end_frames = await process(process_time, offset_time, 'end');
  }
}
```
Getting frame data code, Python version

```
# Excerpt: a method of the video-data class; requires ffmpeg-python, numpy,
# Pillow (Image, ImageDraw, ImageFont) and imagehash; re/time/os/math are stdlib.
def processing_frames(self):
    print("Processing Frames")
    process_time = self.check_seconds
    offset_time = round(self.duration, 4) - 120 - process_time

    def process(duration, offset, prefix):
        # Start time for measuring processing duration
        start = time.time()
        # Extract scene-change frames from the video using ffmpeg
        out, err = (
            ffmpeg
            .input(self.video_file_path, ss=offset)
            .filter('select', 'gt(scene, 0.1)')
            .filter('showinfo')
            .output('pipe:', t=duration, fps_mode='vfr', frame_pts=True,
                    format='rawvideo', pix_fmt='rgb24')
            .run(capture_stdout=True, capture_stderr=True)
        )
        # End time for measuring processing duration
        end = time.time()
        if self.debug:
            print("Time to extract frame data: {}".format(end - start))
        # Decode the error output from ffmpeg (showinfo logs to stderr)
        err = err.decode("utf-8")
        # Initialize lists to store timeframes, start frames, and raw frame data
        timeframes = []
        start_frames = []
        start_video = []
        # Convert the raw video data to a numpy array, one row per frame
        temp_start_video = (
            np
            .frombuffer(out, np.uint8)
            .reshape([-1, self.video_height, self.video_width, 3])
        )
        # If using the middle third of the frames, crop the frames accordingly
        if self.use_third:
            for frame in temp_start_video:
                im_pil = Image.fromarray(frame)
                crop_x1 = 0
                crop_x2 = im_pil.width
                crop_y1 = int(im_pil.height * 0.333)
                crop_y2 = int(im_pil.height * 0.666)
                third_image = im_pil.crop((crop_x1, crop_y1, crop_x2, crop_y2))
                start_video.append(np.array(third_image))
        else:
            start_video = temp_start_video
        # Extract timeframes from the ffmpeg error output
        for line in iter(err.splitlines()):
            if 'pts_time' in line and 'duration' in line:
                timeframes.append(float(
                    re.findall(r"\d+(?:\.\d+)?",
                               re.search('pts_time(.*)duration', line).group(1))[0]))
        # Create a list of start frames with their corresponding time,
        # frame number, and image hash
        for i in range(len(start_video)):
            start_frames.append({
                'time': timeframes[i] + offset,
                'frame': int((timeframes[i] + offset) * self.fps),
                'hash': imagehash.phash(Image.fromarray(start_video[i]))
            })
        # If in debug mode, create a collage of the start frames and save it
        if self.debug:
            details = f'{self.temp}{os.path.sep}v_{self.language.code3}_{prefix}.png'
            if os.path.exists(details):
                os.remove(details)
            base_image = Image.fromarray(start_video[0])
            new_height = 320
            new_width = int(new_height / base_image.height * base_image.width)
            horizontal_images = 5
            vertical_images = math.ceil(len(start_video) / horizontal_images)
            collage = Image.new("RGBA", (new_width * horizontal_images,
                                         new_height * vertical_images))
            image_id = 0
            font = ImageFont.truetype(r"C:\Windows\Fonts\arial.ttf", 24)
            draw = ImageDraw.Draw(collage)
            # Iterate over the start frames and paste them onto the collage
            for y in range(vertical_images):
                for x in range(horizontal_images):
                    if len(start_frames) > image_id:
                        collage.paste(
                            Image.fromarray(start_video[image_id])
                            .convert("RGBA")
                            .resize((new_width, new_height)),
                            (x * new_width, y * new_height))
                        draw.text(
                            (x * new_width, y * new_height),
                            text=f'{start_frames[image_id]["time"]} '
                                 f'({start_frames[image_id]["frame"]})',
                            font=font, align="left", fill="red")
                        image_id += 1
            collage.save(details)
        # Return the list of start frames
        return start_frames

    # Process the start frames
    self.start_frames = process(process_time, 0, 'start')
    # Process the end frames
    self.end_frames = process(process_time, offset_time, 'end')
```

The offset calculation in the provided code compares the current video against the provided base video. So, if the provided base is the same as the current video, nothing is calculated.

Comparing frame data

```
interface FrameData {
  hash: number;
  time: number;
}

interface VideoData {
  start_frames: FrameData[];
  end_frames: FrameData[];
  language: { english_name: string };
  threshold: number;
  forced_offset: boolean;
  forced_offset_value: number;
  // Result fields written by calculate_offset
  have_offset: boolean;
  offset: number;
}

/**
 * Calculate the offset between base_data and this video.
 * Converted from a class method, so `this` is the VideoData being compared.
 * @param base_data The base video data to compare against.
 */
function calculate_offset(this: VideoData, base_data: VideoData): void {
  if (base_data === this) {
    return;
  }

  /**
   * Compare frames between base_frames and current_frames to find the offset.
   * @param base_frames Frames from the base video data.
   * @param current_frames Frames from the current video data.
   * @param reverse Whether to compare frames in reverse order.
   * @return A tuple indicating if an offset is found and the offset value.
   */
  function compare_frames(base_frames: FrameData[], current_frames: FrameData[], reverse = false): [boolean, number] {
    let have_offset = false;
    let offset = 0;
    const hash_threshold = 1;
    // Hoisted out of the loop so the match survives it (TS block scoping)
    let base_index = 0;
    let pair_index = 0;

    // Iterate over frames in the specified order
    for (let i = reverse ? base_frames.length - 2 : 0;
         reverse ? i >= 0 : i < base_frames.length - 1;
         reverse ? i-- : i++) {
      base_index = i;
      const base_second_index = i + 1;
      let base_check_value = 64;
      let pair_check_value = 64;
      pair_index = 0;
      let pair_second_index = 0;

      // Compare hash values of frames between base_frames and current_frames
      for (let j = reverse ? current_frames.length - 1 : 0;
           reverse ? j >= 0 : j < current_frames.length;
           reverse ? j-- : j++) {
        let hash_diff = base_frames[base_index].hash - current_frames[j].hash;
        if (hash_diff < hash_threshold && hash_diff < base_check_value) {
          base_check_value = hash_diff;
          pair_index = j;
        }
        hash_diff = base_frames[base_second_index].hash - current_frames[j].hash;
        if (hash_diff < hash_threshold && hash_diff < pair_check_value) {
          pair_check_value = hash_diff;
          pair_second_index = j;
        }
      }

      // Check if consecutive frames match
      if (pair_index + 1 === pair_second_index) {
        have_offset = true;
        break;
      }
    }

    if (have_offset) {
      offset = base_frames[base_index].time - current_frames[pair_index].time;
    }
    return [have_offset, offset];
  }

  // Calculate offset using start frames if not forced
  if (!this.forced_offset) {
    [this.have_offset, this.offset] = compare_frames(base_data.start_frames, this.start_frames, true);
    // Calculate offset using end frames and check tolerance
    const [have_end_offset, end_offset] = compare_frames(base_data.end_frames, this.end_frames, true);
    const end_tolerance = this.threshold;
    const check_end = Math.abs(Math.abs(this.offset) - Math.abs(end_offset)) < end_tolerance;
    this.have_offset = this.have_offset && check_end;
    this.offset = Math.round(this.offset * 10000000) / 10000000;
    console.log(
      `${this.language.english_name} offset ${this.offset} is used - ` +
      `${this.have_offset ? '\x1b[32m' : '\x1b[31m'}${this.have_offset}\x1b[0m. ` +
      `End tolerance ${Math.abs(Math.abs(this.offset) - Math.abs(end_offset))} (needed ${end_tolerance} to pass)`
    );
  } else {
    // Use forced offset value
    this.have_offset = true;
    this.offset = this.forced_offset_value;
    console.log(`Using \x1b[34mForced offset\x1b[0m \x1b[33m${this.offset}\x1b[0m`);
  }
}
```
Comparing frame data in Python

```
# Excerpt: a method of the VideoData class; `console` is a rich Console instance.
def calculate_offset(self, base_data):
    """
    Calculate the offset between base_data and self.

    :param VideoData base_data: The base video data to compare against.
    """
    if base_data == self:
        return

    def compare_frames(base_frames, current_frames, reverse=False):
        """
        Compare frames between base_frames and current_frames to find the offset.

        :param base_frames: Frames from the base video data.
        :param current_frames: Frames from the current video data.
        :param reverse: Whether to compare frames in reverse order.
        :return: A tuple indicating if an offset is found and the offset value.
        """
        have_offset = False
        offset = 0
        hash_threshold = 1
        # Iterate over frames in the specified order
        for i in (reversed(range(len(base_frames) - 1)) if reverse
                  else range(len(base_frames) - 1)):
            base_index = i
            base_second_index = i + 1
            base_check_value = 64
            pair_check_value = 64
            pair_index = 0
            pair_second_index = 0
            # Compare hash values of frames between base_frames and current_frames
            for j in (reversed(range(len(current_frames))) if reverse
                      else range(len(current_frames))):
                hash_diff = base_frames[base_index]['hash'] - current_frames[j]['hash']
                if hash_diff < hash_threshold and hash_diff < base_check_value:
                    base_check_value = hash_diff
                    pair_index = j
                hash_diff = base_frames[base_second_index]['hash'] - current_frames[j]['hash']
                if hash_diff < hash_threshold and hash_diff < pair_check_value:
                    pair_check_value = hash_diff
                    pair_second_index = j
            # Check if consecutive frames match
            if pair_index + 1 == pair_second_index:
                have_offset = True
                break
        if have_offset:
            offset = base_frames[base_index]['time'] - current_frames[pair_index]['time']
        return have_offset, offset

    # Calculate offset using start frames if not forced
    if not self.forced_offset:
        self.have_offset, self.offset = compare_frames(base_data.start_frames, self.start_frames, True)
        # Calculate offset using end frames and check tolerance
        have_end_offset, end_offset = compare_frames(base_data.end_frames, self.end_frames, True)
        end_tolerance = self.threshold
        check_end = abs(abs(self.offset) - abs(end_offset)) < end_tolerance
        self.have_offset = self.have_offset and check_end
        self.offset = round(self.offset, 7)
        color = 'green' if self.have_offset else 'red'
        console.print(f'{self.language.english_name} offset {self.offset} is used - '
                      f'[{color}]{self.have_offset}[/{color}]. '
                      f'End tolerance {abs(abs(self.offset) - abs(end_offset))} '
                      f'(needed {end_tolerance} to pass)')
    else:
        # Use forced offset value
        self.have_offset = True
        self.offset = self.forced_offset_value
        console.print(f'Using [blue]Forced offset[/blue] [yellow]{self.offset}[/yellow]')
```
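
For context, a hypothetical driver for the Python excerpts above; the VideoData constructor is not shown in the snippets, so this only sketches the intended call order:

```
# Hypothetical usage of the excerpts above. VideoData's constructor is not
# shown in the snippets, so its signature here is an assumption.
base = VideoData("episode_jpn.mp4")    # assumed constructor and file name
other = VideoData("episode_eng.mp4")   # assumed constructor and file name

base.processing_frames()     # extract scene-change frames at start and end
other.processing_frames()
other.calculate_offset(base)  # sets other.have_offset / other.offset
if other.have_offset:
    print(f"shift English tracks by {other.offset}s")
```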

I am sure I forgot some details about the code and will be happy to answer anything.