clamsproject / app-swt-detection

CLAMS app for detecting scenes with text from video input
Apache License 2.0

classifier is very slow #69

Open keighrim opened 7 months ago

keighrim commented 7 months ago

Bug Description

The Classifier class has a process_video method that iterates over the video frame by frame (at a sample rate), extracting the image, computing feature vectors, and running classification. We found that the method is very slow, even on GPU. I did some manual profiling, and I believe we should work on improving the video seeking code.

Reproduction steps

With

$ clams source video:/some.mp4 | curl -d@- "localhost:23231?stopAt=120000"  # use first 12 secs only

logging says:

2024-02-08 12:22:25 http://apps.clams.ai/swt-detection/unresolvable INFO     139987732137536 Processing video d1                                                               
2024-02-08 12:22:38 http://apps.clams.ai/swt-detection/unresolvable INFO     139987732137536 Processing took 12.783082723617554 seconds

Hmm, that's real-time. This was measured on a GPU machine, and GPU utilization was confirmed.

So I wrote some quick and simple profiling:

    def process_video(self, mp4_file: str) -> list:
        """Loops over the frames in a video and for each frame extracts the features
        and applies the classifier. Returns a list of predictions, where each prediction
        is an instance of numpy.ndarray."""
        featurizing_time = 0
        classifier_time = 0
        extract_time = 0
        seek_time = 0
        if self.debug:
            print(f'Processing {mp4_file}...')
            print(f'Labels: {self.prebin_labels}')
        logging.info(f'processing {mp4_file}...')
        predictions = []
        vidcap = cv2.VideoCapture(mp4_file)
        if not vidcap.isOpened():
            raise IOError(f'Could not open {mp4_file}')
        fps = round(vidcap.get(cv2.CAP_PROP_FPS), 2)
        fc = vidcap.get(cv2.CAP_PROP_FRAME_COUNT)
        dur = round(fc / fps, 3) * 1000
        for ms in range(0, sys.maxsize, self.sample_rate):
            if ms < self.start_at:
                continue
            if ms > self.stop_at:
                break
            t = time.time()
            vidcap.set(cv2.CAP_PROP_POS_MSEC, ms)
            seek_time += time.time() - t
            t = time.time()
            success, image = vidcap.read()
            extract_time += time.time() - t
            if not success:
                break
            img = Image.fromarray(image[:,:,::-1])
            t = time.time()
            features = self.featurizer.get_full_feature_vectors(img, ms, dur)
            featurizing_time += time.time() - t
            t = time.time()
            prediction = self.classifier(features).detach()
            prediction = Prediction(ms, self.prebin_labels, prediction)
            classifier_time += time.time() - t
            if self.debug:
                print(prediction)
            predictions.append(prediction)
        sys.stderr.write(f'Featurizing time: {featurizing_time:.2f} seconds\n')
        sys.stderr.write(f'Classifier time: {classifier_time:.2f} seconds\n')
        sys.stderr.write(f'Extract time: {extract_time:.2f} seconds\n')
        sys.stderr.write(f'Seeking time: {seek_time:.2f} seconds\n')
        return predictions

And with the same curl call, I got

2024-02-10 02:38:13 http://apps.clams.ai/swt-detection/unresolvable INFO     139892458784320 Processing video d1
Featurizing time: 6.11 seconds
Classifier time: 0.05 seconds
Extract time: 0.07 seconds
Seeking time: 5.98 seconds
2024-02-10 02:38:26 http://apps.clams.ai/swt-detection/unresolvable INFO     139892458784320 Processing took 12.812922716140747 seconds

So it turns out that "moving the cursor" in the video takes as much time as the complex CNN feature extraction.

The cv2 build I was using was compiled with ffmpeg, and to my knowledge, seeking to a keyframe is a constant-time operation (correct me if I'm wrong, please).

I can only guess that using POS_MSEC instead of the frame number adds a big overhead. Further experiments are probably needed.
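If POS_MSEC is indeed the culprit, one cheap experiment would be to convert each millisecond position to a frame index and seek with CAP_PROP_POS_FRAMES instead. A minimal sketch of the conversion (`ms_to_frame` is a hypothetical helper, not something in the codebase):

```python
def ms_to_frame(ms: int, fps: float) -> int:
    """Map a millisecond position to the nearest frame index."""
    return round(ms * fps / 1000)

# The seek in process_video could then become:
#   vidcap.set(cv2.CAP_PROP_POS_FRAMES, ms_to_frame(ms, fps))
# instead of:
#   vidcap.set(cv2.CAP_PROP_POS_MSEC, ms)
```

Whether this is actually faster depends on the backend; timing both variants on the same file would settle it.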

Expected behavior

No response

Screenshots

No response

Additional context

https://forum.opencv.org/t/cap-prop-pos-frames-is-abnormal-slow/11651/3 (this is possibly related to #67 too)

video file info

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/llc_data/clams/wgbh/Peabody/cpb-aacip-526-4b2x34nn45.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: mp42isom
    creation_time   : 2020-01-22T10:27:50.000000Z
  Duration: 00:33:09.37, start: 0.000000, bitrate: 795 kb/s
  Stream #0:0[0x1](und): Video: h264 (Main) (avc1 / 0x31637661), none(progressive), 480x360, 657 kb/s, SAR 1:1 DAR 4:3, 29.97 fps, 29.97 tbr, 30k tbn (default)
    Metadata:
      creation_time   : 2020-01-22T10:27:50.000000Z
      vendor_id       : TELE
      encoder         : AVC
  Stream #0:1[0x2](und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)
    Metadata:
      creation_time   : 2020-01-22T10:27:50.000000Z
      vendor_id       : [0][0][0][0]
keighrim commented 7 months ago

Another data point:

$ ffmpeg -i /mnt/llc/llc_data/clams/wgbh/NewsHour/WN0015/cpb-aacip-507-154dn40c26.mp4
...
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/mnt/llc/llc_data/clams/wgbh/NewsHour/WN0015/cpb-aacip-507-154dn40c26.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf56.4.101
  Duration: 00:59:25.22, start: 0.000000, bitrate: 830 kb/s
  Stream #0:0[0x1](und): Video: h264 (High) (avc1 / 0x31637661), none(progressive), 480x360, 694 kb/s, 29.97 fps, 29.97 tbr, 60k tbn (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
  Stream #0:1[0x2](und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]

Then, running this video through SWT @ develop with all default params:

2024-02-10 04:27:52 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Processing video d1 at /mnt/llc/llc_data/clams/wgbh/NewsHour/WN0015/cpb-aacip-507-154dn40c26.mp4
2024-02-10 04:27:52 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 processing /mnt/llc/llc_data/clams/wgbh/NewsHour/WN0015/cpb-aacip-507-154dn40c26.mp4...
2024-02-10 04:31:37 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Featurizing time: 166.46 seconds

2024-02-10 04:31:37 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Classifier time: 1.53 seconds

2024-02-10 04:31:37 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Extract time: 1.76 seconds

2024-02-10 04:31:37 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Seeking time: 37.90 seconds

2024-02-10 04:31:37 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Processing took 224.91390013694763 seconds
2024-02-10 04:31:37 werkzeug INFO     140602549245504 127.0.0.1 - - [10/Feb/2024 04:31:37] "POST /?pretty HTTP/1.1" 200 -
keighrim commented 7 months ago

So for NH videos, it takes up to 5 minutes to process the whole video (that's 12x real-time). Does the Peabody video have something special? Does the stopAt parameter do something wonky?

marcverhagen commented 7 months ago

Interesting. Here is another data point.

On my 9-year-old iMac (no usable GPU), I get the following on a 10-second video:

total time: 11.32s, seek time: 0.59s

So that would be less than a second of seek time for 10 seconds of video, versus the 6 seconds you see for a 12-second video.

And for what it's worth, at work on my iMac Pro, total processing time for a one-hour video is about 15 minutes.

Does stopAt parameter do something wonky?

Maybe, but all it should be doing is breaking out of a loop:

https://github.com/clamsproject/app-swt-detection/blob/e399e92e6fce805bb3e41bdc47b47f5f40738569/modeling/classify.py#L97-L100

keighrim commented 7 months ago

Another data point from a fresh experiment.

2024-02-29 19:02:25,869 - INFO - Initiating classifier with /home/krim/Projects/clams/clamsproject/app-swt-detection/modeling/models/20240212-131937.convnext_tiny.kfold_000.pt
2024-02-29 19:02:26,655 - INFO - Classifier initiation took 0.7860872745513916 seconds
2024-02-29 19:02:26,656 - INFO - Minimum time frame score: 0.5
2024-02-29 19:02:26,657 - INFO - Processing video d1 at /home/krim/cpb-aacip-259-c53f255d.mp4
2024-02-29 19:04:06,317 - INFO - Featurizing time: 35.16 seconds

2024-02-29 19:04:06,328 - INFO - Classifier time: 1.20 seconds

2024-02-29 19:04:06,329 - INFO - Extract time: 0.87 seconds

2024-02-29 19:04:06,330 - INFO - Seeking time: 53.57 seconds

2024-02-29 19:04:06,331 - INFO - Processing took 99.6482183933258 seconds
2024-02-29 19:04:07,520 - INFO - Initiating classifier with /home/krim/Projects/clams/clamsproject/app-swt-detection/modeling/models/20240212-132306.convnext_lg.kfold_000.pt
2024-02-29 19:04:11,093 - INFO - Classifier initiation took 3.5734410285949707 seconds
2024-02-29 19:04:11,097 - INFO - Minimum time frame score: 0.5
2024-02-29 19:04:11,098 - INFO - Processing video d1 at /home/krim/cpb-aacip-259-c53f255d.mp4
2024-02-29 19:06:43,625 - INFO - Featurizing time: 87.94 seconds

2024-02-29 19:06:43,626 - INFO - Classifier time: 1.26 seconds

2024-02-29 19:06:43,627 - INFO - Extract time: 0.87 seconds

2024-02-29 19:06:43,628 - INFO - Seeking time: 53.59 seconds

2024-02-29 19:06:43,629 - INFO - Processing took 152.51135778427124 seconds

Ran on a desktop GPU machine. For this file (cpb-aacip-259-c53f255d), seeking time is still a significant portion of the runtime with both the tiny and lg backbones. In terms of CNN application time, the tiny and lg models took 35 and 88 seconds, respectively.

I will keep experimenting with other files (NewsHour and other videos @owencking shared earlier this week).

keighrim commented 7 months ago

Ran some additional experiments over the weekend; here are the results:

https://gist.github.com/keighrim/fac746647af02c039c1dbcca2a27db6d

Some findings from simply eyeballing the numbers:

  1. Seeking time varied a lot across the 8 "new" videos, while on NH videos seeking consistently took roughly 35-40 seconds per 60-minute video.
  2. Feature extraction time grows roughly linearly with the total number of frames.
  3. Image extraction and classification make up a very small portion of the total time.
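Finding 2 suggests a rough linear cost model for predicting run times: total featurizing time ≈ (number of sampled frames) × (per-frame cost of the backbone). A sketch, where the 1000 ms sample rate and the ~47 ms/frame figure (backed out from the 166.46 s NewsHour run above) are assumptions on my part, and `estimated_featurizing_seconds` is a hypothetical helper:

```python
def estimated_featurizing_seconds(duration_ms: int, sample_rate_ms: int,
                                  per_frame_s: float) -> float:
    """Linear model: featurizing cost scales with the number of sampled frames."""
    n_frames = duration_ms // sample_rate_ms
    return n_frames * per_frame_s

# e.g. for the 00:59:25 NewsHour video, assuming 1000 ms sampling:
# 166.46 s / 3565 frames ≈ 0.047 s per frame
```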
keighrim commented 6 months ago

With the new implementation of frame image extraction in #86 (although I haven't accurately measured the timing over a number of videos yet), the seeking + extraction time for the 2 videos I was experimenting with is significantly reduced (50 s → 22 s, 54 s → 24 s).
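For reference (I haven't verified this is what #86 actually does), one way to avoid per-frame seeks entirely is to decode the stream sequentially: grab() every frame, which decodes it without the image-conversion cost, and retrieve() only the sampled ones. A sketch of the sampling logic, with the cv2 calls shown in comments; `sampled_frame_indices` is a hypothetical helper:

```python
def sampled_frame_indices(fps: float, sample_rate_ms: int,
                          start_at_ms: int, stop_at_ms: int) -> list:
    """Frame indices to keep when sampling one frame every sample_rate_ms."""
    indices = []
    ms = start_at_ms
    while ms < stop_at_ms:
        indices.append(round(ms * fps / 1000))
        ms += sample_rate_ms
    return indices

# Sequential extraction would then look roughly like:
#   keep = set(sampled_frame_indices(fps, sample_rate, start_at, stop_at))
#   for i in itertools.count():
#       if not vidcap.grab():        # decode-only, no colorspace conversion
#           break
#       if i in keep:
#           _, image = vidcap.retrieve()
```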