clamsproject / app-swt-detection

CLAMS app for detecting scenes with text from video input
Apache License 2.0

classifier is very slow #69

Open keighrim opened 7 months ago

keighrim commented 7 months ago

Bug Description

The Classifier class has a process_video method that iterates over the video frame by frame (at a sample rate), extracting the image, computing feature vectors, and running classification. We found that the method is very slow, even on GPU. I did some manual profiling, and I believe we should work on improving the video seeking code.

Reproduction steps

With

$ clams source video:/some.mp4 | curl -d@- "localhost:23231?stopAt=120000"  # use first 12 secs only

logging says:

2024-02-08 12:22:25 http://apps.clams.ai/swt-detection/unresolvable INFO     139987732137536 Processing video d1                                                               
2024-02-08 12:22:38 http://apps.clams.ai/swt-detection/unresolvable INFO     139987732137536 Processing took 12.783082723617554 seconds

Hmm, that's real-time. This was measured on a GPU machine, and GPU utilization was confirmed.

So I wrote some quick and simple profiling:

    def process_video(self, mp4_file: str) -> list:
        """Loops over the frames in a video and for each frame extracts the features
        and applies the classifier. Returns a list of predictions, where each prediction
        is an instance of numpy.ndarray."""
        featurizing_time = 0
        classifier_time = 0
        extract_time = 0
        seek_time = 0
        if self.debug:
            print(f'Processing {mp4_file}...')
            print(f'Labels: {self.prebin_labels}')
        logging.info(f'processing {mp4_file}...')
        predictions = []
        vidcap = cv2.VideoCapture(mp4_file)
        if not vidcap.isOpened():
            raise IOError(f'Could not open {mp4_file}')
        fps = round(vidcap.get(cv2.CAP_PROP_FPS), 2)
        fc = vidcap.get(cv2.CAP_PROP_FRAME_COUNT)
        dur = round(fc / fps, 3) * 1000
        for ms in range(0, sys.maxsize, self.sample_rate):
            if ms < self.start_at:
                continue
            if ms > self.stop_at:
                break
            t = time.time()
            vidcap.set(cv2.CAP_PROP_POS_MSEC, ms)
            seek_time += time.time() - t
            t = time.time()
            success, image = vidcap.read()
            extract_time += time.time() - t
            if not success:
                break
            img = Image.fromarray(image[:,:,::-1])
            t = time.time()
            features = self.featurizer.get_full_feature_vectors(img, ms, dur)
            featurizing_time += time.time() - t
            t = time.time()
            prediction = self.classifier(features).detach()
            prediction = Prediction(ms, self.prebin_labels, prediction)
            classifier_time += time.time() - t
            if self.debug:
                print(prediction)
            predictions.append(prediction)
        sys.stderr.write(f'Featurizing time: {featurizing_time:.2f} seconds\n')
        sys.stderr.write(f'Classifier time: {classifier_time:.2f} seconds\n')
        sys.stderr.write(f'Extract time: {extract_time:.2f} seconds\n')
        sys.stderr.write(f'Seeking time: {seek_time:.2f} seconds\n')
        return predictions

And with the same curl call, I got

2024-02-10 02:38:13 http://apps.clams.ai/swt-detection/unresolvable INFO     139892458784320 Processing video d1
Featurizing time: 6.11 seconds
Classifier time: 0.05 seconds
Extract time: 0.07 seconds
Seeking time: 5.98 seconds
2024-02-10 02:38:26 http://apps.clams.ai/swt-detection/unresolvable INFO     139892458784320 Processing took 12.812922716140747 seconds

So it turns out that "moving the cursor" in the video takes as much time as the complex CNN feature extraction.

The cv2 build I was using was compiled with ffmpeg, and to my knowledge, seeking to a keyframe is a constant-time operation (correct me if I'm wrong, please).

I can only guess that using POS_MSEC instead of the frame number adds a big overhead. Further experiments are probably needed.
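If POS_MSEC is indeed the culprit, one cheap experiment would be to convert each millisecond position to a frame index and seek with CAP_PROP_POS_FRAMES instead. A minimal sketch of the conversion (`ms_to_frame` is a hypothetical helper, not something in the codebase):

```python
def ms_to_frame(ms: int, fps: float) -> int:
    """Map a millisecond position to the nearest frame index."""
    return round(ms * fps / 1000)

# The seek in process_video could then become:
#   vidcap.set(cv2.CAP_PROP_POS_FRAMES, ms_to_frame(ms, fps))
# instead of:
#   vidcap.set(cv2.CAP_PROP_POS_MSEC, ms)
```

Whether this is actually faster depends on the backend; timing both variants on the same file would settle it.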

Expected behavior

No response

Screenshots

No response

Additional context

https://forum.opencv.org/t/cap-prop-pos-frames-is-abnormal-slow/11651/3 (this is possibly related to #67 too)

video file info

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/llc_data/clams/wgbh/Peabody/cpb-aacip-526-4b2x34nn45.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: mp42isom
    creation_time   : 2020-01-22T10:27:50.000000Z
  Duration: 00:33:09.37, start: 0.000000, bitrate: 795 kb/s
  Stream #0:0[0x1](und): Video: h264 (Main) (avc1 / 0x31637661), none(progressive), 480x360, 657 kb/s, SAR 1:1 DAR 4:3, 29.97 fps, 29.97 tbr, 30k tbn (default)
    Metadata:
      creation_time   : 2020-01-22T10:27:50.000000Z
      vendor_id       : TELE
      encoder         : AVC
  Stream #0:1[0x2](und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)
    Metadata:
      creation_time   : 2020-01-22T10:27:50.000000Z
      vendor_id       : [0][0][0][0]
keighrim commented 7 months ago

Another data point:

$ ffmpeg -i /mnt/llc/llc_data/clams/wgbh/NewsHour/WN0015/cpb-aacip-507-154dn40c26.mp4
...
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/mnt/llc/llc_data/clams/wgbh/NewsHour/WN0015/cpb-aacip-507-154dn40c26.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf56.4.101
  Duration: 00:59:25.22, start: 0.000000, bitrate: 830 kb/s
  Stream #0:0[0x1](und): Video: h264 (High) (avc1 / 0x31637661), none(progressive), 480x360, 694 kb/s, 29.97 fps, 29.97 tbr, 60k tbn (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
  Stream #0:1[0x2](und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]

Then, running this video through SWT @ develop with all default params:

2024-02-10 04:27:52 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Processing video d1 at /mnt/llc/llc_data/clams/wgbh/NewsHour/WN0015/cpb-aacip-507-154dn40c26.mp4
2024-02-10 04:27:52 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 processing /mnt/llc/llc_data/clams/wgbh/NewsHour/WN0015/cpb-aacip-507-154dn40c26.mp4...
2024-02-10 04:31:37 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Featurizing time: 166.46 seconds

2024-02-10 04:31:37 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Classifier time: 1.53 seconds

2024-02-10 04:31:37 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Extract time: 1.76 seconds

2024-02-10 04:31:37 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Seeking time: 37.90 seconds

2024-02-10 04:31:37 http://apps.clams.ai/swt-detection/unresolvable INFO     140602549245504 Processing took 224.91390013694763 seconds
2024-02-10 04:31:37 werkzeug INFO     140602549245504 127.0.0.1 - - [10/Feb/2024 04:31:37] "POST /?pretty HTTP/1.1" 200 -
keighrim commented 7 months ago

So for NH videos, it takes up to 5 minutes to process the whole video (that's 12x real-time). Does the Peabody video have something special? Does the stopAt parameter do something wonky?

marcverhagen commented 7 months ago

Interesting. Here is another data point.

On my 9-year-old iMac (no usable GPU), I get the following on a 10-second video:

total time: 11.32s, seek time: 0.59s

So that would be less than a second of seek time for 10 seconds of video, versus the 6 seconds you see for a 12-second video.

And for what it's worth, at work on my iMac Pro, total processing time for a one-hour video is about 15 minutes.

Does stopAt parameter do something wonky?

Maybe, but all it should be doing is breaking out of a loop:

https://github.com/clamsproject/app-swt-detection/blob/e399e92e6fce805bb3e41bdc47b47f5f40738569/modeling/classify.py#L97-L100

keighrim commented 7 months ago

Another data point from a fresh experiment.

2024-02-29 19:02:25,869 - INFO - Initiating classifier with /home/krim/Projects/clams/clamsproject/app-swt-detection/modeling/models/20240212-131937.convnext_tiny.kfold_000.pt
2024-02-29 19:02:26,655 - INFO - Classifier initiation took 0.7860872745513916 seconds
2024-02-29 19:02:26,656 - INFO - Minimum time frame score: 0.5
2024-02-29 19:02:26,657 - INFO - Processing video d1 at /home/krim/cpb-aacip-259-c53f255d.mp4
2024-02-29 19:04:06,317 - INFO - Featurizing time: 35.16 seconds

2024-02-29 19:04:06,328 - INFO - Classifier time: 1.20 seconds

2024-02-29 19:04:06,329 - INFO - Extract time: 0.87 seconds

2024-02-29 19:04:06,330 - INFO - Seeking time: 53.57 seconds

2024-02-29 19:04:06,331 - INFO - Processing took 99.6482183933258 seconds
2024-02-29 19:04:07,520 - INFO - Initiating classifier with /home/krim/Projects/clams/clamsproject/app-swt-detection/modeling/models/20240212-132306.convnext_lg.kfold_000.pt
2024-02-29 19:04:11,093 - INFO - Classifier initiation took 3.5734410285949707 seconds
2024-02-29 19:04:11,097 - INFO - Minimum time frame score: 0.5
2024-02-29 19:04:11,098 - INFO - Processing video d1 at /home/krim/cpb-aacip-259-c53f255d.mp4
2024-02-29 19:06:43,625 - INFO - Featurizing time: 87.94 seconds

2024-02-29 19:06:43,626 - INFO - Classifier time: 1.26 seconds

2024-02-29 19:06:43,627 - INFO - Extract time: 0.87 seconds

2024-02-29 19:06:43,628 - INFO - Seeking time: 53.59 seconds

2024-02-29 19:06:43,629 - INFO - Processing took 152.51135778427124 seconds

Ran on a desktop GPU machine. For this file (cpb-aacip-259-c53f255d), seeking time is still a significant portion of the runtime with both the tiny and lg backbones. In terms of CNN application time, the tiny and lg models took 35 and 88 seconds, respectively.

I will keep experimenting with other files (NewsHour and other videos @owencking shared earlier this week).

keighrim commented 7 months ago

Ran some additional experiments over the weekend; here are the results:

https://gist.github.com/keighrim/fac746647af02c039c1dbcca2a27db6d

Some findings from simply eyeballing the numbers:

  1. Seeking time varied a lot across the 8 "new" videos, while on NH videos seeking consistently took roughly 35-40 seconds per 60-minute video.
  2. Feature extraction time grows roughly linearly with the total number of frames.
  3. Image extraction and classification make up a very small portion of the total time.
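Finding 2 suggests a rough linear cost model for predicting run times: total featurizing time ≈ (number of sampled frames) × (per-frame cost of the backbone). A sketch, where the 1000 ms sample rate and the ~47 ms/frame figure (backed out from the 166.46 s NewsHour run above) are assumptions on my part, and `estimated_featurizing_seconds` is a hypothetical helper:

```python
def estimated_featurizing_seconds(duration_ms: int, sample_rate_ms: int,
                                  per_frame_s: float) -> float:
    """Linear model: featurizing cost scales with the number of sampled frames."""
    n_frames = duration_ms // sample_rate_ms
    return n_frames * per_frame_s

# e.g. for the 00:59:25 NewsHour video, assuming 1000 ms sampling:
# 166.46 s / 3565 frames ≈ 0.047 s per frame
```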
keighrim commented 6 months ago

With the new implementation of frame image extraction in #86 (although I haven't accurately measured the timing over a number of videos yet), the seeking + extraction time for the 2 videos I was experimenting with is significantly reduced (50 s → 22 s, 54 s → 24 s).
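For reference (I haven't verified this is what #86 actually does), one way to avoid per-frame seeks entirely is to decode the stream sequentially: grab() every frame, which decodes it without the image-conversion cost, and retrieve() only the sampled ones. A sketch of the sampling logic, with the cv2 calls shown in comments; `sampled_frame_indices` is a hypothetical helper:

```python
def sampled_frame_indices(fps: float, sample_rate_ms: int,
                          start_at_ms: int, stop_at_ms: int) -> list:
    """Frame indices to keep when sampling one frame every sample_rate_ms."""
    indices = []
    ms = start_at_ms
    while ms < stop_at_ms:
        indices.append(round(ms * fps / 1000))
        ms += sample_rate_ms
    return indices

# Sequential extraction would then look roughly like:
#   keep = set(sampled_frame_indices(fps, sample_rate, start_at, stop_at))
#   for i in itertools.count():
#       if not vidcap.grab():        # decode-only, no colorspace conversion
#           break
#       if i in keep:
#           _, image = vidcap.retrieve()
```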