clamsproject / mmif

MultiMedia Interchange Format
Apache License 2.0
5 stars 1 forks source link

revisit time-related vocabulary #192

Open keighrim opened 1 year ago

keighrim commented 1 year ago

Currently, we allow seconds, milliseconds, and frames as possible values of timeUnit metadata property. We might want to re-visit this metadata property so that

  1. we can add more unit types such as ISO8601 or other timecode formats.
  2. we can be more specific about the possible values of timeUnit in the JSON map (by enumerating all possible values explicitly in the vocabulary, instead of having a generic string type
keighrim commented 1 year ago

Another issue with the current usage of timeUnit raised by @owencking over s slack chat.


Owen King 15:57 Sometimes you use frames. Other times it's milliseconds. How do you choose?

Keigh Rim 15:57 arbitrarily. up to developer’s decision.

(...)

Owen King 16:01 Suppose I want to ensure that we can return to and reproduce that exact frame repeatedly. Does it matter whether we identify it by frame number or millisecond?

Keigh Rim 16:06 Well, as long as the video file is the same, I don’t believe the math will be different.

(...)

Owen King 16:10 Well, I was wondering if different software might do the math differently. Frame number should be unambiguous. But maybe for millisecond some will choose the frame before and some will chose the frame after?

(...)

Owen King 17:08 I can tell you what was on my mind. Over at GBH we're talking about whether it's better to save image files of successfully identified frames of interest, or just to keep the timestamps for the relevant frames (as in MMIF annotations). I was trying to figure out if there were downsides to not keeping any image files and always assuming we can extract the relevant stills on an as-needed basis. Just trying to think through the pros and cons of the two approaches.


To me, the primary sources of the problem here is that

  1. Currently, MMIF doesn't specify nor validate the "number"-types values in discussion. For example, even when a dev say the unit for the TimeFrame is seconds, it could still have any fraction below the decimal point. For reference, ISO standard doesn't specify the limit below the DP, but python implementation will use microsecond (not milli). Provided the worst case, where the developer rounded-up or rounded-down all seconds to ones and made them all natural (not "rounding", but "ceiling" or "flooring"), we should expect up to 29 frame difference between actually frame number and the second number converted back to a frame number.

  2. Different software actually do the math differently. In addition to that computers are really bad at doing math the fractional numbers precisely (to be fair, they are really, really, really good at doing it very quickly). For reference, for "29.97" NTSC-based videos, ffmpeg will tell you the FPS is 29.97, while opencv pushes to to the limit of the double FP and gives you 29.97002997002997. This difference will result in up to 13 frames different given a ~4-hr long 29.97-fps video (or 240 60 30 total frames), or up to 4 frames difference in 1-hr videos.

$ ffmpeg -i cpb-aacip-77-074tnfhr.mp4 2>&1 | grep -i fps
  Stream #0:0[0x1](und): Video: h264 (Main) (avc1 / 0x31637661), none(progressive), 480x360, 659 kb/s, SAR 1:1 DAR 4:3, 29.97 fps, 29.97 tbr, 30k tbn (default)
$ python -c 'import cv2; print(cv2.VideoCapture("cpb-aacip-77-074tnfhr.mp4").get(cv2.CAP_PROP_FPS))' 2>/dev/null
# (...)
29.97002997002997
 $ python -c 'a=29.97; b=29.97002997002997; print(set([abs(int(i * a) - int(i * b)) for i in range(240*60*30) if int(i * a) != int(i * b)]))'
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}

The second point is less of a problem, and we can probably live with 3-4 frame differences for 1-hr videos it. We can also somewhat mitigate this by forcing rounded-to-hundredths fps (consistently 29.97) in the SDK code (which I already did last week in the develop branch), hoping the devs are lazy enough not to implement their own fps calculation. The real problem is the first one, where the developer can choose to use any number below the DP, and the MMIF spec doesn't validate it. As Owen pointed out, frame numbers are unambiguous and concise. But we can't force the usage of frame nums because timeUnit is not just for videos, but also for audios (and alignment of the two, as in forced aligner apps). A few options I can think of to mitigate this;

  1. drop seconds and always use milliseconds or microseconds as integers. This will reduce inaccuracy coming from FP calculation, as well as the conversion. (with 29.97 fps, duration of 1 frame is ~3337 microseconds. This is a huge tolerance compared to what we have now with usage of seconds.)
  2. use frame num as "the" standard way of encoding times and adopt some SMPTE standard timecode format for audio documents, if there's actually one and only standard way of doing this conversion.

In any case, I strong suggest we should drop using seconds either because of its coarse (or even crude) granularity or because of its unrestricted decimal power.

@kelleyl , @snewman-aa you guys have plenty experience with video-related CLAMS apps, and I'd like to hear your thoughts on this.

snewman-aa commented 1 year ago

I see no reason not to drop seconds and I'm not even sure why we would need to continue supporting milliseconds either. I think I might prefer the second option of only using frame num to encode times. A lot of apps use models that ultimately have frame num output that we then convert to milliseconds in the MMIF output only to then convert them back to frames for the model in the next app.

I think this unit conversion may have been the number one source of bugs in my apps so far, so I definitely wouldn't mind removing it all together, though the conversion utils are still convenient for development.

keighrim commented 1 year ago

Today I Learned: frame count numbers are not unambiguous. There is nb_frames, and then nb_read_frames in the MPEG world, and opencv either uses a possibly inaccurate nb_frames value or just estimates using fps * duration math. https://stackoverflow.com/questions/31472155

The second proposal works better for videos, but I don't think there's a universal way to convert frame nums to audio-compatible timecode in reliable way.

keighrim commented 1 year ago

The more I'm thinking about this, the more I'm leaning toward using milli- or microsecond as the only timeunit, and make all the time-related values typed as int. A few advantages;

  1. all start and end values in annotation.property are then single-typed (charter offsets and time points), and we can atually validate them in the json schema (although, this means we will have a "breaking" new version).
  2. second (full or fractional) is "universal" time unit for human, so human readers can use commonsense to read the values of time-related annotation, instead of the math with arbitrary FPS and frame numbers.
  3. most timecodes look like HH:MM:SS.fff, which is not a number, and hence will add some burden to converter code pieces to parse some string values into numbers before going into the math (but I admit that timecodes might be the most intuitive notation for human readers).
  4. I don't think timecodes can be obtained from audiodocuments alone, because lack of the concept of "frames" in the audio stream.
keighrim commented 1 year ago

A (small) problem with using microseconds as the only timeunit: 1 microsecond is a millionth of a second, meaning 1 hour is 3,600,000,000 microsecond. Since the maximum "simple" signed int on 32-bit systems is 2^31 - 1 = 2,147,483,648, on these systems simple integer data type only works only up to ~35 min long media (~70 min with unsigned int).

This is indeed a very "small" problem considering;

  1. Python, the primary programming language we use in CLAMS project, doesn't have a "simple" int type and all int data can be arbitrarily long integers.
  2. Seriously, who uses a 32-bit computer to run DNN-based ML/AI software (such as CLAMS)?
keighrim commented 1 year ago

We (@jarumihooi @owencking and me) talked about this issue in yesterday's meeting, and we (in general) agreed to use either integer milli or micro-seconds as THE standard time unit to appear in all MMIF output of future versions of CLAMS apps. I just wonder if that decision would cause any trouble in terms of GBH's downstream processing workflow (pbcore conversion, relational db operations, etc), so summoning @mrharpo @afred @foo4thought .

keighrim commented 1 year ago

Another "feature" that can be helpful for human readers of MMIF is having altStart and altEnd fields along with start/end to hold secondary, more human-readable format. Having alternative anchors could also benefit other anchors types, for example, holding unicode codepoint counts for text data. Discussion around alt-anchors would probably desire its own issue.

jarumihooi commented 1 year ago

At the moment, it seems we've reached a consensus of going forward with all gold (and mmif??) datasets using the ISO 8601 standard hr:mn:sc.msc with a DOT.

I forgot to add the reasoning about audio not having frames to the aapb-annotations/readme.md but otherwise I think this could be closed until further business needs require revisiting this or more precision digits.

jarumihooi commented 1 year ago

Note, while a decision was made, raws and golds currently do not follow the convention yet, as that requires changing both tools and dataset for compatibility.

keighrim commented 1 year ago

I'll keep this open until we

For data in the aapb-annotations repo, we should probably open new issues to keep track of required updates and updating progress of process.py and golds files.

jarumihooi commented 1 year ago

Currently do all the outputs from apps to mmif record this in the ISO Time Format? I believe that is the higher concern on the issue as it relates to older aapb-annotations repo projects.

Into the future, if its an in-house anno tool, it should use this format where possible. if an out-of-house annotation tool or it is not possible, then process.pys should convert the time format when converting to gold. (Is it worth it to convert the raw as well before it is stored in the aapb-annotations repo? Simply to reduce uniqueness? Apps should then all expect ISO Time format and output that way too.

As for the previous projects, reconverting the time format is likely a lot of work and may cause troubles with apps or currently working code to fail. How do we consider if this is worth doing or not?

keighrim commented 1 year ago

Ideally, all the "currently" working evaluation code should pull the gold data by permalinks. So unless we re-write aabp-annotations git history entirely, they should work in the future, regardless how we change the format of existing (old) project files.

That said, obviously none of current clams apps uses ISO format for time notation, since officially, only valid time units are seconds, ms, frame numbers (https://mmif.clams.ai/vocabulary/Region/v1/), and the specification hasn't been updated based on the above discussion.

What I'm not very sure about implementing this is the altStart and altEnd fields. Since start and end fields in MMIF have a very special meaning and they are used not just for time-based annotations, but all sorts of other annotation types, allowing alt fields with another special meaning only for time-related annotation types just doesn't make a lot of sense to me. And giving different special meanings to other start and end (e.g., character offsets for text-based annotations) will require quite long discussion that we can't afford now.

Another aspect of the problem we need to consider is that MMIF is NOT INTENDED FOR HUMAN READERS. For human readers, we have mmif-viz. Hence, if we need some "human-friendly feature" such as ISO time notation, probably mmif-viz is the right place to implement it, not within mmit-python SDK.

So going back to the other questions,

Apps should then all expect ISO Time format and output that way too.

I don't think so. App should all expect integer-valued start and end in milliseconds in the future, and should output integers are the same granularity. But they can output the ISO way along with the integers. Apps can use altStart and altEnd to output ISO format, but that must be entirely optional. Thus, apps must not expect or rely on ISO format especially not in start/end fields.

if its an in-house anno tool, it should use this format where possible. ... process.pys should convert the time format when converting to gold.

Totally agree that any future in-house annotation environment (for human annotators) must use ISO time format. Not quite sure about the second part, though. Again, gold files are mostly for machine consumption. We can add ISO format as human-friendly columns next to machine-friendly start/end columns in the "gold" format. And writing that conversion (just once) in the process.py shouldn't be that hard. On the other hand, omitting machine-friendly columns and having all the downstream software that want to use the gold data to convert ISO strings to actual numbers on their own seems almost sinful decision to me.

Is it worth it to convert the raw as well before it is stored in the aapb-annotations repo?

I guess this will completely depends on how the annotation tool, and "upload" steps are implemented. If you're referring existing raw files from older project, I wouldn't bother re-formatting them.

As for the previous projects, reconverting the time format is likely a lot of work and may cause troubles with apps or currently working code to fail. How do we consider if this is worth doing or not?

Now I'm not sure in this part you were referring to "raw" or "gold" files of older project. However, it will be worth re-formatting "gold" files because we will probably re-using them in the future over and over (old versions of "gold" files are still accessible via git history and github permalinks, so no worry about losing anything). So having those files up-to-date with what we decide should be worth the effort, as long as the gold data can be used for any purpose (evaluation, mostly).

keighrim commented 10 months ago

Just for future reference, all "gold" files for older annotation project (2022-23 season) are now updated to use the ISO time format via https://github.com/clamsproject/aapb-annotations/pull/77.