For the dataset format itself, consider hosting it in the WebDataset or Parquet format, as those are the current industry standards for large-scale image datasets in deep learning.
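As a rough sketch of what the WebDataset route could look like (assuming the plane PNGs already exist on disk; the shard pattern, filenames, and frame numbers here are just placeholders), per-frame plane files and metadata could be packed into numbered .tar shards:

```python
# Minimal sketch: packing per-frame plane PNGs into WebDataset shards.
# The shard pattern, filenames, and frame numbers are hypothetical.
import json
import webdataset as wds

frames = [3, 42, 155]  # example frame numbers

with wds.ShardWriter("anime-yuv-%06d.tar", maxcount=1000) as sink:
    for n in frames:
        key = f"frame_{n:06d}"
        sample = {"__key__": key}
        for plane in ("y", "u", "v"):
            with open(f"{key}_{plane}.png", "rb") as f:
                sample[f"{plane}.png"] = f.read()  # raw PNG bytes, stored as-is
        sample["json"] = json.dumps({"frame_number": n}).encode()
        sink.write(sample)
```

Reading the shards back with `wds.WebDataset(...)` then streams samples grouped by key, which plays nicely with standard PyTorch dataloaders.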
For the dataset GT issues, something worth looking at is creating a manually curated "test" dataset containing plenty of both good and bad frames, and using it to fine-tune a ResNet or similar classifier that decides which frames to keep and which to discard. Even just having a test set where we ourselves know which frames are good or bad is useful, whether for training some kind of model or for guiding volunteers sorting through all the data.
To this end, I think the Magia Record and Assault Lily Blu-rays are perfect: Magia Record contains a lot of starved grain, aliasing, etc., and Assault Lily has heavy banding. We don't want to be too restrictive, however; for example, we shouldn't throw out a large amount of data just because there's some slight banding(-like structure), so whatever we use would have to be tweaked carefully.
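If we go the classifier route, a minimal sketch of a keep/discard fine-tune could look like the following (assuming a recent torchvision; the directory layout, batch size, and learning rate are placeholders):

```python
# Minimal sketch of a binary keep/discard frame classifier, assuming frames are
# pre-sorted into "keep" and "discard" folders; all paths and hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
data = datasets.ImageFolder("labelled_frames/", transform=tfm)  # keep/ and discard/ subdirs
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: keep, discard

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:  # single pass shown; loop over epochs in practice
    optim.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optim.step()
```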
Just a few thoughts based on what was already discussed elsewhere:
We export as RGB. This means that we MUST produce a YUV 4:4:4 image to convert to RGB somehow.
Ideally, I think the best option is to simply distribute the untouched YCbCr planes in their original resolutions. The concern that this might be a bit more difficult to use is reasonable depending on which format is used to distribute the images, but simple conversion scripts can be written to extract luma or to upscale chroma and convert to RGB.
Distributing RGB directly is a bad idea because it involves a few lossy steps in the process of generating the RGB images, and this is obviously detrimental to most tasks this dataset would be useful for.
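For illustration, here's a minimal sketch of the kind of conversion script mentioned above, assuming 8-bit limited-range BT.709 sources with 4:2:0 subsampling; the filenames and the bilinear chroma upscale are placeholders:

```python
# Minimal sketch of reconstructing an RGB image from separately stored Y/U/V planes.
# Assumes 8-bit limited-range BT.709 with 4:2:0 subsampling; filenames are hypothetical.
import numpy as np
from PIL import Image

y = np.asarray(Image.open("frame_000003_y.png"), dtype=np.float32)
u = np.asarray(Image.open("frame_000003_u.png"), dtype=np.float32)
v = np.asarray(Image.open("frame_000003_v.png"), dtype=np.float32)

# Upscale chroma to luma resolution (bilinear here purely for illustration).
h, w = y.shape
u = np.asarray(Image.fromarray(u).resize((w, h), Image.BILINEAR), dtype=np.float32)
v = np.asarray(Image.fromarray(v).resize((w, h), Image.BILINEAR), dtype=np.float32)

# Limited-range BT.709 YCbCr -> full-range RGB
yf = (y - 16.0) * (255.0 / 219.0)
cb = (u - 128.0) * (255.0 / 224.0)
cr = (v - 128.0) * (255.0 / 224.0)
r = yf + 1.5748 * cr
g = yf - 0.1873 * cb - 0.4681 * cr
b = yf + 1.8556 * cb
rgb = np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)
Image.fromarray(rgb).save("frame_000003_rgb.png")
```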
As for the distribution format itself, I'm personally fine with just about anything, but I think the idea of exporting the NumPy arrays directly is pretty convenient. I've played with it for a while, and there are a few things worth considering (a rough export sketch follows after these points):
We use lanczos to upscale the chroma to the luma resolution. Lanczos, while generally the most preferable convolution-based scaler for upscaling video during playback, is not well suited to preparing data for training video restoration models. In an ideal scenario, we want to avoid upsampling entirely if at all possible.
Using lanczos to upscale chroma would be fine if this dataset were targeting tasks like classification or segmentation, but for things like denoising or super-resolution it's a bad idea. My suggestion is to skip upscaling chroma entirely.
Our dataset will wind up fairly large, which isn't necessarily a bad thing by itself, but it does mean we may end up with statistically "unnecessary" or useless frames.
While automated processes involving no-reference image quality metrics can be leveraged to help us pick "good" frames, ultimately I think we'll probably need some human filter to discard bad frames by hand. More data is usually better from a training perspective, but it's also worth keeping in mind that most standard SISR training datasets aren't particularly huge. DIV2K for example only has 800 images.
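For reference, a rough sketch of the NumPy-export idea (assuming VapourSynth API4; the source filter, path, and frame number are hypothetical), with each plane kept at its native resolution:

```python
# Minimal sketch of exporting untouched Y/U/V planes from a VapourSynth clip as
# NumPy arrays, without upscaling chroma; the source filter and paths are placeholders.
import numpy as np
import vapoursynth as vs

core = vs.core
clip = core.lsmas.LWLibavSource("source.m2ts")  # hypothetical source filter/path

frame = clip.get_frame(3)
planes = {name: np.asarray(frame[i]) for i, name in enumerate("YUV")}

# Each plane keeps its native resolution, e.g. 1920x1080 Y and 960x540 U/V for 4:2:0.
np.savez_compressed("frame_000003.npz", **planes)
```

Loading it back is then just `np.load("frame_000003.npz")["Y"]` and so on.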
> Ideally, I think the best option is to simply distribute the untouched YCbCr planes in their original resolutions. The concern that this might be a bit more difficult to use is reasonable depending on which format is used to distribute the images, but simple conversion scripts can be written to extract luma or to upscale chroma and convert to RGB.
This will probably be the direction I go in for the time being, with the only potential caveat being that VapourSynth tooling (which is necessary in many instances to fix defects such as lowpassing) will always output individual planes as GRAY clips. This shouldn't be a huge deal, but it may cause some trouble in specific setups I'm not familiar with.
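For reference, a minimal sketch of how the plane splitting looks in VapourSynth (the source filter and path are hypothetical):

```python
# Minimal sketch of splitting a YUV clip into per-plane GRAY clips in VapourSynth;
# the source filter and path are hypothetical.
import vapoursynth as vs

core = vs.core
clip = core.lsmas.LWLibavSource("episode01.m2ts")  # hypothetical source

# std.SplitPlanes returns one GRAY clip per plane, at each plane's native resolution.
y, u, v = core.std.SplitPlanes(clip)

# Equivalent single-plane extraction with ShufflePlanes:
y_alt = core.std.ShufflePlanes(clip, planes=0, colorfamily=vs.GRAY)
```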
I'm also considering including a JSON per entry that stores the following information:
{
    "show_name": "Random anime title",
    "color_space": "BT.709",
    "chroma_subsampling": 420,
    "frames": [
        {
            "frame_number": 3,
            "planes": {
                "Y": "frame_000003_y.png",
                "U": "frame_000003_u.png",
                "V": "frame_000003_v.png"
            }
        },
        {
            "frame_number": 42,
            "planes": {
                "Y": "frame_000042_y.png",
                "U": "frame_000042_u.png",
                "V": "frame_000042_v.png"
            }
        },
        {
            "frame_number": 155,
            "planes": {
                "Y": "frame_000155_y.png",
                "U": "frame_000155_u.png",
                "V": "frame_000155_v.png"
            }
        },
        ...
    ]
}
The main impetus is to keep frames grouped together without also having a gazillion different directories that are annoying to navigate. The user would also be able to easily iterate over this JSON to reconstruct full frames if necessary. I'm still not 100% sold on including one, though, since the user can just as easily iterate over the entire directory themselves and reconstruct frames from the filenames, which makes a JSON somewhat redundant no matter how easy it would be to generate.
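For what it's worth, iterating over such a JSON would be trivial; a minimal sketch (the entry directory and metadata filename are hypothetical):

```python
# Minimal sketch of iterating over the per-entry JSON to locate each frame's planes;
# reconstruction into RGB would follow the conversion script sketched earlier.
import json
from pathlib import Path

entry_dir = Path("random_anime_title")                     # hypothetical dataset entry
meta = json.loads((entry_dir / "entry.json").read_text())  # hypothetical filename

for frame in meta["frames"]:
    y_path = entry_dir / frame["planes"]["Y"]
    u_path = entry_dir / frame["planes"]["U"]
    v_path = entry_dir / frame["planes"]["V"]
    print(frame["frame_number"], y_path, u_path, v_path)
```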
Either way, this addresses one of the problems mentioned in this issue. I'll see if I can write an export script that outputs planes individually in the coming days, and think up some solutions for the remaining problems.
The problems
Currently, every image for the dataset is output using the method described here. However, this export method comes with various issues:
All of these issues contribute to a non-ideal dataset. While it will still likely be better than other, similar datasets (seeing as, theoretically, you should be able to descale the lanczos upscale), we cannot expect most users to actually take that extra step or understand how to go about it. This also means we must expose the dataset on an FTP server or seedbox of some sort, and/or create a torrent for distribution.
Potential solutions?
This is a difficult problem to solve.
Output format and method
The current format is RGBS. While this is standard and likely what most model-training tools expect, it neglects the issues that come with dealing with video:
To deal with these, we must find solutions that:
The latter is likely impossible to work around with how anime productions work in the current day. Even if a streaming service or authoring company is willing to provide the source files to us for the purpose of creating a dataset, the masters they receive will already be degraded in some fashion.
EDIT: A solution proposed in this Discord message is to export each plane separately. This might be the most workable option, at the cost of being more difficult for users to make full use of from the get-go. One idea is to have one directory containing the "full" RGB images for ease of use, and a separate directory with every original plane split out?
Data in the dataset
This can be split up into two key points: how the dataset is distributed, and what data actually ends up in it (i.e. determining good ground truth).
Distribution is the easiest problem to solve. Currently, my plan is to allow direct access to a directory on one of my seedboxes where I provide the entire dataset. The data in the repository is only a small, bite-sized chunk of it, as GitHub has upload and storage limits. Cloning a repository with potentially thousands of images also isn't pleasant and will bloat it. Alongside providing DDLs, I'm also going to provide torrents on the Releases page that are updated periodically with new data.
The bigger problem is determining good ground truth data. This involves both defining what "good" means (as this may vary depending on the context of the model being trained) and determining how to programmatically select eligible frames that match whatever definition we settle on. For the time being, I consider "good sources" to be those adhering to the following:
Some of these issues, such as heavy post-processing, are typically global: the entire source suffers from them, so such sources can easily be excluded wholesale, as we already do at this time. Others, however, such as motion blur, occur on a per-scene level.
Digging through all the collected data and manually verifying every single image requires a lot of extra man-hours. Ideally, we want some kind of automated approach that avoids or otherwise removes images that do not meet the above standards. To accomplish this, I can think of a number of solutions:
Implementing these and other solutions comes with a number of constraints, primarily in terms of speed and ease of implementation (since we're currently tied to VapourSynth).
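Purely as an illustration of the kind of automated pre-filter this could involve (nothing here is implemented; the threshold and filename are placeholders that would need tuning against known-good data), a crude sharpness check could discard obviously blurred frames:

```python
# Illustrative sketch of one possible automated pre-filter: discard frames whose
# luma plane looks too blurry, using variance of the Laplacian as a crude sharpness
# measure. The threshold and filename are placeholders.
import cv2

def is_sharp_enough(path: str, threshold: float = 100.0) -> bool:
    luma = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(luma, cv2.CV_64F).var() >= threshold

print(is_sharp_enough("frame_000003_y.png"))
```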
So what now?
I've been busy with other projects and work, so I haven't had much time to continue this dataset, but I do want to get back to it. Before I can do that, however, I want to figure out some good ways to resolve the aforementioned issues. This issue exists to ask for community help, opinions, and feedback, and to figure out a road forward. I also hope it gives other dataset collectors a good idea of the constraints that come with creating a proper dataset, as well as a list of things to watch out for.
Any ideas are welcome, but please keep in mind that ease of use for the end user trumps all. I want to largely stick to providing just images in a way most users can make use of, and while I'm not strictly against packaging additional tools to help the end user, I'll need some convincing.