For the dataset format itself, consider hosting it in the WebDataset or Parquet format, as those are the current industry standards for large-scale image datasets in deep learning.
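As a rough sketch of what the WebDataset route could look like (assuming the plane PNGs already exist on disk; the shard pattern, filenames, and frame numbers here are just placeholders), per-frame plane files and metadata could be packed into numbered .tar shards:

```python
# Minimal sketch: packing per-frame plane PNGs into WebDataset shards.
# The shard pattern, filenames, and frame numbers are hypothetical.
import json
import webdataset as wds

frames = [3, 42, 155]  # example frame numbers

with wds.ShardWriter("anime-yuv-%06d.tar", maxcount=1000) as sink:
    for n in frames:
        key = f"frame_{n:06d}"
        sample = {"__key__": key}
        for plane in ("y", "u", "v"):
            with open(f"{key}_{plane}.png", "rb") as f:
                sample[f"{plane}.png"] = f.read()  # raw PNG bytes, stored as-is
        sample["json"] = json.dumps({"frame_number": n}).encode()
        sink.write(sample)
```

Reading the shards back with `wds.WebDataset(...)` then streams samples grouped by key, which plays nicely with standard PyTorch dataloaders.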
For the dataset GT issues, something worth looking at is creating a manually curated "test" dataset containing plenty of both good and bad frames, and using it to fine-tune a ResNet or similar classifier that decides which frames to keep and which to discard. Even just having a test set where we ourselves know which frames are good or bad is useful, whether for training some kind of model or for guiding volunteers sorting through all the data.
To this end, I think the Magia Record and Assault Lily Blu-rays are perfect: Magia Record contains a lot of starved grain, aliasing, etc., and Assault Lily has heavy banding. We don't want to be too restrictive, however; for example, we shouldn't throw out a large amount of data just because there's some slight banding(-like structure), so whatever we use would have to be tweaked carefully.
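If we go the classifier route, a minimal sketch of a keep/discard fine-tune could look like the following (assuming a recent torchvision; the directory layout, batch size, and learning rate are placeholders):

```python
# Minimal sketch of a binary keep/discard frame classifier, assuming frames are
# pre-sorted into "keep" and "discard" folders; all paths and hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
data = datasets.ImageFolder("labelled_frames/", transform=tfm)  # keep/ and discard/ subdirs
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: keep, discard

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:  # single pass shown; loop over epochs in practice
    optim.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optim.step()
```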
Just a few thoughts based on what was already discussed elsewhere:
We export as RGB. This means that we MUST produce a YUV 4:4:4 image to convert to RGB somehow.
Ideally, I think the best option is to simply distribute the untouched YCbCr planes in their original resolutions. The concern that this might be a bit more difficult to use is reasonable depending on which format is used to distribute the images, but simple conversion scripts can be written to extract luma or to upscale chroma and convert to RGB.
Distributing RGB directly is a bad idea because it involves a few lossy steps in the process of generating the RGB images, and this is obviously detrimental to most tasks this dataset would be useful for.
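For illustration, here's a minimal sketch of the kind of conversion script mentioned above, assuming 8-bit limited-range BT.709 sources with 4:2:0 subsampling; the filenames and the bilinear chroma upscale are placeholders:

```python
# Minimal sketch of reconstructing an RGB image from separately stored Y/U/V planes.
# Assumes 8-bit limited-range BT.709 with 4:2:0 subsampling; filenames are hypothetical.
import numpy as np
from PIL import Image

y = np.asarray(Image.open("frame_000003_y.png"), dtype=np.float32)
u = np.asarray(Image.open("frame_000003_u.png"), dtype=np.float32)
v = np.asarray(Image.open("frame_000003_v.png"), dtype=np.float32)

# Upscale chroma to luma resolution (bilinear here purely for illustration).
h, w = y.shape
u = np.asarray(Image.fromarray(u).resize((w, h), Image.BILINEAR), dtype=np.float32)
v = np.asarray(Image.fromarray(v).resize((w, h), Image.BILINEAR), dtype=np.float32)

# Limited-range BT.709 YCbCr -> full-range RGB
yf = (y - 16.0) * (255.0 / 219.0)
cb = (u - 128.0) * (255.0 / 224.0)
cr = (v - 128.0) * (255.0 / 224.0)
r = yf + 1.5748 * cr
g = yf - 0.1873 * cb - 0.4681 * cr
b = yf + 1.8556 * cb
rgb = np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)
Image.fromarray(rgb).save("frame_000003_rgb.png")
```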
As for the distribution format itself, I'm personally fine with just about anything, but I think the idea of exporting the NumPy arrays directly is pretty convenient. I've played with it for a while, and there are a few things worth considering (a rough export sketch follows after these points):
We use lanczos to upscale the chroma to the luma resolution. Lanczos, while generally the most preferable convolution-based scaler for upscaling video during playback, is not well suited to preparing data for training video restoration models. In an ideal scenario, we want to avoid upsampling entirely if at all possible.
Using lanczos to upscale chroma would be fine if this dataset were targeting tasks like classification or segmentation, but for things like denoising or super-resolution it's a bad idea. My suggestion is to skip upscaling chroma entirely.
Our dataset will wind up fairly large, which isn't necessarily a bad thing by itself, but it does mean we may end up with statistically "unnecessary" or useless frames.
While automated processes involving no-reference image quality metrics can be leveraged to help us pick "good" frames, ultimately I think we'll probably need some human filter to discard bad frames by hand. More data is usually better from a training perspective, but it's also worth keeping in mind that most standard SISR training datasets aren't particularly huge. DIV2K for example only has 800 images.
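For reference, a rough sketch of the NumPy-export idea (assuming VapourSynth API4; the source filter, path, and frame number are hypothetical), with each plane kept at its native resolution:

```python
# Minimal sketch of exporting untouched Y/U/V planes from a VapourSynth clip as
# NumPy arrays, without upscaling chroma; the source filter and paths are placeholders.
import numpy as np
import vapoursynth as vs

core = vs.core
clip = core.lsmas.LWLibavSource("source.m2ts")  # hypothetical source filter/path

frame = clip.get_frame(3)
planes = {name: np.asarray(frame[i]) for i, name in enumerate("YUV")}

# Each plane keeps its native resolution, e.g. 1920x1080 Y and 960x540 U/V for 4:2:0.
np.savez_compressed("frame_000003.npz", **planes)
```

Loading it back is then just `np.load("frame_000003.npz")["Y"]` and so on.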
> Ideally, I think the best option is to simply distribute the untouched YCbCr planes in their original resolutions. The concern that this might be a bit more difficult to use is reasonable depending on which format is used to distribute the images, but simple conversion scripts can be written to extract luma or to upscale chroma and convert to RGB.
This will probably be the direction I go in for the time being, with the only potential caveat being that VapourSynth tooling (which is necessary in many instances to fix defects such as lowpassing) will always output individual planes as GRAY clips. This shouldn't be a huge deal, but it may cause some trouble in specific setups I'm not familiar with.
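For reference, a minimal sketch of how the plane splitting looks in VapourSynth (the source filter and path are hypothetical):

```python
# Minimal sketch of splitting a YUV clip into per-plane GRAY clips in VapourSynth;
# the source filter and path are hypothetical.
import vapoursynth as vs

core = vs.core
clip = core.lsmas.LWLibavSource("episode01.m2ts")  # hypothetical source

# std.SplitPlanes returns one GRAY clip per plane, at each plane's native resolution.
y, u, v = core.std.SplitPlanes(clip)

# Equivalent single-plane extraction with ShufflePlanes:
y_alt = core.std.ShufflePlanes(clip, planes=0, colorfamily=vs.GRAY)
```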
I'm also considering including a JSON per entry that stores the following information:
{
    "show_name": "Random anime title",
    "color_space": "BT.709",
    "chroma_subsampling": 420,
    "frames": [
        {
            "frame_number": 3,
            "planes": {
                "Y": "frame_000003_y.png",
                "U": "frame_000003_u.png",
                "V": "frame_000003_v.png"
            }
        },
        {
            "frame_number": 42,
            "planes": {
                "Y": "frame_000042_y.png",
                "U": "frame_000042_u.png",
                "V": "frame_000042_v.png"
            }
        },
        {
            "frame_number": 155,
            "planes": {
                "Y": "frame_000155_y.png",
                "U": "frame_000155_u.png",
                "V": "frame_000155_v.png"
            }
        },
        ...
    ]
}
The main impetus is to keep frames grouped together without also having a gazillion different directories that are annoying to navigate. The user would also be able to easily iterate over this JSON to reconstruct full frames if necessary. I'm still not 100% sold on including one, though, since the user can just as easily iterate over the entire directory themselves and reconstruct frames from the filenames, which makes a JSON somewhat redundant no matter how easy it would be to generate.
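For what it's worth, iterating over such a JSON would be trivial; a minimal sketch (the entry directory and metadata filename are hypothetical):

```python
# Minimal sketch of iterating over the per-entry JSON to locate each frame's planes;
# reconstruction into RGB would follow the conversion script sketched earlier.
import json
from pathlib import Path

entry_dir = Path("random_anime_title")                     # hypothetical dataset entry
meta = json.loads((entry_dir / "entry.json").read_text())  # hypothetical filename

for frame in meta["frames"]:
    y_path = entry_dir / frame["planes"]["Y"]
    u_path = entry_dir / frame["planes"]["U"]
    v_path = entry_dir / frame["planes"]["V"]
    print(frame["frame_number"], y_path, u_path, v_path)
```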
Either way, this addresses one of the problems mentioned in this issue. I'll see if I can write an export script that outputs planes individually in the coming days, and think up some solutions for the remaining problems.
The problems
Currently, every image for the dataset is output using the method described here. However, this export method comes with various issues:
All of these issues contribute to a non-ideal dataset. While it will still likely be better than other, similar datasets (seeing as, theoretically, you should be able to descale the lanczos upscale), we cannot expect most users to actually take that extra step or understand how to go about it. This also means we must expose the dataset on an FTP server or seedbox of some sort, and/or create a torrent for distribution.
Potential solutions?
This is a difficult problem to solve.
Output format and method
The current format is RGBS. While this is standard and likely what most model-training tools expect, it neglects the issues that come with dealing with video:
To deal with these, we must find solutions that:
The latter is likely impossible to work around with how anime productions work in the current day. Even if a streaming service or authoring company is willing to provide the source files to us for the purpose of creating a dataset, the masters they receive will already be degraded in some fashion.
EDIT: A solution proposed in this Discord message is to export each plane separately. This might be the most workable option, at the cost of being more difficult for users to make full use of from the get-go. One idea is to have one directory containing the "full" RGB images for ease of use, and a separate directory with every original plane split out?
Data in the dataset
This can be split up into two key points: how the dataset is distributed, and what data actually ends up in it (i.e. determining good ground truth).
Distribution is the easiest problem to solve. Currently, my plan is to allow direct access to a directory on one of my seedboxes where I provide the entire dataset. The data in the repository is only a small, bite-sized chunk of it, as GitHub has upload and storage limits. Cloning a repository with potentially thousands of images also isn't pleasant and will bloat it. Alongside providing DDLs, I'm also going to provide torrents on the Releases page that are updated periodically with new data.
The bigger problem is determining good ground truth data. This involves both defining what "good" means (as this may vary depending on the context of the model being trained) and determining how to programmatically select eligible frames that match whatever definition we settle on. For the time being, I consider "good sources" to be those adhering to the following:
Some of these issues, such as heavy post-processing, are typically global: the entire source suffers from them, so such sources can easily be excluded wholesale, as we already do at this time. Others, however, such as motion blur, occur on a per-scene level.
Digging through all the collected data and manually verifying every single image requires a lot of extra man-hours. Ideally, we want some kind of automated approach that avoids or otherwise removes images that do not meet the above standards. To accomplish this, I can think of a number of solutions:
Implementing these and other solutions comes with a number of constraints, primarily in terms of speed and ease of implementation (since we're currently tied to VapourSynth).
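Purely as an illustration of the kind of automated pre-filter this could involve (nothing here is implemented; the threshold and filename are placeholders that would need tuning against known-good data), a crude sharpness check could discard obviously blurred frames:

```python
# Illustrative sketch of one possible automated pre-filter: discard frames whose
# luma plane looks too blurry, using variance of the Laplacian as a crude sharpness
# measure. The threshold and filename are placeholders.
import cv2

def is_sharp_enough(path: str, threshold: float = 100.0) -> bool:
    luma = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(luma, cv2.CV_64F).var() >= threshold

print(is_sharp_enough("frame_000003_y.png"))
```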
So what now?
I've been busy with other projects and work, so I haven't had much time to continue this dataset, but I do want to get back to it. Before I can do that, however, I want to figure out some good ways to resolve the aforementioned issues. This issue exists to ask for community help, opinions, and feedback, and to figure out a road forward. I also hope it gives other dataset collectors a good idea of the constraints that come with creating a proper dataset, as well as a list of things to watch out for.
Any ideas are welcome, but please keep in mind that ease of use for the end user trumps all. I want to largely stick to providing just images in a way most users can make use of, and while I'm not strictly against packaging additional tools to help the end user, I'll need some convincing.