JD-P / simulacra-aesthetic-captions

Dataset of prompts, synthetic AI generated images, and aesthetic ratings.
395 stars, 18 forks

Invalid characters in paths causing missing files and extraction issues. #1

Open kjerk opened 2 years ago

kjerk commented 2 years ago

Hey, thanks for all the work on this dataset. The effort is greatly appreciated.

Problem

I've been seeing a problem with the tar archives in v1 of the dataset that trips up most applications I've tried, and even plain old tar itself. The result is that files go missing when extracted, or even when you mount the archive, depending on the tar library doing the parsing. I wanted to point this out in case anyone else winds up in the same boat I'm in.

Cause?

I'm guessing that the tar archives were created programmatically by software that didn't have file-path worries on its mind, so some forbidden characters and raw whitespace in the prompt text were copied into the filenames, which makes applications angry. I discovered this when trying to recombine the archives into other formats, e.g. zstd or Brotli archives, and running into a bunch of collisions and odd paths.

Examples from sac-000000.tar

Escaped Backslash:
home/jdp/simulacra-aesthetic-captions/32003_melancholic_cuboid_chained_Luminism_Angel_by_Peter_MohrBacher_WLOP_Alphonse_Mucha_J_C_Leyendecker_Ruan_Jia_and_Beksinsk_Featured_on_ArtStation\\_2.png

Newline:
home/jdp/simulacra-aesthetic-captions/6052_A_lamp_lighting_up_a_room_#lighting_#electronics\n_#artstation_#purple_#pastel_#color_#watercolor_2.png

Newline:
home/jdp/simulacra-aesthetic-captions/15711_Their_bodies_are_almost_always_hidden_by_several_layers_of_cloaks_and_by_metallic_helmets_with_visors_made_of_glass,_\n_detailed_digital_art_by_Greg_Rutkowski_and_Erik_Bulatov,_cgsociety_trending_on_artstation_8.png

Misc Quotations:
home/jdp/simulacra-aesthetic-captions/30132_The_man_said,_\n"Dude,_I_don't_know_that_I_like_those_eyes.""they're_still_looking_at_us._they're_good_eyes.""what_makes_you_say_that?,"_the_man_replied."oh,_it's_just_a_feeling."_3.png

The end result is that many applications, even 7z or tar itself, will try to fail gracefully and wind up misnaming files, so the files no longer match the paths in the SQLite database, or get extracted to unexpected places.


In the newest version of 7-Zip, 'invalid' characters are replaced with underscores, but this again means that the file's extracted path will not be the same as in the SQLite db, so the file is effectively 'missing' if you are trying to programmatically match a path to a record to retrieve the image's score.

Digging:

> tar --list -f sac-000000.tar | grep -E '\\n|\\\\|""'
home/jdp/simulacra-aesthetic-captions/6052_A_lamp_lighting_up_a_room_#lighting_#electronics\n_#artstation_#purple_#pastel_#color_#watercolor_2.png
home/jdp/simulacra-aesthetic-captions/9292_In_film_gray\nTry_to_change\nThe_frame_rates_intermixing\nIn_your_chronic_carbon_system_7.png
home/jdp/simulacra-aesthetic-captions/5761_At_the_school_playground_at_your_friends_#friends_#embarrassingmoment\n_#artsy_#anime_#cartoon_2.png
home/jdp/simulacra-aesthetic-captions/32003_melancholic_cuboid_chained_Luminism_Angel_by_Peter_MohrBacher_WLOP_Alphonse_Mucha_J_C_Leyendecker_Ruan_Jia_and_Beksinsk_Featured_on_ArtStation\\_2.png
home/jdp/simulacra-aesthetic-captions/15711_Their_bodies_are_almost_always_hidden_by_several_layers_of_cloaks_and_by_metallic_helmets_with_visors_made_of_glass,_\n_detailed_digital_art_by_Greg_Rutkowski_and_Erik_Bulatov,_cgsociety_trending_on_artstation_8.png
home/jdp/simulacra-aesthetic-captions/30132_The_man_said,_\n"Dude,_I_don't_know_that_I_like_those_eyes.""they're_still_looking_at_us._they're_good_eyes.""what_makes_you_say_that?,"_the_man_replied."oh,_it's_just_a_feeling."_3.png
kjerk commented 2 years ago

Capitalization mismatches:

sac-000000.tar ->
home/jdp/simulacra-aesthetic-captions/14135_"There's_something_wrong_in_this_village."_-_color_manga_illustration_by_junji_ito._high_quality_7.png

--------
SELECT t.* FROM paths t WHERE path like '14135%' LIMIT 500
108393,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_1.png"
108394,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_2.png"
108395,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_3.png"
108396,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_4.png"
108397,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_5.png"
108398,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_6.png"
108399,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_7.png"
108400,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_8.png"
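
One possible workaround for the capitalization mismatch is to compare names case-insensitively. The sketch below is only an illustration: `build_casefold_index` and `match_archive_name` are hypothetical helpers, it assumes the archive prefix `home/jdp/simulacra-aesthetic-captions/` has already been stripped, and it tracks collisions because two DB paths could in principle differ only by case.

```python
def build_casefold_index(db_paths):
    """Map casefolded path -> original DB path; collect keys that collide."""
    index, collisions = {}, set()
    for path in db_paths:
        key = path.casefold()
        if key in index and index[key] != path:
            collisions.add(key)
        index[key] = path
    return index, collisions

def match_archive_name(archive_name, index):
    """Return the DB path matching the archive name ignoring case, else None."""
    return index.get(archive_name.casefold())

# The name as stored in the paths table (capital 'Junji_Ito') ...
db_paths = [
    '14135_"There\'s_something_wrong_in_this_village."_-_color_manga_'
    'illustration_by_Junji_Ito._high_quality_7.png',
]
# ... and the name as stored in sac-000000.tar (lowercase 'junji_ito').
archive_name = (
    '14135_"There\'s_something_wrong_in_this_village."_-_color_manga_'
    'illustration_by_junji_ito._high_quality_7.png'
)

index, collisions = build_casefold_index(db_paths)
matched = match_archive_name(archive_name, index)
print(matched == db_paths[0], collisions)
```

Any key that ends up in `collisions` would need to be resolved by hand rather than trusted.
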
kjerk commented 2 years ago

Okay there may be a much more catastrophic problem here, probably worth another issue. This came up doing a full parse of the entries to files.

SELECT count(*) FROM images
LEFT JOIN paths ON paths.iid = images.id
LEFT JOIN ratings ON ratings.iid = images.id
WHERE ratings.rating IS NULL

count(*) = 92520

Or around 40% of the dataset has no reconcilable ratings in the data source.
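
One caveat with the query above: `count(*)` counts joined rows, not images, so an image with more than one row in `paths` would be counted more than once. Whether that ever happens in the real database is not something I have checked. The sketch below uses a toy in-memory database, with only the table and column names from the query and everything else invented, to show the difference between counting rows and counting unrated images.

```python
import sqlite3

con = sqlite3.connect(':memory:')
# Toy schema: only the columns used by the query above; the data is made up.
con.executescript("""
    CREATE TABLE images  (id INTEGER PRIMARY KEY);
    CREATE TABLE paths   (iid INTEGER, path TEXT);
    CREATE TABLE ratings (iid INTEGER, rating INTEGER);
    INSERT INTO images  VALUES (1), (2), (3);
    INSERT INTO paths   VALUES (1, 'a.png'), (2, 'b.png'), (2, 'b_dup.png'), (3, 'c.png');
    INSERT INTO ratings VALUES (1, 7), (1, 8);
""")

# The query from this comment: counts joined rows where no rating matched.
row_count = con.execute("""
    SELECT count(*) FROM images
    LEFT JOIN paths ON paths.iid = images.id
    LEFT JOIN ratings ON ratings.iid = images.id
    WHERE ratings.rating IS NULL
""").fetchone()[0]

# Counts each image at most once, regardless of how many paths it has.
unrated_images = con.execute("""
    SELECT count(*) FROM images
    WHERE NOT EXISTS (
        SELECT 1 FROM ratings
        WHERE ratings.iid = images.id AND ratings.rating IS NOT NULL
    )
""").fetchone()[0]

print(row_count, unrated_images)
```

If `paths` has exactly one row per image, the two counts agree and the figure above stands as is.
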

henry501 commented 2 years ago

This also causes issues when extracting to an exFAT-formatted drive, due to the invalid characters in the filenames. Reference from Wikipedia (screenshot not preserved).
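
As a rough illustration of what exFAT objects to: it rejects control characters (U+0000 to U+001F) and the characters `" * / : < > ? \ |` in file names. A minimal sketch of a sanitizer follows; `sanitize_filename` is a hypothetical helper, it is meant for a single file name rather than a full path (since `/` is replaced too), and any renaming like this breaks the match with the `paths` table unless you keep your own mapping.

```python
import re

# Control characters U+0000-U+001F plus " * / : < > ? \ |
_INVALID = re.compile(r'[\x00-\x1f"*/:<>?\\|]')

def sanitize_filename(name, replacement='_'):
    """Replace characters that exFAT rejects in a single file name."""
    return _INVALID.sub(replacement, name)

original = '6052_A_lamp_#lighting_#electronics\n_#artstation_2.png'
cleaned = sanitize_filename(original)

# Keep the mapping so the original name can still be looked up in the db.
rename_map = {cleaned: original}
print(cleaned)
```
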

kjerk commented 2 years ago

Just to follow up on this: I wrote a Python script to process the tar files manually and stream-extract them into a clean directory structure, keeping the IDs in place and stripping everything else. If anyone else lands here and still wants to access the files, here's a solution. You only need to go to the bottom of the script and set the input path to the folder where you have the .tar archives, and the output path to the directory where you want the renamed PNGs extracted. Directories will be created automatically:

import pathlib
import shutil
import tarfile

def reprocess_archives(path_input_base, path_output_base):
    archive_paths = sorted(path_input_base.glob('*.tar'))

    for input_tar_path in archive_paths:
        path_output_dir = path_output_base / input_tar_path.stem

        # Create the output directory (and any missing parents) if needed
        path_output_dir.mkdir(parents=True, exist_ok=True)

        with tarfile.open(name=input_tar_path, mode='r', bufsize=10240) as tf:
            print(f'Extracting {input_tar_path}...')

            file_ct = 0

            # Iterate the archive directly: with a manual tf.next() loop, a bare
            # `continue` on a non-file entry never advances and loops forever
            for entry in tf:  # type: tarfile.TarInfo
                if not entry.isfile():
                    continue

                # Replace prefix and leave only leading 'gid' and 'index' | sac-000000/123_1.png
                new_name = entry.name.replace('home/jdp/simulacra-aesthetic-captions/', '')
                name_parts = new_name.split('_')
                new_name = f'{name_parts[0]}_{name_parts[-1]}'

                path_output_file = path_output_dir / new_name

                # Override output filepath/name: copy the member's bytes to the new
                # path using the public API instead of the private _extract_member
                with tf.extractfile(entry) as src, open(path_output_file, 'wb') as dst:
                    shutil.copyfileobj(src, dst)

                file_ct += 1
            print(f'  Extracted {file_ct} files.')

if __name__ == '__main__':
    path_input_base = pathlib.Path(r'/path/to/simulacra-aesthetic-captions')
    path_output_base = pathlib.Path(r'/path/to/simulacra-aesthetic-captions-output')

    reprocess_archives(path_input_base, path_output_base)

This creates a structure like the one below, where the output files are named for their gid and index. These can still be looked up in the images table in the SQLite db to get the image ID (iid), and from there the rating, if there is one.

├───sac-000000
│       32_7.png
│       4_5.png
│       8_7.png
│
└───sac-000001
        1681_5.png
        3215_6.png
        941_3.png
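
The lookup from an extracted file name back to a rating might look like the sketch below. Treat the schema as an assumption: the `gid` and `idx` column names on the images table are my guess for illustration, the demo uses a made-up in-memory database rather than the real one, and `parse_extracted_name` and `lookup_ratings` are hypothetical helpers.

```python
import sqlite3

def parse_extracted_name(filename):
    """Split a name like '123_1.png' into (gid, index) integers."""
    stem = filename.rsplit('.', 1)[0]
    gid, idx = stem.split('_')
    return int(gid), int(idx)

def lookup_ratings(con, filename):
    """Return (iid, [ratings]) for an extracted file, or None if no image matches."""
    gid, idx = parse_extracted_name(filename)
    # Assumed columns: images(id, gid, idx)
    row = con.execute(
        'SELECT id FROM images WHERE gid = ? AND idx = ?', (gid, idx)
    ).fetchone()
    if row is None:
        return None
    iid = row[0]
    ratings = [r[0] for r in con.execute(
        'SELECT rating FROM ratings WHERE iid = ? ORDER BY rating', (iid,)
    )]
    return iid, ratings

# Made-up demo data; the real database would be opened with sqlite3.connect(path).
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE images  (id INTEGER PRIMARY KEY, gid INTEGER, idx INTEGER);
    CREATE TABLE ratings (iid INTEGER, rating INTEGER);
    INSERT INTO images  VALUES (10, 32, 7), (11, 4, 5);
    INSERT INTO ratings VALUES (10, 6), (10, 8);
""")

print(lookup_ratings(con, '32_7.png'))
print(lookup_ratings(con, '4_5.png'))
```

An empty ratings list is a normal result here, since not every image in the dataset has been rated.
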
JD-P commented 2 years ago

> Okay there may be a much more catastrophic problem here, probably worth another issue. This came up doing a full parse of the entries to files.
>
> SELECT count(*) FROM images
> LEFT JOIN paths ON paths.iid = images.id
> LEFT JOIN ratings ON ratings.iid = images.id
> WHERE ratings.rating IS NULL
>
> count(*) = 92520
>
> Or around 40% of the dataset has no reconcilable ratings in the data source.

This is intentional: the images are public domain and have utility outside of being rated. You can check the exact export process in https://github.com/JD-P/simulacrabot/blob/imagen/export_dataset.py. In short, every gen that was not flagged is in the dataset, and the bot doesn't force you to rate, so most, but not all, images are rated.

kjerk commented 2 years ago

>> Okay there may be a much more catastrophic problem here, probably worth another issue. This came up doing a full parse of the entries to files.
>>
>> SELECT count(*) FROM images
>> LEFT JOIN paths ON paths.iid = images.id
>> LEFT JOIN ratings ON ratings.iid = images.id
>> WHERE ratings.rating IS NULL
>>
>> count(*) = 92520
>>
>> Or around 40% of the dataset has no reconcilable ratings in the data source.
>
> This is intentional, the images are public domain and have utility outside of being rated. You can check the exact export process in https://github.com/JD-P/simulacrabot/blob/imagen/export_dataset.py. In short every gen that was not flagged is in the dataset and the bot doesn't force you to rate, so only most images are rated.

Okay, it's good to hear that it was just my assumption that was wrong: I had assumed every image included in the dataset had a rating.