fire-eggs / Danbooru2021

Python scripts and tools for working with the Danbooru2022 data set. Note: this is a sqlite database and a viewer, not directly related to machine learning.
https://www.gwern.net/Danbooru2021
MIT License

Error when trying to create the db with `make_db.py` in the database2021 subdirectory #60

Open Asmedeus998 opened 1 month ago

Asmedeus998 commented 1 month ago

Hello,

I am interested in your project and want to recreate the metadata database.

To do that, I downloaded the json from GCP: https://console.cloud.google.com/storage/browser/danbooru_public/data?project=danbooru1

But I get the following error after replacing all of the /mnt/D2/metadata/{filename.json} paths.

Here is the error:

json_line["file_ext"],json_line["file_size"],json_line["md5"],int(json_line["has_children"]),
KeyError: 'md5'

I changed these lines:

import_tags("/mnt/D2/metadata/tags000000000000.json", conn)
import_artists("/mnt/D2/metadata/artists000000000000.json", conn)
import_notes("/mnt/D2/metadata/notes000000000000.json", conn)
import_pools("/mnt/D2/metadata/pools000000000000.json", conn)

to:

import_tags("~/danbooru_public/tags.json", conn)
import_notes("~/danbooru_public/notes.json", conn)
import_pools("~/danbooru_public/pools.json", conn)

Since I don't have your previous data, I can't investigate the error myself.

I hope you can help me out. Thanks.

Asmedeus998 commented 1 month ago

I also get an error when using the database Python file makedb.py. Can you tell me which json files you used?

fire-eggs commented 1 month ago

Thank you for your interest!

The last time I downloaded data from BigQuery, the posts json no longer included the md5 value. So any reference to that field needs to be removed. Rather than change the database schema, I used a constant value.

So, for example, in this section in make_db.py, the following lines:

    c.execute("INSERT OR IGNORE INTO images VALUES (?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?)",
              (pId,json_line["rating"],json_line["source"],pixiv_id,
              json_line["image_width"],json_line["image_height"],json_line["created_at"],json_line["updated_at"],
              json_line["uploader_id"],int(json_line["is_banned"]),int(json_line["is_deleted"]),int(json_line["is_flagged"]),
              json_line["file_ext"],json_line["file_size"],json_line["md5"],int(json_line["has_children"]),
              int(json_line["has_visible_children"]),int(json_line["has_active_children"]),parent_id
              ))

are changed to:

    c.execute("INSERT OR IGNORE INTO images VALUES (?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?)",
              (pId,json_line["rating"],json_line["source"],pixiv_id,
              json_line["image_width"],json_line["image_height"],json_line["created_at"],json_line["updated_at"],
              json_line["uploader_id"],int(json_line["is_banned"]),int(json_line["is_deleted"]),int(json_line["is_flagged"]),
              json_line["file_ext"],json_line["file_size"],"0",int(json_line["has_children"]),
              int(json_line["has_visible_children"]),int(json_line["has_active_children"]),parent_id
              ))

Note how on the 5th line, json_line["md5"] is replaced by "0", i.e. all rows in the database will have their md5 column value set to zero.

That is the only place where a reference to md5 needs to be changed. I can't find any similar issues for other tables (e.g. tags, artists, etc) but the same process would apply if necessary.
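If the field might be present in some records and absent in others, a fallback lookup keeps the real value where it exists instead of zeroing every row. A minimal sketch, assuming each record has already been parsed into a dict the way `json_line` is in make_db.py; the helper name `md5_or_placeholder` is hypothetical and not part of the repository:

```python
def md5_or_placeholder(json_line, placeholder="0"):
    """Return the record's md5 if the key exists, else a constant placeholder."""
    # dict.get returns the default instead of raising KeyError
    return json_line.get("md5", placeholder)


# Two illustrative records: one with the field, one without it
with_md5 = {"id": "80440", "md5": "1920b4bb0e0a29c83300d86f0f006322"}
without_md5 = {"id": "80440"}

print(md5_or_placeholder(with_md5))
print(md5_or_placeholder(without_md5))
```

In the INSERT above, `json_line["md5"]` could then be swapped for `md5_or_placeholder(json_line)` rather than the literal "0".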

fire-eggs commented 1 month ago

Here is an example post entry from the 2021 metadata [image id 80440], formatted:

{
   "tag_string_general":"1girl areolae barefoot black_hair breasts breasts_apart clitoris excessive_pubic_hair female_masturbation female_pubic_hair full_body green_headband half-closed_eyes headband large_breasts masturbation medium_breasts narrow_waist navel nipples nude object_insertion pubic_hair pussy short_hair signature solo toenails toon_(style) uncensored vaginal vaginal_object_insertion watermark zoom_layer",
   "has_visible_children":false,
   "has_large":false,
   "tag_count_meta":"0",
   "bit_flags":"0",
   "has_active_children":false,
   "preview_file_url":"https://cdn.donmai.us/preview/19/20/1920b4bb0e0a29c83300d86f0f006322.jpg",
   "image_width":"496",
   "updated_at":"2021-12-11 20:07:38.553 UTC",
   "tag_string_artist":"sirkowski",
   "tag_count":"37",
   "is_status_locked":false,
   "is_pending":false,
   "file_size":"124950",
   "tag_count_character":"1",
   "tag_count_artist":"1",
   "large_file_url":"https://cdn.donmai.us/original/19/20/1920b4bb0e0a29c83300d86f0f006322.jpg",
   "tag_count_general":"34",
   "fav_count":"0",
   "has_children":false,
   "tag_count_copyright":"1",
   "is_deleted":true,
   "md5":"1920b4bb0e0a29c83300d86f0f006322",
   "down_score":"0",
   "is_flagged":false,
   "is_note_locked":false,
   "source":"",
   "score":"0",
   "tag_string":"1girl areolae barefoot beyond_good_and_evil black_hair breasts breasts_apart clitoris excessive_pubic_hair female_masturbation female_pubic_hair full_body green_headband half-closed_eyes headband jade_(beyond_good_and_evil) large_breasts masturbation medium_breasts narrow_waist navel nipples nude object_insertion pubic_hair pussy short_hair signature sirkowski solo toenails toon_(style) uncensored vaginal vaginal_object_insertion watermark zoom_layer",
   "tag_string_meta":"",
   "is_rating_locked":false,
   "created_at":"2006-10-24 20:30:49 UTC",
   "rating":"e",
   "tag_string_copyright":"beyond_good_and_evil",
   "id":"80440",
   "image_height":"818",
   "file_url":"https://cdn.donmai.us/original/19/20/1920b4bb0e0a29c83300d86f0f006322.jpg",
   "tag_string_character":"jade_(beyond_good_and_evil)",
   "is_banned":false,
   "uploader_id":"1",
   "up_score":"0",
   "file_ext":"jpg"
}

fire-eggs commented 1 month ago

I see that gwern has taken the Danbooru 2021 metadata off-line. That is unfortunate ...

Asmedeus998 commented 1 month ago

Thanks, your fix really helped me out.

Can you tell me how you determined that md5 is no longer available?

I found the md5 string in both the GCP post.json file and the Danbooru API, so I don't know when they stopped including it.

If you are using BigQuery to collect the Danbooru data, I am interested in learning how.

fire-eggs commented 1 month ago

> Thanks, your fix really helped me out. Can you tell me how you determined that md5 is no longer available? I found the md5 string in both the GCP post.json file and the Danbooru API, so I don't know when they stopped including it. If you are using BigQuery to collect the Danbooru data, I am interested in learning how.

OK, I'm starting to remember the details now. I did not download the post.json file from GCP as you linked; I manually downloaded a partial query, and inadvertently dropped the md5 column. As a result, I got the KeyError: 'md5' message, and assumed a similar scenario was the cause for your situation.

So I've not seen any of the metadata from GCP. You could help me get on the "same page" if you could upload a single json record from posts.json, as I did from the 2021 metadata.

A Python KeyError means the requested key doesn't exist in the dictionary, i.e. the record has no field with that name. In my case, there was no md5 value. Note that field names are case sensitive: if the md5 field exists in the GCP data but under a different name, the KeyError would still occur.
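One way to see which fields a record lacks is to compare its keys against the ones the script reads. This is only a sketch: it assumes the file holds one JSON record per line, the field list is copied from the INSERT statement quoted earlier in this thread, and `missing_fields` is a hypothetical helper, not something in the repository:

```python
import json

# Fields the quoted INSERT statement reads from each post record
EXPECTED_FIELDS = [
    "rating", "source", "image_width", "image_height",
    "created_at", "updated_at", "uploader_id",
    "is_banned", "is_deleted", "is_flagged",
    "file_ext", "file_size", "md5",
    "has_children", "has_visible_children", "has_active_children",
]


def missing_fields(line, expected=EXPECTED_FIELDS):
    """Parse one JSON record and list the expected fields it does not contain."""
    record = json.loads(line)
    # Key lookup is case sensitive, so "MD5" does not satisfy "md5"
    return [name for name in expected if name not in record]


# Illustrative record whose md5 field has a differently cased name
sample = '{"rating": "e", "source": "", "MD5": "abc"}'
print(missing_fields(sample))
```

To check real data, pass the first line of the downloaded posts file to `missing_fields`; any name it reports will raise a KeyError in the import.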

Asmedeus998 commented 1 month ago

Is /Danbooru2021/database/makedb.py still usable, even though I only have posts.json and tags.json in the metadata folder?

I keep getting a KeyError.

fire-eggs commented 1 month ago

> I keep getting a KeyError.

Without examples of json records from posts.json and tags.json, it's difficult for me to diagnose the problem. I don't have paid access to BigQuery so I can't download them myself. If you'd post sample records, I'll take a look.

fire-eggs commented 1 month ago

It occurs to me that a record in the json might be incomplete, which would cause a KeyError to occur on that record. To skip records which are incomplete, add exception handling like so:

    try:
        # existing code, indented
        c.execute("INSERT OR IGNORE INTO images VALUES (?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?)",
            ...
        buildImageTags( ... ) # existing 5 calls to buildImageTags
    except KeyError as err:
        # skip the incomplete record, but report which field was missing
        print("skipping incomplete record, missing field:", err)

If the KeyError occurs on every record rather than a few, you'll need to fix the underlying cause, otherwise no records will be loaded.
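To tell a few incomplete records apart from a KeyError that hits every record, the skipped records can be collected and counted. The following is a runnable sketch of that idea, not the repository's code: the three-column `images` table is a simplified stand-in for the real schema, and `load_posts` is a hypothetical function name:

```python
import json
import sqlite3


def load_posts(lines, conn):
    """Insert complete records; return (line number, missing field) for the rest."""
    c = conn.cursor()
    skipped = []
    for line_no, line in enumerate(lines, start=1):
        json_line = json.loads(line)
        try:
            c.execute("INSERT OR IGNORE INTO images VALUES (?,?,?)",
                      (int(json_line["id"]), json_line["rating"], json_line["md5"]))
        except KeyError as err:
            # err.args[0] is the name of the key that was not found
            skipped.append((line_no, err.args[0]))
    conn.commit()
    return skipped


# In-memory database with a simplified table (assumption, not the real schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, rating TEXT, md5 TEXT)")

lines = [
    '{"id": "1", "rating": "s", "md5": "aa"}',
    '{"id": "2", "rating": "q"}',
]
skipped = load_posts(lines, conn)
print(len(skipped), "of", len(lines), "records skipped:", skipped)
```

If the number of skipped records equals the number of records read, the problem is a field name mismatch rather than a handful of incomplete records.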

fire-eggs commented 1 month ago

> Is /Danbooru2021/database/makedb.py still usable, even though I only have posts.json and tags.json in the metadata folder? I keep getting a KeyError.

Just in case of a communication disconnect ... The version of makedb.py on GitHub is for the 2021 metadata, and I'm not making any changes to that version. So continuing to use that version unmodified will continue to cause a KeyError.

To address your issue with the latest GCP metadata, you need to modify your local copy of makedb.py as necessary and run that copy.