Open Asmedeus998 opened 1 month ago
I also got an error when using the database Python file makedb.py.
Can you tell me what JSON you used?
Thank you for your interest!
The last time I downloaded data from BigQuery, the posts json no longer included the md5 value. So any reference to that field needs to be removed. Rather than change the database schema, I used a constant value.
So, for example, in this section of make_db.py, the following lines:
c.execute("INSERT OR IGNORE INTO images VALUES (?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?)",
(pId,json_line["rating"],json_line["source"],pixiv_id,
json_line["image_width"],json_line["image_height"],json_line["created_at"],json_line["updated_at"],
json_line["uploader_id"],int(json_line["is_banned"]),int(json_line["is_deleted"]),int(json_line["is_flagged"]),
json_line["file_ext"],json_line["file_size"],json_line["md5"],int(json_line["has_children"]),
int(json_line["has_visible_children"]),int(json_line["has_active_children"]),parent_id
))
are changed to:
c.execute("INSERT OR IGNORE INTO images VALUES (?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?)",
(pId,json_line["rating"],json_line["source"],pixiv_id,
json_line["image_width"],json_line["image_height"],json_line["created_at"],json_line["updated_at"],
json_line["uploader_id"],int(json_line["is_banned"]),int(json_line["is_deleted"]),int(json_line["is_flagged"]),
json_line["file_ext"],json_line["file_size"],"0",int(json_line["has_children"]),
int(json_line["has_visible_children"]),int(json_line["has_active_children"]),parent_id
))
Note how on the 5th line, json_line["md5"] is replaced by "0", i.e. all rows in the database will have their md5 column value set to zero.
That is the only place where a reference to md5 needs to be changed. I can't find any similar issues for other tables (e.g. tags, artists, etc.), but the same process would apply if necessary.
Here is an example post entry from the 2021 metadata [image id 80440], formatted:
{
"tag_string_general":"1girl areolae barefoot black_hair breasts breasts_apart clitoris excessive_pubic_hair female_masturbation female_pubic_hair full_body green_headband half-closed_eyes headband large_breasts masturbation medium_breasts narrow_waist navel nipples nude object_insertion pubic_hair pussy short_hair signature solo toenails toon_(style) uncensored vaginal vaginal_object_insertion watermark zoom_layer",
"has_visible_children":false,
"has_large":false,
"tag_count_meta":"0",
"bit_flags":"0",
"has_active_children":false,
"preview_file_url":"https://cdn.donmai.us/preview/19/20/1920b4bb0e0a29c83300d86f0f006322.jpg",
"image_width":"496",
"updated_at":"2021-12-11 20:07:38.553 UTC",
"tag_string_artist":"sirkowski",
"tag_count":"37",
"is_status_locked":false,
"is_pending":false,
"file_size":"124950",
"tag_count_character":"1",
"tag_count_artist":"1",
"large_file_url":"https://cdn.donmai.us/original/19/20/1920b4bb0e0a29c83300d86f0f006322.jpg",
"tag_count_general":"34",
"fav_count":"0",
"has_children":false,
"tag_count_copyright":"1",
"is_deleted":true,
"md5":"1920b4bb0e0a29c83300d86f0f006322",
"down_score":"0",
"is_flagged":false,
"is_note_locked":false,
"source":"",
"score":"0",
"tag_string":"1girl areolae barefoot beyond_good_and_evil black_hair breasts breasts_apart clitoris excessive_pubic_hair female_masturbation female_pubic_hair full_body green_headband half-closed_eyes headband jade_(beyond_good_and_evil) large_breasts masturbation medium_breasts narrow_waist navel nipples nude object_insertion pubic_hair pussy short_hair signature sirkowski solo toenails toon_(style) uncensored vaginal vaginal_object_insertion watermark zoom_layer",
"tag_string_meta":"",
"is_rating_locked":false,
"created_at":"2006-10-24 20:30:49 UTC",
"rating":"e",
"tag_string_copyright":"beyond_good_and_evil",
"id":"80440",
"image_height":"818",
"file_url":"https://cdn.donmai.us/original/19/20/1920b4bb0e0a29c83300d86f0f006322.jpg",
"tag_string_character":"jade_(beyond_good_and_evil)",
"is_banned":false,
"uploader_id":"1",
"up_score":"0",
"file_ext":"jpg"
}
I see that gwern has taken the Danbooru 2021 metadata off-line. That is unfortunate ...
Thanks, your fix really helped me out.
Can you tell me how you determined that md5 is no longer provided?
I found the md5 string in both the GCP post.json file and the Danbooru API, so I don't know when they stopped using it.
If you're using BigQuery to collect the Danbooru API data, I'm interested in learning how.
OK, I'm starting to remember the details now. I did not download the post.json file from GCP as you linked; I manually downloaded a partial query, and inadvertently dropped the md5 column. As a result, I got the KeyError: 'md5' message, and assumed a similar scenario was the cause for your situation.
So I've not seen any of the metadata from GCP. You could help me get on the "same page" if you could upload a single json record from posts.json, as I did from the 2021 metadata.
A Python KeyError means the specified field doesn't exist. In my case, there was no md5 value. Note that field names are case-sensitive: if the md5 field exists in the GCP data but under a different name, the KeyError would still occur.
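A quick way to check for a missing or renamed field is to compare the keys of the first parsed record against what make_db.py expects. A sketch, with the expected names taken from the INSERT statement above:

```python
# Fields make_db.py reads from each posts record (from the INSERT above).
EXPECTED = {"rating", "source", "image_width", "image_height",
            "created_at", "updated_at", "uploader_id", "is_banned",
            "is_deleted", "is_flagged", "file_ext", "file_size",
            "md5", "has_children", "has_visible_children",
            "has_active_children"}

def missing_fields(json_line):
    """Return the expected fields absent from one parsed record."""
    return sorted(EXPECTED - json_line.keys())

# Example: a record lacking only md5 reports exactly that field.
sample = {k: "x" for k in EXPECTED - {"md5"}}
print(missing_fields(sample))  # ['md5']
```

Running this against the first record of the new dump would show immediately whether md5 was dropped or merely renamed.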
Is /Danbooru2021/database/makedb.py still usable even though I only have posts.json and tags.json in the metadata folder?
I keep getting a KeyError.
Without examples of JSON records from posts.json and tags.json, it's difficult for me to diagnose the problem. I don't have paid access to BigQuery, so I can't download them myself. If you'd post sample records, I'll take a look.
It occurs to me that a record in the JSON might be incomplete, which would cause a KeyError on that record. To skip incomplete records, add exception handling like so:
try:
    # existing code, indented one level
    c.execute("INSERT OR IGNORE INTO images VALUES (?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?,?, ?,?,?)",
        ...
    buildImageTags( ... )  # existing 5 calls to buildImageTags
except KeyError:
    pass  # consider printing some info for debugging
If the KeyError is general, you'll need to fix it; otherwise no records will be loaded.
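To keep the load from silently dropping everything, it helps to count the skips. A sketch of the same try/except pattern with minimal bookkeeping (the function and the insert callback are illustrative stand-ins for the loop in makedb.py):

```python
def load_records(records, insert):
    """Insert each record, skipping (and counting) incomplete ones."""
    loaded = skipped = 0
    for json_line in records:
        try:
            insert(json_line)  # stand-in for the c.execute(...) above
            loaded += 1
        except KeyError as e:
            skipped += 1
            print(f"skipped id={json_line.get('id')}: missing {e}")
    return loaded, skipped

# Example: one complete record, one missing md5.
rows = [{"id": "1", "md5": "abc"}, {"id": "2"}]
print(load_records(rows, lambda r: r["md5"]))  # (1, 1)
```

If "skipped" ends up near the total record count, the KeyError is general (a missing or renamed field) rather than a few bad records.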
Just in case of a communication disconnect ... The version of makedb.py in GitHub is for the 2021 metadata, and I'm not making any changes to that version, so continuing to use it unmodified will continue to cause a KeyError.
To address your issue with the latest GCP metadata, you need to modify and run your local copy of makedb.py as necessary.
Hello,
I'm interested in your project and want to recreate the metadata database, so I tried to rebuild it by downloading the JSON from GCP: https://console.cloud.google.com/storage/browser/danbooru_public/data?project=danbooru1
But I get the following error when I try to replace all
/mnt/D2/metadata/{filename.json}
Here is the error:
from:
replace:
Since I don't have your previous data, I can't check the error myself.
Hope you can help me out, thanks.