As an informal suggestion for the interim, which sidesteps this issue, the previous clipboard method could be provided as an option so that the program would read from `image_clipboard.txt` as before. However, any URL that fails to return an image would then have to be skipped, with no alternative.
And currently, without #7 addressed, any failed downloads would prevent subsequent files from being processed and zipped. From what I can tell, I'd get `None` from these downloads. Thus, if I have them like so on this line:
`[foo, None, bar, None, None, baz, ...]`
where anything other than `None` is a valid element, only `foo` will be output to the zip file.
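To illustrate the behavior I'd hope for instead, here's a minimal sketch (not the project's actual code; the list contents and file names are made up) that skips the `None` entries so everything else still ends up in the zip:

```python
import io
import zipfile

# Hypothetical download results: None marks a failed download (names/bytes made up).
downloads = [
    ("0001_A.jpg", b"..."),
    None,
    ("0002_B.jpg", b"..."),
    None,
    None,
    ("0003_C.jpg", b"..."),
]

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for entry in downloads:
        if entry is None:
            continue  # skip the failed download instead of aborting the whole run
        name, data = entry
        archive.writestr(name, data)
    print(f"zipped {len(archive.namelist())} of {len(downloads)} items")  # 3 of 6
```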
Also, if 1000 is indeed the hard limit, perhaps an option could be added so that the index numbers could start at an arbitrary number. As an example, the option might be something like `--index_start 551`. That way, if I have the abovementioned collections in My Saves, I could rearrange them (via the Collections button in the Edge browser itself) like so in another run:
Collection 2 (500 items)
Collection 1 (550 items)
Collection 3 (1000 items)
etc.
Now Collection 2 would have all items processed, and the index could start from 551 onwards, going from 551 to 1050. With this, I could put `COLLECTIONS_TO_INCLUDE=Collection 1` in `.env` on the first run, then swap it for Collection 2 on the second, extract both outputs to the same folder, and all the index numbers would be unique, covering more than 1000 items.
Additionally, as a bonus, keeping the index numbers unique would greatly benefit #8.
@rc-gr You raise an important point.
I've heard of other users having this issue, and I'm currently working on a script that can import collections from a `collection_dict` as input, so I can test this myself.
However, the alternative you mentioned won't really work.
The API returns all collections of a certain type, like `Generic` for AI-generated images and `Image` for Bing images. It returns all of them and seems to truncate after 1000 entries.
Maybe they are using some kind of pagination, which I can test for myself after I've finished the script for importing collections.
Could you check in the meantime if the copy button on the website returns more than 1000 entries?
You can just search for https://www.bing.com/images/create/ in the text from the clipboard, in Notepad++ for example.
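For anyone checking this, a quick way to count is to paste the clipboard contents into a text file and count the occurrences programmatically (the file name below is arbitrary):

```python
# Count how many generation links the copied text contains.
# "clipboard_dump.txt" is just an assumed name for a file you pasted the clipboard into.
with open("clipboard_dump.txt", encoding="utf-8") as f:
    text = f.read()

print(text.count("https://www.bing.com/images/create/"))
```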
I think I may have created a misunderstanding here. I was merely referring to offsetting the index in the file names. So after processing, instead of `0001_A.jpg`, `0002_B.jpg`, `0003_C.jpg`, etc., I could apply an offset of e.g. 1000 so the names would be generated as `1001_A.jpg`, `1002_B.jpg`, `1003_C.jpg`, etc. The images themselves remain as-is.
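In other words, the offset would only change how the output names are formatted; a rough sketch with made-up variable names:

```python
# A made-up sketch of offsetting only the file-name index; the image data is untouched.
index_start = 1001  # e.g. supplied via a hypothetical --index_start option

suffixes = ["A", "B", "C"]  # stand-ins for whatever the script derives per image

for offset, suffix in enumerate(suffixes):
    print(f"{index_start + offset:04d}_{suffix}.jpg")  # 1001_A.jpg, 1002_B.jpg, 1003_C.jpg
```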
> Could you check in the meantime if the copy button on the website returns more than 1000 entries?
On this front, I don't know how it is for others, but for me the button refuses to work if I attempt to copy more than 54 items. I don't know why the limit is such an arbitrary number.
For reference, the browser's Collections version is worse, as I'm only able to copy the first 24 items with "Copy Items to Clipboard".
It's strange, as I do recall being able to copy more than 1000 items to the clipboard at one point, but only via the browser's Collections, which is why I had my presumptions and suggested the clipboard fallback in the first place.
@rc-gr I'll have to work on the import script first then. What I meant is that there's no way to continue past the limit, so I don't really see what an offset would accomplish; you will always get the same 1000 items. But that made me think of something: in theory you could chunk the collections by dumping the `collection_dict`, deleting the items from the collection via another API endpoint, and repeating this until no images are left, then reimporting the concatenated `collection_dict` at the end. That might work, but it would need extensive testing to prevent permanent data loss.
I tried to download the hard-limit number of images, but after changing the number from 500 to 1000 on the corresponding line, I get an error at the end and a zip file containing 239 images. When it was at 500, I got all 500 images in the zip. I am pretty sure DALL-E images don't get deleted, and my number of items hasn't gone down by even 1. Does anybody know why it would say '404 not found'?
@Gabriele007xx Have you tried rerunning the script at a later time? I just checked on one of the links for the failed downloads, and it shows me an image as intended.
In any case, I had suggested a workaround at #7 where the thumbnail would be downloaded instead in the event that the original fails to download for whatever reason. From what I can tell, a failed download is less likely to occur for thumbnails, as I've yet to see a broken image for them in my collections.
@rc-gr @Gabriele007xx The import functionality already looks promising. I had to manually throttle it, because the backend is slow.
I still had issues for some specific images I'll have to investigate.
Once the import feature is finished I'll start on the delete feature (using the delete from collection api) which would create the basis for this feature.
However, progress is slowed down by the API being a bit temperamental and by the fact that I don't have any docs for it.
Hi, I'm trying to find out more information about the collection API for a similar project I'm working on (organizing collection image data in Google Sheets). Is there documentation somewhere? I'm running into the same problem with the 1000-image limit via the API; however, deletion isn't really an option for my situation (unless I migrate away from Bing collections ...)
@vinnyreid
There are no docs since it's not an official API.
The deletion was meant more as a workaround: operating in batches of 1000 images, downloading them, deleting them from the collection, and then re-adding them.
But I'm already having issues with adding them, since there are server errors at around 600 images even when using a very low semaphore value, so it's extremely slow too.
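For context, the kind of throttling meant here looks roughly like this (a generic sketch, not the repo's actual code; the API call is simulated with a sleep):

```python
import asyncio

# Keep only a few add requests in flight at once so the backend doesn't start
# erroring out partway through a large batch. The request itself is simulated.
semaphore = asyncio.Semaphore(3)

async def add_item(item):
    async with semaphore:
        await asyncio.sleep(0.5)  # stand-in for the actual add-to-collection call
        return item

async def add_all(items):
    return await asyncio.gather(*(add_item(i) for i in items))

asyncio.run(add_all(list(range(20))))
```

Lowering the semaphore value trades speed for fewer server errors, which is why large batches get so slow.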
You can try out the `BingCreatorCollectionImport` class, but make sure to replace `CollectionId` with one of your own collections.
You can find the Id in a request in your browser, like when adding or removing an image.
Also, it takes the dictionary from the https://www.bing.com/mysaves/collections/get API, so you can just write the `collection_dict` variable in the `__gather_image_data` method to a JSON file for testing.
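If it helps, dumping that variable is only a couple of lines; something like this placed inside `__gather_image_data` (the output file name is arbitrary):

```python
import json

# Dump the dictionary returned by https://www.bing.com/mysaves/collections/get
# so it can be reused as test input without hitting the API again.
with open("collection_dict.json", "w", encoding="utf-8") as f:
    json.dump(collection_dict, f, ensure_ascii=False, indent=2)
```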
So the way I found to get around that problem is by creating collections of at most 1000 images each. Currently I have 4 collections (named 1, 2, 3, and 4) of roughly 1000 gens each (originally from the default collection, but I moved them 1k at a time manually). Since it seems the downloader can only pull the most recent gens (I say this because trying to download collection 1 or 2 returns no images, while 3 and 4 return 350 and 650 respectively), I did the following:
@AblazingHeart
Sounds like a very interesting workaround.
Basically creating a new temp collection that's 1000 images max and using that instead.
Just wish the add collection API was more robust or the get collection API more flexible, so the user wouldn't have to do this manually.
Can someone also try these steps, so I can see if I should add the old version back? If there are more than 1000 occurrences, I would add an option to use a .txt file instead.
Also, my workaround sometimes throws this error, although it seems to resolve itself with time: in the morning I went through my steps and it threw that error, but when I tried downloading again just now it worked.
Btw @Richard-Weiss, I can indeed copy more than 1k items to the clipboard; the problem is that it takes a lot of time, roughly 10 minutes per 1k items for me. Also, I remember trying the clipboard version of the downloader when you published it, but using it made my PC bluescreen by using 100% of my CPU. I think it had something to do with illegal characters and/or emojis in my prompts.
@Richard-Weiss, now I can finally also confirm that the clipboard method works, albeit with fewer than 1000 images currently (about 200+ now). What changed this time was that I turned on clipboard history (using "Win+V" on Win 11) and observed it. It took about 10 seconds for the copied items to show up there. Before, when I was unaware that my clipboard history was disabled, the clipboard method only seemed to work sporadically.
P.S. I have since pruned my collections because I found that my saved images now started to expire randomly and sparsely if they were generated more than a month ago, which then became severely apparent at >3 months. Thankfully, I've already downloaded them beforehand, so there's little reason for me to keep the 1000+ thumbnails without their original images lying around.
Also, to add to @AblazingHeart's method, perhaps I should've clarified much earlier how I've been downloading my collections when I've had more than 1000 items; I feel this approach is more straightforward, since it doesn't require loading all the images in a collection.
Access Collections via the Collections icon on Edge ("Ctrl+Shift+Y"). You'll see your collections page on the side, something like the following (this is on an alt, btw):

Using the image above as reference, assume that Collections 1 through 4 all have 1000 images. If I were to run the program as-is (with the instructions provided on the main page), I would get 1000 images from only Collection 4 due to API limitations. What if I wanted the program to get 1000 images from Collection 1?
Open the 3 dots menu and select "Manage": All collections will now each have a handle beside them:
Drag the handle (of Collection 1 in this example) to the very top, above all other collections. Save to apply the changes: You can also confirm that the changes have been applied by going to your full Collections page:
If you then run the program once more, assuming you have made sure your cookie has been updated in the .env, you should have the 1000 images downloaded from Collection 1.
Notice that there's also no need to specify which collection to download from in the config file. If Collection 1 at this point had fewer than 1000 images, the program would retrieve from subsequent collections down the list (i.e. Collection 4, then 3, then 2 in this case) until it hits 1000 images.
@AblazingHeart I think you might be seeing that error because it seems that thumbnails can expire pretty quickly, like so: And this was just 4 days ago as of this post! Thankfully, the original image is still present in this case (reachable via the generation page link by clicking through the thumbnail).
Because of this, with the expired thumbnail, the program is more likely to fail here if the original image also could not be retrieved for whatever reason. However, as long as the original image has yet to expire, a quick re-run or two should get the program to proceed successfully, as you've observed.
I've added the fallbacks and statistics now.
I've also added some code to prevent an error when the thumbnail property isn't there, so that error shouldn't be happening anymore.
I only have like 100 images or so in my own account. Can someone try out having a collection with 2000+ images and using the clipboard again and wait for 1-2 minutes?
If it returns all the links, I would add back the alternative method with the improvements I've made so far. Please don't use the actual old method; it creates a new instance of Firefox for each image, which explains the hardware usage.
Using the text file would maybe take more total time, but less hands-on time for the user.
I've seen that someone made a chrome plugin for it.
I've looked into it and I think I'll create a userscript you can use with GreaseMonkey, TamperMonkey etc. so it is browser agnostic.
That should also work with 1000+ images in a single collection.
I'll let you know once it has parity with the vital existing features.
> Can someone try out having a collection with 2000+ images and using the clipboard again and wait for 1-2 minutes?
In case this question is still interesting, I was able to do so. It may be necessary to give bing.com access to the clipboard. I had to try a few times, then at some point I got the browser prompt for granting the permission. After that I was able to paste from the clipboard.
Three cases tested:
/edit
OS: Windows 10 Pro x64
Browser: Edge x64 Version 120.0.2210.121
@Ruffy314 Thanks for the info.
I forgot to mention it in the other comment.
Can you also measure the time it takes?
I tested it with around 1600 images and it took 2:45 minutes.
I think they are also using something similar to the detail API for it to work, because the HTML elements don't have the necessary data.
Using that API is actually the bottleneck right now, even for my planned userscript implementation.
I think it would be feasible to implement it using a text file, albeit a bit slow, because the API gets a bit fussy if I'm sending requests too quickly, and it's not even returning a 429 but a 203.
It also gets worse the more images you have, so there might be some issues if I don't set the limit low enough.
I think I'll just add some code to use a predetermined text file and add an option to the .toml to switch between using the collection API and text file.
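As a rough idea of what such a switch could look like (the key names below are made up and may not match the final option names):

```python
import tomllib  # stdlib in Python 3.11+; older versions need a third-party TOML package

with open("config.toml", "rb") as f:
    config = tomllib.load(f)

# Hypothetical keys -- the real option names may end up different.
use_text_file = config.get("use_text_file", False)
text_file_path = config.get("text_file_path", "image_clipboard.txt")

if use_text_file:
    with open(text_file_path, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
else:
    urls = []  # fall back to the existing collection-API path here
```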
@Richard-Weiss
The 3500 images from a single collection (case 1) took about 45 seconds for me.
852 images over seven collections took about 15 seconds.
4352 images across eight collections (both sets above at once) took 57 seconds.
So more or less linear time regarding number of images, maybe some overhead for each collection.
CPU: Intel Core i5-8400T 2x 1.7 GHz, 16 GB RAM
It looks like I have to stay in the browser tab while it is copying. When I switched to a different program the clipboard did not get filled.
@Ruffy314 It's quite random, and I think it depends on the current load the server has. If you open the network tab, you can see that it requests the thumbnail for each item sequentially and at the end copies the text to the clipboard. From my testing of the userscript, calling the details API yourself is actually faster, even if I restrict it to 5 parallel requests. Maybe I can also improve the copy button while I'm at it. I can create a new repo with a dev branch if you are curious and want to test it out; I'll just have to implement the download section and zipping, but I have all the data I need from the details API for all my 1600 images. It's not really functional in that way, just if you are curious. The only downside of the userscript is that I don't have access to the thumbnail URL that still works sometimes; I only have the thumbnail from the HTML and the two URLs from the details API, which are often the same.
@Ruffy314 I'm actually a bit curious how well it works with even larger collections. Can you try it for me? It adds two new buttons: one to scroll down to load all images, and the other to download the images. The download button just collects the detailed information for now; you can see the progress in the browser console set to verbose. Here's the current state: https://gist.github.com/Richard-Weiss/66af2f13b248ff50cd1752b7789a833b You can try it with an extension like ViolentMonkey or TamperMonkey. The only issue I'm encountering sometimes is that the detail API returns a 203 for some images, and sometimes if I use it too often I can't even load something like bing.com for 5 minutes or so. I think I have to decrease the concurrency even further.
@Richard-Weiss I have not encountered any problems. After selecting the first image of my 3500 collection and clicking the new scroll down button, it took the script 41 seconds to scroll to the end. (Very nice that it automatically returns to the top of the page).
Clicking the existing select all button shows indeed 3500 elements, so nothing is missing.
Then clicking the new download-selected button fetches the data for 1000 images in ~14 seconds, and 49 seconds for the full 3500.
@Ruffy314 That sounds good, thanks. I think I'll improve the logic for the scroll-down to be more dynamic rather than using a hardcoded wait time; it kinda annoyed me too. But the rest seems to be working fine. I'll add the .txt option to this repo in the coming days/weeks and create a repo for the userscript once I've implemented the download. Thanks for your feedback. 🙂
For example, if I have the following collections (assuming they're all valid) in the order as they are shown in My Saves:
Collection 1 (550 items)
Collection 2 (500 items)
Collection 3 (1000 items)
etc.
Only all of Collection 1 and 450 items from Collection 2 would be retrieved. The rest of Collection 2 and all other subsequent collections would be zeroed.
As it would seem, any value above 1000 for `maxItemsToFetch` is disregarded.