cwerner / fastclass

Little tools to download and then weed through images, delete and classify them into groups for building deep learning image datasets (based on crawler and tkinter)
Apache License 2.0
133 stars 25 forks

Scraping Pinterest? #25

Open venuv opened 4 years ago

venuv commented 4 years ago

First of all, nicely done (good docs, easy install, works like a charm). I would find Pinterest scraping quite useful.

cwerner commented 4 years ago

Hi.

Thanks for using the tool. I have not had the use case myself, but if there are easy options to include this, why not? However, the tool piggy-backs on icrawler, which only deals with Google, Bing and Baidu. Would you know of the required tools for this?

venuv commented 4 years ago

I could manually search for the Pinterest board relevant to the keyword(s) I'm searching for (say, 'recliners') and put its URL in the Excel file. Then fastclass could crawl that board using a capability such as https://github.com/xjdeng/pinterest-image-scraper

cwerner commented 4 years ago

Interesting.

I quickly peeked into the mentioned package and unfortunately it requires Selenium, which is a pretty hefty dependency for such a small package. I will think about it. I would be happy to look at a PR though if you want to give it a shot...

oezeadi commented 4 years ago

How can I use the tool to run image processing on my own images in a folder? I can't seem to get it to work (I'm new).

cwerner commented 4 years ago

Hi @oezeadi ,

I'm not quite sure what you are trying to do. If you want to classify a bunch of images, you can use the fc_clean command...

oezeadi commented 4 years ago

I have a folder of my own images and I'd like to process that folder the same way fcd processes images after it crawls.

cwerner commented 4 years ago

Let me get back to you tomorrow... I’ll have a look at it 👨‍💻

oezeadi commented 4 years ago

ok thanks, can't wait!

cwerner commented 4 years ago

Ok @oezeadi

I have some unfinished changes to fcc (there're also tk ui bugs I want to fix), but I just gave the current GitHub status a spin and I get the result that I expect...

So if you want to assign images to certain classes you start the command with

fcc folder/where/your/files/are

Then you use the number keys 1-9 to assign a class number to the current image. You will automatically move to the next image after each key press. If you need to change an assignment, you can use the left/right arrow keys to go back. Once done, you save the assignments by pressing the X key.
You should then get a report file in the folder, which lists each filename and its assigned class.
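To make the key flow above concrete, here is a minimal sketch of the bookkeeping behind such a session. It is not fastclass's actual code: the `LabelSession` class, the key names `"Left"`/`"Right"`/`"x"`, and the report as a list of `(filename, class)` pairs are all assumptions made for illustration, and there is no UI or file output here.

```python
from dataclasses import dataclass, field


@dataclass
class LabelSession:
    """Hypothetical model of a keyboard-driven labelling session."""
    files: list
    index: int = 0
    labels: dict = field(default_factory=dict)

    def _move(self, step):
        # Clamp the position so we never step outside the file list.
        last = max(len(self.files) - 1, 0)
        self.index = max(0, min(last, self.index + step))

    def press(self, key):
        """Handle one key press; returns the report when 'x' is pressed."""
        if self.files and len(key) == 1 and key in "123456789":
            # Assign the class to the current image, then advance.
            self.labels[self.files[self.index]] = int(key)
            self._move(+1)
        elif key == "Left":
            self._move(-1)
        elif key == "Right":
            self._move(+1)
        elif key.lower() == "x":
            return self.report()
        return None

    def report(self):
        # One (filename, class) pair per file; unlabelled files get None.
        return [(name, self.labels.get(name)) for name in self.files]


session = LabelSession(["a.jpg", "b.jpg", "c.jpg"])
for key in ["1", "2", "Left", "3"]:
    session.press(key)
report = session.press("x")
```

In this run the second image is first assigned class 2, then revisited with the left arrow and reassigned to class 3; the third image is never labelled.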

Is this what you are seeing, too?

oezeadi commented 4 years ago

Hey, so I guess my first question should have been: does your program actually do other types of image processing (resizing, checking for the right number of channels, removing duplicates, etc.)? That is what I assumed fcd was doing after it crawled for images.

If yes to the above, then does the new fcc you updated do these checks too? I don't know how to tell if a batch of images has been properly "checked/processed"...
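For reference, one of the checks mentioned above (removing duplicates) can be approximated outside the tool with the standard library alone. This is only a sketch and not something fastclass is confirmed to do: the function names and the suffix list are made up for illustration, and it only catches byte-identical files. Resized or re-encoded copies, as well as channel checks and resizing, would need an image library such as Pillow.

```python
import hashlib
from pathlib import Path

# Assumed set of image extensions; adjust to your own data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(folder):
    """Return (duplicate, first_seen) path pairs for byte-identical images."""
    seen = {}
    duplicates = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        digest = file_digest(path)
        if digest in seen:
            duplicates.append((path, seen[digest]))
        else:
            seen[digest] = path
    return duplicates
```

Calling `find_exact_duplicates("folder/where/your/files/are")` returns the pairs without deleting anything, so the list can be reviewed before any file is removed.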

cwerner commented 4 years ago

Ah ok.

Right, this would make sense. I intended fcc to weed through the download and basically mark bad or misclassified images (so it is run after fcd). I’ll have a look at how to share these checks between the scripts. I'm a bit busy next week but hope to update soon. However, if you have specific needs/ideas I would also appreciate a PR 😉

C