rom1504 opened this issue 2 years ago
We need to use the 3-stage workflow. The output stage will send all the data to the Postgres database via a SQLAlchemy engine. We should use the CAH format for that. https://github.com/rvencu/crawlingathome-gpu-hcloud/blob/43eec102d3c4f08145a7704d4c65648619677768/ccpp.py#L375
The issue I have is that while we can use private workers, using crowdsourced workers would expose the DB credentials, and I still have no idea how to curate the output before allowing it to be saved to the database. At this point I can only operate our swarm of private workers.
While testing, we can use a test table instead of the production one.
Table structure is:
```sql
create table dataset
(
    sampleid bigint not null
        constraint dataset_pk
            primary key,
    url      text        not null,
    text     text        not null,
    license  varchar(80),
    domain   varchar(60),
    wat      integer,
    status   smallint default 0,
    illegal  boolean  default false,
    hash     varchar(32) not null,
    modified timestamp,
    url_hash varchar(32) not null
);

alter table dataset
    owner to cah;

create index dataset_status_index
    on dataset (status);

create unique index dataset_url_hash_uindex
    on dataset (url_hash);

create trigger update_customer_modtime
    before update
    on dataset
    for each row
execute procedure update_modified_column();
```
The trigger just updates the `modified` timestamp whenever a row is updated.
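For the output stage, here is a minimal sketch of writing pairs into this table with SQLAlchemy. The `DATABASE_URL` DSN, the `insert_pair()`/`mark_status()` helpers, and the exact semantics of the `hash` column are assumptions (none of them are specified in this issue); while testing, point it at the test table instead.

```python
# Minimal sketch, assuming a Postgres DSN in the DATABASE_URL environment
# variable. insert_pair()/mark_status() are hypothetical helpers; the exact
# meaning of the `hash` column is not specified in this issue, so md5 of
# url + text is only a placeholder.
import hashlib
import os

from sqlalchemy import create_engine, text

engine = create_engine(os.environ["DATABASE_URL"])  # e.g. postgresql://cah:...@host/cah


def insert_pair(sampleid, url, caption, license=None, domain=None, wat=None):
    url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()  # fits varchar(32)
    row_hash = hashlib.md5((url + caption).encode("utf-8")).hexdigest()  # placeholder
    with engine.begin() as conn:
        conn.execute(
            text(
                "insert into dataset "
                "(sampleid, url, text, license, domain, wat, hash, url_hash) "
                "values (:sampleid, :url, :text, :license, :domain, :wat, :hash, :url_hash) "
                "on conflict (url_hash) do nothing"  # dedupe via the unique url_hash index
            ),
            dict(sampleid=sampleid, url=url, text=caption, license=license,
                 domain=domain, wat=wat, hash=row_hash, url_hash=url_hash),
        )


def mark_status(sampleid, status):
    # The update_customer_modtime trigger sets `modified` automatically here.
    with engine.begin() as conn:
        conn.execute(
            text("update dataset set status = :status where sampleid = :sampleid"),
            dict(sampleid=sampleid, status=status),
        )
```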
So, an update from the Bing image search query tests: at first I got about 300 image-text pairs per second with my Colab code. Then, after a few hundred thousand samples, the IP gets blocked and the rate drops to ~10 samples per second. Still not bad, but maybe using Tor could be a good idea. I need to run some tests again.
Here is the general plan:
We query several image search engines at the same time, e.g. Bing, Google, Yandex, DuckDuckGo, ..., for prepared queries, using small droplets (stage 1).
The queries are distributed by a tracker to the stage 1 workers. Each time, a stage 1 worker gets enough queries for ~1 h of work.
The output of stage 1 is a list of image-url/text pairs.
Stage 1 workers use multiprocessing to create processes that each open a connection to the Tor network, e.g. using Torpy. Each process receives a list of queries and sends them over this connection to the search engines. It puts a pause between consecutive queries to the same engine to avoid getting banned quickly. The connection to Tor should stay open for several requests, not just one, because it takes several seconds to open a new connection to Tor (see the sketch below).
It may work for a bit without Tor, but we should try to change IPs using Tor, to avoid complaints reaching the droplet providers we're using.
The queries are made by:
(Additionally, we could get all named entities mentioned at least x times in Wikipedia, The Pile, ...)
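A minimal sketch of such a stage 1 worker, assuming Torpy for the Tor connection; `fetch_search_results()` (the per-engine scraping and parsing), the engine list, and the pause length are placeholders rather than a worked-out implementation:

```python
# Minimal sketch of a stage 1 worker process, assuming Torpy
# (pip install torpy[requests]). fetch_search_results() and the
# constants below are placeholders.
import time
from multiprocessing import Pool

from torpy.http.requests import TorRequests

ENGINES = ["bing", "google", "yandex", "duckduckgo"]
PAUSE = 5.0  # seconds between requests; interleaving engines means each
             # engine is hit at most once every len(ENGINES) * PAUSE seconds


def fetch_search_results(session, engine, query):
    """Placeholder: query one engine over the Tor session and parse the
    result page into (image_url, text) pairs."""
    return []


def run_query_batch(queries):
    pairs = []
    # Keep one Tor circuit open for the whole batch: building a new circuit
    # takes several seconds, so reuse it across many requests.
    with TorRequests() as tor_requests:
        with tor_requests.get_session() as session:
            for query in queries:
                for engine in ENGINES:
                    pairs.extend(fetch_search_results(session, engine, query))
                    time.sleep(PAUSE)  # pause between consecutive requests
    return pairs


if __name__ == "__main__":
    # One batch per process; a real worker would get its batches from the tracker.
    batches = [["red apple", "blue car"], ["mountain lake", "city at night"]]
    with Pool(processes=len(batches)) as pool:
        for batch_pairs in pool.map(run_query_batch, batches):
            print(len(batch_pairs), "pairs")
```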
I expect that if this starts working at scale, search engines will actively work on banning us, and they will succeed. I am wondering whether a crawling approach wouldn't be better (and/or some kind of partnership with an existing crawling organization).
Let's try Tor.
Happy to help out on the tracker side of this :)
Sounds very promising
Let's fill in all the details of this idea by @christophschuhmann.
For example: