LAION-AI / project-menu

Projects at LAION

Build a stage 1 worker using search engine results #7

Open rom1504 opened 2 years ago

rom1504 commented 2 years ago

Let's fill in all the details on this idea by @christophschuhmann

For example:

rvencu commented 2 years ago

We need to use the 3-stage workflow. The output step will send all the data to the Postgres database via the SQLAlchemy engine. We should use the CAH format for that. https://github.com/rvencu/crawlingathome-gpu-hcloud/blob/43eec102d3c4f08145a7704d4c65648619677768/ccpp.py#L375

The issue I have is that while we can use private workers, crowdsourced workers would expose the DB credentials, and I still have no idea how to curate the output to allow saving to the database. At this point I can only operate our swarm of private workers.

While testing we can use a test table instead of the production one.
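A minimal sketch of what that output step could look like, assuming a dataset_test table that mirrors the schema posted below; the CAH_DB_DSN environment variable, the save_pairs helper, and the MD5 convention for hash and url_hash (a guess from the 32-character columns) are illustrative assumptions, not settled details:

import hashlib
import os

from sqlalchemy import create_engine, text

# Credentials come from the environment so they never appear in worker
# code (the crowdsourcing concern above).
engine = create_engine(os.environ["CAH_DB_DSN"])  # e.g. postgresql://user:pass@host/cah

def save_pairs(pairs):
    # pairs: iterable of (sampleid, url, caption) tuples from the worker.
    rows = [
        {
            "sampleid": sampleid,
            "url": url,
            "text": caption,
            "hash": hashlib.md5(caption.encode()).hexdigest(),  # assumption: MD5 hex of the caption
            "url_hash": hashlib.md5(url.encode()).hexdigest(),  # dedup key, same assumption
        }
        for sampleid, url, caption in pairs
    ]
    with engine.begin() as conn:
        # ON CONFLICT respects the unique index on url_hash, so re-submitted
        # shards are skipped instead of raising errors.
        conn.execute(
            text(
                "INSERT INTO dataset_test (sampleid, url, text, hash, url_hash) "
                "VALUES (:sampleid, :url, :text, :hash, :url_hash) "
                "ON CONFLICT (url_hash) DO NOTHING"
            ),
            rows,
        )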

rvencu commented 2 years ago

Table structure is:

create table dataset
(
    sampleid bigint      not null
        constraint dataset_pk
            primary key,
    url      text        not null,
    text     text        not null,
    license  varchar(80),
    domain   varchar(60),
    wat      integer,
    status   smallint default 0,
    illegal  boolean  default false,
    hash     varchar(32) not null,
    modified timestamp,
    url_hash varchar(32) not null
);

alter table dataset
    owner to cah;

create index dataset_status_index
    on dataset (status);

create unique index dataset_url_hash_uindex
    on dataset (url_hash);
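
-- The trigger below needs update_modified_column() to exist first; it is
-- not part of the snippet. Reconstructed here from the note that it "just
-- updates the timestamp for the last modified time":
create or replace function update_modified_column() returns trigger as
$$
begin
    new.modified = now();
    return new;
end;
$$ language plpgsql;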

create trigger update_customer_modtime
    before update
    on dataset
    for each row
execute procedure update_modified_column();

The trigger just updates the timestamp for the last modified time.

christophschuhmann commented 2 years ago

So, an update from the Bing image search query tests: at first I got about 300 image-text pairs per second with my Colab code. Then, after a few hundred thousand samples, the IP gets blocked and the rate drops to ~10 samples per second. Still not bad, but maybe using Tor could be a good idea. I need to do some tests again.
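For reference, a rough sketch of this kind of scrape, assuming the Bing Images markup of the time (a.iusc anchors carrying a JSON m attribute with murl and t fields); the selectors are an assumption about the page and will break whenever Bing changes it:

import json

import requests
from bs4 import BeautifulSoup

def bing_image_search(query):
    # Plain HTML scrape of the public results page, no official API.
    resp = requests.get(
        "https://www.bing.com/images/search",
        params={"q": query},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    pairs = []
    for anchor in soup.select("a.iusc"):
        meta = json.loads(anchor.get("m", "{}"))
        if meta.get("murl"):
            # murl: full-resolution image URL, t: the caption shown with it.
            pairs.append((meta["murl"], meta.get("t", "")))
    return pairs

Sustaining ~300 pairs per second presumably means many such requests in flight at once, which is what gets the IP blocked.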

christophschuhmann commented 2 years ago

Here is the general plan:

(Additionally, we could get all named entities mentioned at least x times from Wikipedia, The Pile, ...)
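A minimal sketch of that named-entity part, using spaCy as one possible tagger (the library choice, model name, and threshold handling are mine, not specified in the thread):

from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # any NER-capable pipeline would do

def frequent_entities(texts, x=5):
    # Count surface forms of named entities across a corpus (Wikipedia,
    # The Pile, ...) and keep those mentioned at least x times.
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(ent.text for ent in doc.ents)
    return [entity for entity, n in counts.items() if n >= x]

The surviving entities would presumably become the search queries fed to the stage 1 workers.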

rom1504 commented 2 years ago

I expect that if this starts working at scale, search engines will actively work on banning us, and will succeed. I am wondering whether a crawling approach wouldn't be better (and/or some kind of partnership with an existing crawling organization).

christophschuhmann commented 2 years ago

Let's try Tor.
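A minimal sketch of what trying Tor could look like, assuming a local Tor client on the default SOCKS port 9050 with the control port 9051 open; requests needs the requests[socks] extra for the socks5h scheme, and stem's NEWNYM signal asks Tor for a fresh circuit (a new exit IP) once an address gets blocked:

import requests
from stem import Signal
from stem.control import Controller

# socks5h resolves DNS through Tor too, so lookups do not leak.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_via_tor(url):
    return requests.get(url, proxies=TOR_PROXIES, timeout=30)

def new_identity():
    # Ask the local Tor daemon for a new circuit when the exit IP is blocked.
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()  # assumes cookie auth or an empty password
        controller.signal(Signal.NEWNYM)

One caveat: Tor exit IPs are public and often rate-limited or blocked wholesale, so this may only delay the banning predicted above.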

TheoCoombes commented 2 years ago

Happy to help out on the tracker side of this :)

Sounds very promising