rom1504 opened this issue 2 years ago
We need to use the 3-stage workflow. The output stage will send all the data to the Postgres database via a SQLAlchemy engine. We should use the CAH format for that. https://github.com/rvencu/crawlingathome-gpu-hcloud/blob/43eec102d3c4f08145a7704d4c65648619677768/ccpp.py#L375
The issue I have is that while we can use private workers, using crowdsourced workers would expose the DB credentials, and I still have no idea how to curate the output before allowing it to be saved to the database. At this point I can only operate our swarm of private workers.
While testing, we can use a test table instead of the production one.
Table structure is:
```sql
create table dataset
(
    sampleid bigint not null
        constraint dataset_pk
            primary key,
    url      text        not null,
    text     text        not null,
    license  varchar(80),
    domain   varchar(60),
    wat      integer,
    status   smallint default 0,
    illegal  boolean  default false,
    hash     varchar(32) not null,
    modified timestamp,
    url_hash varchar(32) not null
);

alter table dataset
    owner to cah;

create index dataset_status_index
    on dataset (status);

create unique index dataset_url_hash_uindex
    on dataset (url_hash);

create trigger update_customer_modtime
    before update
    on dataset
    for each row
execute procedure update_modified_column();
```
The trigger just updates the `modified` timestamp whenever a row is updated.
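For the output stage, here is a minimal sketch of writing pairs into this table with SQLAlchemy. The `DATABASE_URL` DSN, the `insert_pair()`/`mark_status()` helpers, and the exact semantics of the `hash` column are assumptions (none of them are specified in this issue); while testing, point it at the test table instead.

```python
# Minimal sketch, assuming a Postgres DSN in the DATABASE_URL environment
# variable. insert_pair()/mark_status() are hypothetical helpers; the exact
# meaning of the `hash` column is not specified in this issue, so md5 of
# url + text is only a placeholder.
import hashlib
import os

from sqlalchemy import create_engine, text

engine = create_engine(os.environ["DATABASE_URL"])  # e.g. postgresql://cah:...@host/cah


def insert_pair(sampleid, url, caption, license=None, domain=None, wat=None):
    url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()  # fits varchar(32)
    row_hash = hashlib.md5((url + caption).encode("utf-8")).hexdigest()  # placeholder
    with engine.begin() as conn:
        conn.execute(
            text(
                "insert into dataset "
                "(sampleid, url, text, license, domain, wat, hash, url_hash) "
                "values (:sampleid, :url, :text, :license, :domain, :wat, :hash, :url_hash) "
                "on conflict (url_hash) do nothing"  # dedupe via the unique url_hash index
            ),
            dict(sampleid=sampleid, url=url, text=caption, license=license,
                 domain=domain, wat=wat, hash=row_hash, url_hash=url_hash),
        )


def mark_status(sampleid, status):
    # The update_customer_modtime trigger sets `modified` automatically here.
    with engine.begin() as conn:
        conn.execute(
            text("update dataset set status = :status where sampleid = :sampleid"),
            dict(sampleid=sampleid, status=status),
        )
```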
So, an update from the Bing image search query tests: at first I got about 300 image-text pairs per second with my Colab code. Then, after a few hundred thousand samples, the IP gets blocked and the rate drops to ~10 samples per second. Still not bad, but maybe using Tor could be a good idea. I need to run some tests again.
Here is the general plan:
We query several image search engines at the same time, e.g. Bing, Google, Yandex, DuckDuckGo, ..., for prepared queries, using small droplets (stage 1).
The queries are distributed by a tracker to the stage 1 workers. Each time, a stage 1 worker gets enough queries for ~1 h of work.
The output of stage 1 is a list of image-url/text pairs.
Stage 1 workers use multiprocessing to create processes that each open a connection to the Tor network, e.g. using Torpy. Each process receives a list of queries and sends them over this connection to the search engines. It puts a pause between consecutive queries to the same engine to avoid getting banned quickly. The connection to Tor should stay open for several requests, not just one, because it takes several seconds to open a new connection to Tor (see the sketch below).
It may work for a bit without Tor, but we should try to change IPs using Tor, to avoid complaints reaching the droplet providers we're using.
The queries are made by:
(Additionally, we could get all named entities mentioned at least x times in Wikipedia, The Pile, ...)
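A minimal sketch of such a stage 1 worker, assuming Torpy for the Tor connection; `fetch_search_results()` (the per-engine scraping and parsing), the engine list, and the pause length are placeholders rather than a worked-out implementation:

```python
# Minimal sketch of a stage 1 worker process, assuming Torpy
# (pip install torpy[requests]). fetch_search_results() and the
# constants below are placeholders.
import time
from multiprocessing import Pool

from torpy.http.requests import TorRequests

ENGINES = ["bing", "google", "yandex", "duckduckgo"]
PAUSE = 5.0  # seconds between requests; interleaving engines means each
             # engine is hit at most once every len(ENGINES) * PAUSE seconds


def fetch_search_results(session, engine, query):
    """Placeholder: query one engine over the Tor session and parse the
    result page into (image_url, text) pairs."""
    return []


def run_query_batch(queries):
    pairs = []
    # Keep one Tor circuit open for the whole batch: building a new circuit
    # takes several seconds, so reuse it across many requests.
    with TorRequests() as tor_requests:
        with tor_requests.get_session() as session:
            for query in queries:
                for engine in ENGINES:
                    pairs.extend(fetch_search_results(session, engine, query))
                    time.sleep(PAUSE)  # pause between consecutive requests
    return pairs


if __name__ == "__main__":
    # One batch per process; a real worker would get its batches from the tracker.
    batches = [["red apple", "blue car"], ["mountain lake", "city at night"]]
    with Pool(processes=len(batches)) as pool:
        for batch_pairs in pool.map(run_query_batch, batches):
            print(len(batch_pairs), "pairs")
```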
I expect that if this starts working at scale, search engines will actively work on banning us, and they will succeed. I am wondering whether a crawling approach wouldn't be better (and/or some kind of partnership with an existing crawling organization).
Let's try Tor.
Happy to help out on the tracker side of this :)
Sounds very promising
Let's fill in all the details of this idea by @christophschuhmann.
For example: