TheoCoombes / crawlingathome

A client library for LAION's effort to filter CommonCrawl with CLIP, building a large scale image-text dataset.
http://crawling.at
MIT License

allow for complete wat instead of shards #10

Open rvencu opened 3 years ago

rvencu commented 3 years ago

Speaking with @rom1504, I learned that the dataset is ideally sized for model training if we can provide files containing an entire batch of 5000-10000 samples.

At the current deduplication level, this is achieved by grouping 16 shards together, as I have already started doing. At the same time, the shard download takes 90-130 seconds and has become a significant part of the total job duration on the scraper. Since each WAT yields 2 shards, doing a single download and processing both shards in one go would cut this time in half.
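A quick back-of-the-envelope check of that claim, using only the numbers from this comment (illustrative arithmetic, nothing more):

```python
# Illustrative arithmetic only: amortized download cost per shard when one
# WAT download serves 2 shards instead of 1 (times taken from the comment).
low_s, high_s = 90, 130                    # observed WAT download time, seconds

per_shard_single = (low_s, high_s)         # 1 shard per download
per_shard_double = (low_s / 2, high_s / 2) # 2 shards per download

print(per_shard_single)   # (90, 130)
print(per_shard_double)   # (45.0, 65.0) -> download time per shard is halved
```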

I would like the server to send, whenever possible, 2 shards at once from the same WAT, like this:

```
client                                 server
client.newJob()
                                       get shard 0 if shard 1 is unavailable
                                       or
                                       get shard 1 if shard 0 is unavailable
                                       or
                                       get both shards if both are available
client.jobComplete(linktocomplete)
                                       server registers 1 or 2 jobs completed
```

That kind of information should also be available to the GPU node, so it knows whether the downloaded data corresponds to a single shard or a double shard.
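As a rough sketch of that flow on the client side (the endpoint paths, payload shape, and field names here are assumptions for illustration, not the actual crawlingathome API):

```python
# Sketch only: a client flow where newJob() may return 1 or 2 shards of the
# same WAT, and jobComplete() reports every shard it received. All URLs,
# endpoint paths, and field names are assumed for illustration.
import requests

SERVER = "http://crawlingathome.example"  # placeholder server URL

def new_job() -> dict:
    """Request work; the server replies with whichever shards are available."""
    r = requests.get(f"{SERVER}/api/newJob")
    r.raise_for_status()
    # Assumed payload shape:
    # {"wat_url": "...", "shards": [0]}      shard 1 was unavailable
    # {"wat_url": "...", "shards": [1]}      shard 0 was unavailable
    # {"wat_url": "...", "shards": [0, 1]}   both were available
    return r.json()

def job_complete(job: dict, results_url: str) -> None:
    """Report completion; the server registers 1 or 2 jobs accordingly."""
    payload = {
        "wat_url": job["wat_url"],
        "shards": job["shards"],   # lets the server mark each shard done
        "results": results_url,
    }
    requests.post(f"{SERVER}/api/jobComplete", json=payload).raise_for_status()
```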

TheoCoombes commented 3 years ago

Actually, I did something similar for @DefinatelyNotSam's workers on Discord. His worker, nicknamed "Cruncher", parses entire WARC files and has a crawling@home plugin to help out our project (yet to go live). He reformats the WARC files to match the WAT file URLs and sends them via custom endpoints I developed.

In summary, this is what the endpoints I made do:

I could definitely do something similar if this is what you'd like to achieve.

rvencu commented 3 years ago

Yes, if I can get both shards into the worker and upload either combined or separate results, I do not mind reusing the same endpoints as Cruncher.

Though... the GPU should know which is which in order to mark them as done properly...
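For illustration, one way to carry that "which is which" information along with the uploaded data could be a small per-job manifest; every name below is an assumption, not an existing crawlingathome interface:

```python
# Illustration only: a manifest the worker could upload next to its results
# so the GPU node knows which shard(s) a result file covers and can mark
# each one done separately. Field names are assumed, not a real schema.
import json

def write_manifest(path: str, wat_url: str, shards: list[int]) -> None:
    # shards is [0], [1], or [0, 1] depending on what the server dispatched
    with open(path, "w") as f:
        json.dump({"wat_url": wat_url, "shards": shards}, f)

def read_manifest(path: str) -> dict:
    # The GPU node reads this back and reports completion once per listed
    # shard, so a double-shard job is registered as 2 jobs done.
    with open(path) as f:
        return json.load(f)
```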

TheoCoombes commented 3 years ago

Ah, it won't be using the same endpoints; however, I'll design custom ones that work directly with client instances.