TheoCoombes / crawlingathome

A client library for LAION's effort to filter CommonCrawl with CLIP, building a large scale image-text dataset.
http://crawling.at
MIT License

allow for complete wat instead of shards #10

Open rvencu opened 3 years ago

rvencu commented 3 years ago

Speaking with @rom1504, I learned that the dataset is ideally sized for model training if we can provide files containing an entire batch of 5000-10000 samples.

At the current deduplication level, this is achieved by grouping 16 shards together, as I have already started doing. At the same time, the shard download takes 90-130 seconds and has become a significant part of the total job duration on the scraper. Since each WAT yields 2 shards, doing a single download and processing both shards in one go would cut this time in half.
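A quick back-of-the-envelope check of that claim, using only the numbers from this comment (illustrative arithmetic, nothing more):

```python
# Illustrative arithmetic only: amortized download cost per shard when one
# WAT download serves 2 shards instead of 1 (times taken from the comment).
low_s, high_s = 90, 130                    # observed WAT download time, seconds

per_shard_single = (low_s, high_s)         # 1 shard per download
per_shard_double = (low_s / 2, high_s / 2) # 2 shards per download

print(per_shard_single)   # (90, 130)
print(per_shard_double)   # (45.0, 65.0) -> download time per shard is halved
```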

I would like the server to send, whenever possible, 2 shards at once from the same WAT, like this:

```
client                                 server
client.newJob()
                                       get shard 0 if shard 1 is unavailable
                                       or
                                       get shard 1 if shard 0 is unavailable
                                       or
                                       get both shards if both are available
client.jobComplete(linktocomplete)
                                       server registers 1 or 2 jobs completed
```

That kind of information should also be available to the GPU node, so it knows whether the downloaded data corresponds to a single shard or a double shard.
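As a rough sketch of that flow on the client side (the endpoint paths, payload shape, and field names here are assumptions for illustration, not the actual crawlingathome API):

```python
# Sketch only: a client flow where newJob() may return 1 or 2 shards of the
# same WAT, and jobComplete() reports every shard it received. All URLs,
# endpoint paths, and field names are assumed for illustration.
import requests

SERVER = "http://crawlingathome.example"  # placeholder server URL

def new_job() -> dict:
    """Request work; the server replies with whichever shards are available."""
    r = requests.get(f"{SERVER}/api/newJob")
    r.raise_for_status()
    # Assumed payload shape:
    # {"wat_url": "...", "shards": [0]}      shard 1 was unavailable
    # {"wat_url": "...", "shards": [1]}      shard 0 was unavailable
    # {"wat_url": "...", "shards": [0, 1]}   both were available
    return r.json()

def job_complete(job: dict, results_url: str) -> None:
    """Report completion; the server registers 1 or 2 jobs accordingly."""
    payload = {
        "wat_url": job["wat_url"],
        "shards": job["shards"],   # lets the server mark each shard done
        "results": results_url,
    }
    requests.post(f"{SERVER}/api/jobComplete", json=payload).raise_for_status()
```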

TheoCoombes commented 3 years ago

Actually, I did something similar for @DefinatelyNotSam's workers on Discord. His worker, nicknamed "Cruncher", parses entire WARC files and has a crawling@home plugin to help out our project (yet to go live). He reformats the WARC files to match the WAT file URLs and sends them via custom endpoints I developed.

In summary, this is what the endpoints I made do:

I could definitely do something similar if this is what you'd like to achieve.

rvencu commented 3 years ago

Yes, if I can get both shards into the worker and upload either combined or separate results, I do not mind reusing the same endpoints as Cruncher.

Though... the GPU should know which is which in order to mark them as done properly...
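For illustration, one way to carry that "which is which" information along with the uploaded data could be a small per-job manifest; every name below is an assumption, not an existing crawlingathome interface:

```python
# Illustration only: a manifest the worker could upload next to its results
# so the GPU node knows which shard(s) a result file covers and can mark
# each one done separately. Field names are assumed, not a real schema.
import json

def write_manifest(path: str, wat_url: str, shards: list[int]) -> None:
    # shards is [0], [1], or [0, 1] depending on what the server dispatched
    with open(path, "w") as f:
        json.dump({"wat_url": wat_url, "shards": shards}, f)

def read_manifest(path: str) -> dict:
    # The GPU node reads this back and reports completion once per listed
    # shard, so a double-shard job is registered as 2 jobs done.
    with open(path) as f:
        return json.load(f)
```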

TheoCoombes commented 3 years ago

Ah, it won't be using the same endpoints; however, I'll design custom ones that work directly with client instances.