Open rvencu opened 3 years ago
Actually, I did something similar for @DefinatelyNotSam's workers on Discord. His worker, nicknamed "Cruncher", parses entire WARC files and has a crawling@home plugin to help out our project (yet to go live). It reformats the WARC files to match the WAT file URLs and sends them via custom endpoints I developed.
In summary, this is what the endpoints I made do:
- /custom/lookup-wat
- /custom/markasdone, which marks both the shards as done

I could definitely do something similar if this is what you'd like to achieve.
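Not the actual implementation, but a minimal sketch of the bookkeeping such endpoints might do, assuming shard state is kept in a simple in-memory dict keyed by WAT URL; all names and the example URL below are hypothetical:

```python
# Hypothetical in-memory registry: each WAT maps to the two shards
# derived from it, each either "pending" or "done".
shards = {
    "warc/CC-MAIN-000.wat.gz": {"shard_0": "pending", "shard_1": "pending"},
}

def lookup_wat(wat_url):
    """Mimics /custom/lookup-wat: return the shards tracked for a WAT."""
    return shards.get(wat_url)

def mark_as_done(wat_url):
    """Mimics /custom/markasdone: mark BOTH shards of the WAT as done."""
    for shard in shards[wat_url]:
        shards[wat_url][shard] = "done"
    return shards[wat_url]
```

The key detail, matching the description above, is that a single markasdone call flips both shards of the WAT at once.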
Yes, if I can get both shards into the worker and upload either combined or separate results, I don't mind reusing the same endpoint as Cruncher.
Though... the GPU should know which shard is which in order to mark them as done properly...
Ah, it won't be using the same endpoints; instead, I'll design custom ones that work directly with client instances.
Speaking with @rom1504, I learned that the dataset is ideally sized for model training if we can provide files containing an entire batch of 5,000-10,000 samples.
At this deduplication level, that is achieved by grouping 16 shards together, as I have already started to do. At the same time, the shard download takes 90-130 seconds and has become a significant part of the total job duration at the scraper. Performing a single download and processing 2 shards in one go will cut this time in half.
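A back-of-the-envelope check using the figures above (90-130 s per download, 16 shards per group): pairing two shards per WAT download halves the number of downloads per group, saving roughly 12-17 minutes per 16-shard group.

```python
# Figures taken from the discussion above; the arithmetic is mine.
DOWNLOAD_LOW_S, DOWNLOAD_HIGH_S = 90, 130  # per-WAT download time range
GROUP_SHARDS = 16                          # shards grouped per training file

# One download per shard vs. one download per pair of shards.
single_downloads = GROUP_SHARDS       # 16 downloads per group
paired_downloads = GROUP_SHARDS // 2  # 8 downloads per group

downloads_saved = single_downloads - paired_downloads
saved_low = downloads_saved * DOWNLOAD_LOW_S
saved_high = downloads_saved * DOWNLOAD_HIGH_S
print(f"saved per 16-shard group: {saved_low}-{saved_high} s")  # 720-1040 s
```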
I would like the server to send, whenever possible, 2 shards at once from the same WAT, like this:
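The exact payload isn't shown in the thread; purely as an illustration, a double-shard job might carry the shared WAT URL plus one entry per shard, so a node can tell the shards apart and detect single- vs. double-shard jobs. All field names and the URL below are invented:

```python
# Hypothetical job payload: one WAT download covering two shards.
job = {
    "wat_url": "https://example.com/CC-MAIN-000.wat.gz",  # placeholder URL
    "shards": [
        {"shard_id": "0", "status": "pending"},
        {"shard_id": "1", "status": "pending"},
    ],
}

def is_double_shard(job):
    """Lets a worker or GPU node detect one-shard vs. two-shard jobs."""
    return len(job["shards"]) == 2

print(is_double_shard(job))  # True
```

Keeping a distinct shard_id per entry is what would let the GPU node mark each shard as done correctly, per the concern raised earlier in the thread.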
That kind of information should also be available to the GPU node, so it knows whether the downloaded data corresponds to a single or a double shard.