bwarren2 / datadrivendota

Codebase for dota analytics

Bulk upload to s3? #568

Closed bwarren2 closed 8 years ago

bwarren2 commented 8 years ago

Currently a full replay parse takes ~3 minutes with 3 parsing workers active and, IIRC, ~9 minutes with 1 active. This is a really long time; YASP goes end-to-end in 20s. Our parse step (most analogous to what they are doing) also takes about that long; the big issue is making many, many PUTs to S3 in series. Is some sort of bulk upload possible? Is there a multithreading approach to this, in either Python or Java? See also #567.
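
For reference, one way the multithreading question could be approached in Python is a thread pool that issues the PUTs concurrently instead of in series. This is only a sketch under assumptions: `upload_all`, `make_s3_putter`, and the bucket name are hypothetical names, and it assumes boto3 (`boto3.client("s3").put_object`), which may not be the Boto version this project uses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(put_object, items, max_workers=16):
    """Run put_object(key, body) for every (key, body) pair on a thread pool.

    Returns (succeeded_keys, failed), where failed maps key -> exception.
    put_object is injected so the same helper works with any uploader.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(put_object, key, body): key for key, body in items}
        for future in as_completed(futures):
            key = futures[future]
            try:
                future.result()  # re-raises any exception from the worker thread
                succeeded.append(key)
            except Exception as exc:
                failed[key] = exc
    return succeeded, failed


def make_s3_putter(bucket):
    """Hypothetical helper: build a put_object callable backed by boto3."""
    import boto3  # imported lazily so upload_all has no hard dependency on it

    client = boto3.client("s3")

    def put_object(key, body):
        client.put_object(Bucket=bucket, Key=key, Body=body)

    return put_object
```

Usage would look something like `upload_all(make_s3_putter("some-bucket"), pairs)`. Since the PUTs are network-bound, threads should overlap the request latency, but the right `max_workers` would need measuring against the workers' memory limits.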

wlonk commented 8 years ago

Boto has bulk operations on S3, yes.

bwarren2 commented 8 years ago

One thing I have run up against without handling well is memory footprint. We are currently using 2x workers for this step because of memory overruns, and even then things tend to go over sometimes. I suspect my entire implementation is bad and in need of reconsideration in light of these constraints.

wlonk commented 8 years ago

Streaming will be our friend.
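
A rough sketch of what streaming could mean here, assuming the parse output can be consumed as an iterable of JSON-serialisable records: group records into fixed-size batches lazily, so only one batch is held in memory at a time and each batch becomes one object rather than many small ones. The names `batched` and `stream_batches`, the key layout, and the JSON Lines encoding are all illustrative assumptions, not the project's actual format.

```python
import json
from itertools import islice


def batched(records, size):
    """Yield lists of up to `size` records without materialising the whole input."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def stream_batches(records, size, key_prefix):
    """Lazily yield (key, body) pairs, one JSON Lines payload per batch."""
    for index, chunk in enumerate(batched(records, size)):
        body = "\n".join(json.dumps(record) for record in chunk).encode("utf-8")
        yield f"{key_prefix}/part-{index:05d}.jsonl", body
```

Each yielded pair can be uploaded and then dropped before the next batch is built, which bounds memory by the batch size instead of the replay size.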

bwarren2 commented 8 years ago

Cutting down the number of files and scaling horizontally keeps things under 3 minutes.