JaneliaSciComp / hortacloud

HortaCloud - 3D annotation platform for large microscopy data
BSD 3-Clause "New" or "Revised" License

Loading a large number of SWCs from s3 is slow #31

Closed: carshadi closed this issue 11 months ago

carshadi commented 1 year ago

Using the "Load Linux SWC Folder into New Workspace on Sample" option, loading 2,354 SWC files from an S3 bucket takes ~2 min. Assuming linear scaling, loading 2 million SWCs would take ~33 hr. Loading data from S3 is of course expected to be slower than from a local or network drive, but it would be great if there were a way to speed it up (parallelize somehow?). It also seems tricky to get that number of files onto the EC2 instance's local disk via Temporary Files, OneDrive, or Google Drive. Perhaps a way to unzip archives within the AppStream instance would help?

The SWCs I used are here: s3://aind-msma-morphology-data/test/from_google_exaSPIM_609281_2022-11-03_13-49-18-training-data_n5_whole-brain_consensus_1000/
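As a point of reference for the "parallelize somehow?" idea, below is a minimal sketch of a client-side parallel download using boto3 and a thread pool. The bucket and prefix are taken from the path above; everything else (destination directory, worker count, function names) is illustrative and not part of HortaCloud.

```python
# Hypothetical sketch: fetch many small SWC objects from S3 in parallel.
# Small files are latency-bound, so concurrent requests help far more
# than raw bandwidth does.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

BUCKET = "aind-msma-morphology-data"
PREFIX = "test/from_google_exaSPIM_609281_2022-11-03_13-49-18-training-data_n5_whole-brain_consensus_1000/"
DEST = Path("/tmp/swcs")  # illustrative destination

s3 = boto3.client("s3")  # boto3 clients are thread-safe


def list_swc_keys(bucket: str, prefix: str):
    """Yield all .swc object keys under the prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".swc"):
                yield obj["Key"]


def download(key: str) -> Path:
    """Download one object into DEST, keeping only the file name."""
    target = DEST / Path(key).name
    s3.download_file(BUCKET, key, str(target))
    return target


if __name__ == "__main__":
    DEST.mkdir(parents=True, exist_ok=True)
    keys = list(list_swc_keys(BUCKET, PREFIX))
    with ThreadPoolExecutor(max_workers=32) as pool:
        for _ in pool.map(download, keys):
            pass
    print(f"downloaded {len(keys)} SWC files to {DEST}")
```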

porterbot commented 1 year ago

Will generate a benchmark of dense SWCs on the Janelia instance to compare import load times.

porterbot commented 1 year ago

Importing the set of 2,355 neurons into HortaCloud took 2:15. The same set imported into the Janelia instance took 16 seconds, about 8 times faster.

porterbot commented 1 year ago

I ran a test on HortaCloud with the same set of neurons, but loaded from local disk. It took about 20-21 seconds: a bit slower than the Janelia instance, but clearly s3fs is what is slowing down the SWC import.

cgoina commented 1 year ago

The running time of the SWCImport service on /data/s3/janelia-mouselight-imagery/reconstructions/2018-10-01/build-brain-output/frags-with-5-or-more-nodes/as-swcs.tar, which contains ~110K SWC entries, was ~10 s from the time the service was queued to the time it completed. This was done using Postman, not from the workstation.
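For anyone trying to reproduce this, the snippet below is only a rough sketch of queuing an asynchronous import over HTTP (as was done here with Postman) using Python's requests. The service URL, payload fields, and token are placeholders, not the real SWCImport contract; the actual endpoint and parameters need to come from the JACS service definition.

```python
# Hypothetical sketch of queuing an SWC import over HTTP instead of the
# workstation. URL, payload field names, and token are placeholders;
# consult the actual JACS/SWCImport service API for the real contract.
import requests

SERVICE_URL = "https://<jacs-host>/api/rest-v2/async-services/swcImport"  # placeholder
TOKEN = "<api-token>"  # placeholder

payload = {
    # Tar archive of SWC files visible to the service, as in the path above.
    "swcLocation": "/data/s3/janelia-mouselight-imagery/reconstructions/2018-10-01/"
                   "build-brain-output/frags-with-5-or-more-nodes/as-swcs.tar",
    "sampleId": "<sample-id>",       # placeholder
    "workspaceName": "<workspace>",  # placeholder
}

resp = requests.post(
    SERVICE_URL,
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # queued-service id/status; format depends on the service
```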