coralnet / pyspacer

Python-based tools for spatial image analysis
MIT License

AWS SQS deploy scripts. #17

Closed: beijbom closed this 4 years ago

beijbom commented 4 years ago

This PR adds scripts to test the new compute cluster on AWS. These scripts all rely on the spacer_test_jobs and spacer_test_results queues. These queues are isolated from the production queues but share the same compute instances, so we can test things like memory use, runtime, etc.

Also note that this cluster is a brand new one, totally isolated from the cluster used for the current production environment. I will leave both up and running until we have completed the transition.

Action points:

Summary of changes:

beijbom commented 4 years ago

@StephenChan @qiminchen : I still have a few more tests to run, but the new AWS cluster is up and running. To test, run python scripts/aws_shakeout.py; see the doc-strings in that file for details. Output from a recent run:

-> Starting ECS shakeout script.
-> Purged 0 messages from spacer_test_results
-> Purged 0 messages from spacer_test_jobs
-> Submitting 100 jobs... 
-> 100 jobs submitted.
-> [08:50:03] Status: 99 todo, 1 ongoing, 0 done, 0 extracted
-> [08:50:20] Status: 99 todo, 0 ongoing, 1 done, 1 extracted
-> [08:50:36] Status: 98 todo, 1 ongoing, 1 done, 1 extracted
-> [08:50:52] Status: 98 todo, 0 ongoing, 2 done, 2 extracted
-> [08:51:08] Status: 97 todo, 1 ongoing, 2 done, 2 extracted
-> [08:51:24] Status: 97 todo, 0 ongoing, 3 done, 3 extracted
-> [08:51:41] Status: 97 todo, 0 ongoing, 3 done, 3 extracted
-> [08:51:57] Status: 96 todo, 1 ongoing, 3 done, 3 extracted
-> [08:52:14] Status: 96 todo, 0 ongoing, 4 done, 4 extracted
-> [08:52:30] Status: 95 todo, 1 ongoing, 4 done, 4 extracted
-> [08:52:46] Status: 95 todo, 1 ongoing, 4 done, 5 extracted
-> [08:53:03] Status: 95 todo, 0 ongoing, 5 done, 5 extracted
-> [08:53:19] Status: 94 todo, 1 ongoing, 5 done, 5 extracted
-> [08:53:36] Status: 94 todo, 0 ongoing, 6 done, 6 extracted
-> [08:53:52] Status: 94 todo, 0 ongoing, 6 done, 6 extracted
-> [08:54:08] Status: 80 todo, 14 ongoing, 6 done, 6 extracted
-> [08:54:27] Status: 73 todo, 20 ongoing, 7 done, 7 extracted
-> [08:54:43] Status: 71 todo, 2 ongoing, 27 done, 27 extracted
-> [08:54:59] Status: 52 todo, 19 ongoing, 29 done, 33 extracted
-> [08:55:15] Status: 33 todo, 19 ongoing, 48 done, 48 extracted
-> [08:55:31] Status: 30 todo, 17 ongoing, 53 done, 64 extracted
-> [08:55:47] Status: 11 todo, 19 ongoing, 70 done, 70 extracted
-> [08:56:04] Status: 9 todo, 4 ongoing, 89 done, 89 extracted
-> [08:56:20] Status: 0 todo, 9 ongoing, 91 done, 91 extracted
-> [08:56:37] Status: 0 todo, 8 ongoing, 100 done, 100 extracted
-> All jobs done, purging results queue
-> Purged 100 messages from spacer_test_results
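
For reference, the submit-and-poll pattern these test scripts follow boils down to roughly the following (a minimal boto3 sketch against the test queues named above; the message bodies and counting logic are illustrative, not the actual aws_shakeout.py code):

import json
import time

import boto3

sqs = boto3.resource('sqs')
jobs_queue = sqs.get_queue_by_name(QueueName='spacer_test_jobs')
results_queue = sqs.get_queue_by_name(QueueName='spacer_test_results')

# Submit jobs to the test jobs queue (bodies here are placeholders,
# not the real spacer job messages).
nbr_jobs = 100
for job_id in range(nbr_jobs):
    jobs_queue.send_message(MessageBody=json.dumps({'job_id': job_id}))

# Poll the results queue until every job has reported back.
done = 0
while done < nbr_jobs:
    for msg in results_queue.receive_messages(MaxNumberOfMessages=10,
                                              WaitTimeSeconds=5):
        done += 1
        msg.delete()
    print(f'[{time.strftime("%H:%M:%S")}] Status: {done} done')
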
beijbom commented 4 years ago

@qiminchen : I deployed your extractor to AWS. These are outputs from the aws_deploy.py script for efficientnet_b0_ver1 and vgg16_coralnet_ver1. Note that this is just 1 point per image, but the runtime difference is nuts: 5.8 seconds vs. 0.3! I'm guessing this has to do with Caffe initialization being slow.

-> Starting ECS feature extraction for vgg16_coralnet_ver1.
-> Purged 0 messages from spacer_test_jobs
-> Purged 0 messages from spacer_test_results
-> Submitting 100 jobs... 
-> 100 jobs submitted.
-> [15:57:21] Status: 34 todo, 18 ongoing, 48 in results queue, 58 done
-> [15:57:38] Status: 11 todo, 3 ongoing, 86 in results queue, 87 done
-> [15:57:54] Status: 0 todo, 0 ongoing, 100 in results queue, 100 done
-> All jobs done.
-> Purged 100 messages from spacer_test_results
-> Average runtime: 5.868696775436401
-> Starting ECS feature extraction for efficientnet_b0_ver1.
-> Purged 0 messages from spacer_test_jobs
-> Purged 0 messages from spacer_test_results
-> Submitting 100 jobs... 
-> 100 jobs submitted.
-> [16:01:16] Status: 38 todo, 0 ongoing, 42 in results queue, 63 done
-> [16:01:32] Status: 0 todo, 9 ongoing, 98 in results queue, 100 done
-> All jobs done.
-> Purged 100 messages from spacer_test_results
-> Average runtime: 0.28868133068084717
qiminchen commented 4 years ago

That is impressive: efficientnet_b0 with PyTorch is nearly 20 times faster than vgg16 with Caffe. Great job!!

beijbom commented 4 years ago

@StephenChan : I ran some memory stress-tests on the new cluster. The nodes (c5.large) have 4 GB RAM. Check out scripts/aws/test_memory_aws.py for details on the code, but basically I ran all combinations of:

{'vgg16_coralnet_ver1', 'efficientnet_b0_ver1'}
IMAGE_SIZES = [
    (3000, 3000),  # 9 megapixels
    (10000, 10000),  # 100 megapixels
    (20000, 20000),  # 400 megapixels
]
NBR_ROWCOLS = [100, 1000, 3000]
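
The sweep over those combinations amounts to something like the following (a minimal sketch; submit_job here is a stand-in, not the actual test_memory_aws.py code):

import itertools

EXTRACTORS = ['vgg16_coralnet_ver1', 'efficientnet_b0_ver1']
IMAGE_SIZES = [(3000, 3000), (10000, 10000), (20000, 20000)]
NBR_ROWCOLS = [100, 1000, 3000]


def submit_job(extractor, image_size, nbr_points):
    # Stand-in: the real script queues an extraction job with nbr_points
    # randomly placed points on a test image of the given size.
    print(f'{extractor} {image_size}: {nbr_points} points')


for extractor, image_size, nbr_points in itertools.product(
        EXTRACTORS, IMAGE_SIZES, NBR_ROWCOLS):
    submit_job(extractor, image_size, nbr_points)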

For efficientnet we got:

[22:40:24] efficientnet_b0_ver1 (3000, 3000): 100 done in 7.37 s.
[22:41:26] efficientnet_b0_ver1 (3000, 3000): 1000 done in 71.18 s.
[22:44:30] efficientnet_b0_ver1 (3000, 3000): 3000 done in 212.10 s.
[22:41:26] efficientnet_b0_ver1 (10000, 10000): 100 done in 9.57 s.
[22:42:28] efficientnet_b0_ver1 (10000, 10000): 1000 done in 79.17 s.
[22:44:30] efficientnet_b0_ver1 (10000, 10000): 3000 done in 223.95 s.
[22:41:26] efficientnet_b0_ver1 (20000, 20000): 100 failed with: IndexError('tuple index out of range',).
[22:41:27] efficientnet_b0_ver1 (20000, 20000): 1000 failed with: IndexError('tuple index out of range',).
[22:41:26] efficientnet_b0_ver1 (20000, 20000): 3000 failed with: IndexError('tuple index out of range',)

For vgg16 we got:

[22:41:26] vgg16_coralnet_ver1 (3000, 3000): 100 done in 44.07 s.
[22:46:33] vgg16_coralnet_ver1 (3000, 3000): 1000 done in 398.37 s.
[22:59:58] vgg16_coralnet_ver1 (3000, 3000): 3000 done in 1190.59 s.
[22:41:26] vgg16_coralnet_ver1 (10000, 10000): 100 done in 49.26 s.
[22:47:34] vgg16_coralnet_ver1 (10000, 10000): 1000 done in 408.40 s.
[22:58:57] vgg16_coralnet_ver1 (10000, 10000): 3000 done in 1106.67 s.
[22:40:24] vgg16_coralnet_ver1 (20000, 20000): 100 failed with: IndexError('tuple index out of range',).
[22:40:24] vgg16_coralnet_ver1 (20000, 20000): 1000 failed with: IndexError('tuple index out of range',).
[22:40:25] vgg16_coralnet_ver1 (20000, 20000): 3000 failed with: IndexError('tuple index out of range',).

In other words, the nodes can't handle the (20000, 20000) images. Now, I don't quite understand that error message, so I could dig deeper, but I figured we need to set an image size limit anyway, and (10000, 10000) (100 megapixels) sounds like a nice round number to me. The only time the images are going to be larger than that is if they are mosaics, which don't really fit this model of random point sampling anyway.

So my proposal is that we set a 100 megapixel limit on spacer, and by extension also on CoralNet. I could try (15000, 15000) as well if you'd like us to go a bit higher. Looking at https://github.com/beijbom/coralnet/blob/master/project/config/settings/base.py it seems we are setting IMAGE_UPLOAD_MAX_DIMENSIONS = (8000, 8000) on CoralNet, so in that case we could even increase that a bit if you think the CoralNet side would be fine with it. By the way, do we want to restrict both the number of rows and the number of columns, or just the number of pixels (nrows*ncols)?

I also propose we set a limit of 3000 points per image, if we don't have such a limit already. (Again, I can try 5000 if you'd like to push this number higher.)

As an aside: did we always have that upload restriction? I seem to recall seeing much larger images than that on CoralNet.

EDIT: Ran a few more experiments, adding here for the record:

{'vgg16_coralnet_ver1', 'efficientnet_b0_ver1'}
IMAGE_SIZES = [
    (5000, 5000),  
    (10000, 10000), 
    (15000, 15000),  
]
NBR_ROWCOLS = [100, 1000, 3000, 5000]

Results:

[23:26:33] efficientnet_b0_ver1 (5000, 5000): 100 done in 8.35 s.
[23:27:35] efficientnet_b0_ver1 (5000, 5000): 1000 done in 74.44 s.
[23:30:42] efficientnet_b0_ver1 (5000, 5000): 3000 done in 223.21 s.
[23:32:47] efficientnet_b0_ver1 (5000, 5000): 5000 done in 367.47 s.
[23:27:34] efficientnet_b0_ver1 (10000, 10000): 100 done in 10.58 s.
[23:28:36] efficientnet_b0_ver1 (10000, 10000): 1000 done in 73.69 s.
[23:30:41] efficientnet_b0_ver1 (10000, 10000): 3000 done in 230.79 s.
[23:32:47] efficientnet_b0_ver1 (10000, 10000): 5000 done in 370.74 s.
[23:27:35] efficientnet_b0_ver1 (15000, 15000): 100 done in 18.82 s.
[23:28:36] efficientnet_b0_ver1 (15000, 15000): 1000 done in 78.86 s.
[23:31:45] efficientnet_b0_ver1 (15000, 15000): 3000 done in 238.40 s.
[23:33:50] efficientnet_b0_ver1 (15000, 15000): 5000 done in 374.94 s.
[23:26:33] vgg16_coralnet_ver1 (5000, 5000): 100 done in 47.09 s.
[23:32:46] vgg16_coralnet_ver1 (5000, 5000): 1000 done in 386.83 s.
[00:16:56] vgg16_coralnet_ver1 (5000, 5000): 3000 done in 1188.86 s.
[00:16:56] vgg16_coralnet_ver1 (5000, 5000): 5000 done in 1993.63 s.
[23:27:35] vgg16_coralnet_ver1 (10000, 10000): 100 done in 51.52 s.
[23:32:46] vgg16_coralnet_ver1 (10000, 10000): 1000 done in 402.50 s.
[00:16:57] vgg16_coralnet_ver1 (10000, 10000): 3000 done in 1176.29 s.
[00:16:57] vgg16_coralnet_ver1 (10000, 10000): 5000 done in 1934.98 s.
[23:27:34] vgg16_coralnet_ver1 (15000, 15000): 100 done in 66.44 s.
[23:33:50] vgg16_coralnet_ver1 (15000, 15000): 1000 done in 418.98 s.
StephenChan commented 4 years ago

Cool, it's definitely nice to have some stress-test results.

[22:41:26] efficientnet_b0_ver1 (20000, 20000): 100 failed with: IndexError('tuple index out of range',).

My first thought here is that Pillow or some other library might have a maximum accepted image resolution. Just a guess though.
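
For reference, Pillow's resolution guard can be inspected as shown below. It normally surfaces as a DecompressionBombWarning or DecompressionBombError rather than an IndexError, so this is just one candidate to rule out:

from PIL import Image

# Pillow's decompression-bomb guard: images with more pixels than this
# trigger a DecompressionBombWarning, and more than twice this an error.
print(Image.MAX_IMAGE_PIXELS)  # default is roughly 89 million pixels

# The threshold can be raised (or disabled with None) if huge images are
# expected, e.g. to allow 20000 x 20000 test images:
Image.MAX_IMAGE_PIXELS = 20000 * 20000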

... I figured we need to set an image size limit anyway, and (10000, 10000) (100 megapixels) sounds like a nice round number to me. The only time the images are going to be larger than that is if they are mosaics, which don't really fit this model of random point sampling anyway.

my proposal is that we set a 100 megapixel limit on spacer, and by extension also on CoralNet. I could try (15000, 15000) as well if you'd like us to go a bit higher. Looking at https://github.com/beijbom/coralnet/blob/master/project/config/settings/base.py it seems we are setting IMAGE_UPLOAD_MAX_DIMENSIONS = (8000, 8000) on CoralNet, so in that case we could even increase that a bit if you think the CoralNet side would be fine with it.

I think 10000 x 10000 sounds perfectly fine; it's not that much of a jump from 8000 x 8000. 15000 x 15000 is probably okay as well. But if there isn't a significant use case for going higher than 10k, then I'll leave it up to you whether you want to test more resolutions.

In terms of website performance, the main thing I would be concerned with is the annotation tool, particularly on zooming in/out and adjusting brightness/contrast. However, I'm currently finding both actions to be fairly quick even at 8000 x 8000 (on my old laptop).

do we want to restrict both number of rows and number of cols, or just the number of pixels (nrows*ncols)?

In terms of my website performance concerns above, only the number of pixels should matter. Of course, just make sure 50000 x 1000 doesn't get an IndexError in spacer.

As an aside: did we always have that upload restriction? I seem to recall seeing much larger images than that on CoralNet.

It was implemented in 2016 (https://github.com/beijbom/coralnet/commit/25a52c4141279e33ea263684b009439797394df2).

I also propose we set a limit of 3000 points per image, if we don't have such a limit already. (Again, I can try 5000 if you'd like to push this number higher.)

We have a limit, and it's currently 1000 points per image. More points will definitely slow down the annotation tool responsiveness in all aspects, and 1000 points is already really pushing it, last I checked. (This may be improvable with annotation tool refactoring, but I once again point to https://github.com/beijbom/coralnet/issues/55 as a prerequisite.)

However, I don't see particular problems with increasing the limit for the deploy API. We'd just want to take another look at the API rate limiting scheme. Instead of "max 100 images per request, max 1000 points per image" we could have, for example, "max 100 images and 100,000 points per request, max 5000 points per image".

beijbom commented 4 years ago

Thanks @StephenChan. All great points. I ran a few more tests and it seems we could go up to 15000x15000 for efficientnet, and up to 5000 points per image for most image sizes (both image size and number of points per image contribute to memory usage). For simplicity I think I will just set the limit to 10000x10000 pixels and 1000 points per image. If we see a use case down the road, for deploy or server, we can always increase it later. I figure it is better to be safe than sorry. Does that sound ok?
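
For concreteness, those limits amount to a check along these lines on the spacer side (an illustrative sketch only; the constants and function name are hypothetical, not existing pyspacer API):

MAX_PIXELS = 10000 * 10000  # 100 megapixel cap on image size
MAX_POINTS = 1000           # cap on points per image


def validate_extract_job(width, height, nbr_points):
    """Reject feature-extraction jobs that exceed the agreed limits."""
    if width * height > MAX_PIXELS:
        raise ValueError(
            f'Image of {width} x {height} pixels exceeds the '
            f'{MAX_PIXELS}-pixel limit.')
    if nbr_points > MAX_POINTS:
        raise ValueError(
            f'{nbr_points} points exceeds the {MAX_POINTS} '
            f'points-per-image limit.')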

A few notes:

StephenChan commented 4 years ago

For simplicity I think I will just set the limit to 10000x10000 pixels and 1000 points per image. If we see a use case down the road, for deploy or server, we can always increase it later. I figure it is better to be safe than sorry. Does that sound ok? ... You can keep (8000, 8000) on CoralNet, up to you. It's ok that the spacer limit > CoralNet limit. Just not the other way around... We can increase later if there is an ask.

Sounds good. I'll just keep the CoralNet limits as-is for now, then.

Re. points per image. Can you point me to where we set this limit? I don't see it in the config files.

Yeah, it's not currently a config variable:

Can you crawl the production server for instances where we have >1000 points per image or >8000*8000 pixels per image that may have snuck in before we imposed the limits? I just want to make sure we don't break the backend if folks decide to re-process them when we roll out the new backend.

Sure. The 1000 points limit was imposed really early, if not right from the beginning, so that's less likely to be a problem I think. But I can check both.

beijbom commented 4 years ago

Yeah, it's not currently a config variable:

Cool. How about uploads? Do you check there also?

beijbom commented 4 years ago

@qiminchen @StephenChan : this PR is ready for final review. I have updated the PR description at the top.

qiminchen commented 4 years ago

All test cases pass both in the docker run and locally; the Caffe test is skipped locally since Caffe isn't available there. Nice work!

StephenChan commented 4 years ago

Sorry for the delay here:

Can you crawl the production server for instances where we have >1000 points per image or >8000*8000 pixels per image that may have snuck in before we imposed the limits? I just want to make sure we don't break the backend if folks decide to re-process them when we roll out the new backend.

No images with >1000 points.

3 images with resolutions exceeding 8000 x 8000:

  • Image 121792 - 15771 x 7624
  • Image 228757 - 5418 x 21913
  • Image 333987 - 9726 x 8545

How about uploads? Do you check [the 1000 point limit] there also?

...No, good catch. I'll make an issue for that. I guess no one's exceeded this limit yet because CPCe's limit is 500.

beijbom commented 4 years ago

Sorry for the delay here:

Can you crawl the production server for instances where we have >1000 points per image or >8000*8000 pixels per image that may have snuck in before we imposed the limits? I just want to make sure we don't break the backend if folks decide to re-process them when we roll out the new backend.

No images with >1000 points.

Great!

3 images with resolutions exceeding 8000 x 8000:

  • Image 121792 - 15771 x 7624
  • Image 228757 - 5418 x 21913
  • Image 333987 - 9726 x 8545

Ok. The first two are above 100 megapixels. Hmm. I actually know the owner of both of those images. Let me ask him to delete them.

How about uploads? Do you check [the 1000 point limit] there also?

...No, good catch. I'll make an issue for that. I guess no one's exceeded this limit yet because CPCe's limit is 500.

It's not just CPCe, right? People can upload annotations using CSV files. So we could have a power user who generates his own 2000 points. Unlikely, obviously, but possible.

StephenChan commented 4 years ago

It's not just CPCe, right? People can upload annotations using CSV files. So we could have a power user who generates his own 2000 points. Unlikely, obviously, but possible.

Yeah, that's possible. I just figured a lot of folks are used to a lower point limit due to CPCe, even if they're not importing directly from CPC files.

beijbom commented 4 years ago

3 images with resolutions exceeding 8000 x 8000:

  • Image 121792 - 15771 x 7624
  • Image 228757 - 5418 x 21913
  • Image 333987 - 9726 x 8545

Ok. The first two are above 100 megapixels. Hmm. I actually know the owner of both of those images. Let me ask him to delete them.

@StephenChan : FYI. I talked to John and those images are deleted.