Closed — beijbom closed this 3 years ago
@qiminchen : I noticed that a small fraction of jobs didn't complete, so I added some more logging and this was the error I found. I checked a few and it's the same error. This is really confusing to me. Why would some fraction of jobs (98% or so) pass this assert while some don't? Do you recall seeing anything about this type of issue when you wrote this?
It's even more confusing since the model weights were downloaded before this job started, to a shared volume.
| Timestamp | Message |
| --- | --- |
| 2020-09-28T22:49:45.849-07:00 | INFO:root:-> Received boto job for ENV {"key": "tmp/08bfc10v7t.png.2020-09-28_22:38:31.414840.feats.json.job_msg.json", "storage_type": "s3", "bucket_name": "spacer-test"}.
| 2020-09-28T22:49:45.849-07:00 | INFO:root:-> Deserializing job message location...
| 2020-09-28T22:49:45.850-07:00 | INFO:root:-> Done deserializing job message location.
| 2020-09-28T22:49:45.850-07:00 | INFO:root:-> Instantiating job message...
| 2020-09-28T22:49:45.971-07:00 | INFO:root:-> Done instantiating job message: {}.
| 2020-09-28T22:49:45.971-07:00 | INFO:root:-> Extracting features for job:regression_job.
| 2020-09-28T22:49:45.971-07:00 | INFO:root:-> Initializing EfficientNetExtractor
| 2020-09-28T22:49:46.093-07:00 | INFO:root:-> Extracting features for regression_job...
| 2020-09-28T22:49:46.094-07:00 | INFO:root:-> Cropping 1 patches...
| 2020-09-28T22:49:46.119-07:00 | INFO:root:-> Done dropping 1 patches.
| 2020-09-28T22:49:46.119-07:00 | INFO:root:-> Extracting features...
| 2020-09-28T22:49:46.464-07:00 | ERROR:root:Error executing job regression_job: Traceback (most recent call last):
| 2020-09-28T22:49:46.464-07:00 | File "/workspace/spacer/spacer/tasks.py", line 130, in process_job
| 2020-09-28T22:49:46.464-07:00 | results.append(run[job_msg.task_name](task))
| 2020-09-28T22:49:46.464-07:00 | File "/workspace/spacer/spacer/tasks.py", line 41, in extract_features
| 2020-09-28T22:49:46.464-07:00 | features, return_msg = extractor(img, msg.rowcols)
| 2020-09-28T22:49:46.464-07:00 | File "/workspace/spacer/spacer/extract_features.py", line 141, in __call__
| 2020-09-28T22:49:46.464-07:00 | feats = extract_feature(patch_list, torch_params)
| 2020-09-28T22:49:46.464-07:00 | File "/workspace/spacer/spacer/torch_utils.py", line 69, in extract_feature
| 2020-09-28T22:49:46.464-07:00 | net = load_weights(net, pyparams)
| 2020-09-28T22:49:46.464-07:00 | File "/workspace/spacer/spacer/torch_utils.py", line 43, in load_weights
| 2020-09-28T22:49:46.464-07:00 | assert sha256 == config.MODEL_WEIGHTS_SHA[pyparams['model_name']]
| 2020-09-28T22:49:46.464-07:00 | AssertionError
| 2020-09-28T22:49:46.468-07:00 | INFO:root:-> Done processing job.
| 2020-09-28T22:49:46.468-07:00 | INFO:root:-> Writing results to spacer_shakeout_results.
| 2020-09-28T22:49:46.545-07:00 | INFO:root:-> Done writing results to spacer_shakeout_results.
| 2020-09-28T22:49:46.546-07:00 | True
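For context, the failing assert compares a SHA-256 digest of the weights file on disk against a pinned value in the config. A minimal sketch of such a check (the helper name here is mine, not pyspacer's):

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Hash the file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# In load_weights, the check would then be roughly:
# assert file_sha256(weights_path) == config.MODEL_WEIGHTS_SHA[model_name]
```

A mismatch here means the bytes on disk differ from what was published, e.g. a truncated or still-in-progress download.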
Hmm, this is super weird, as I ran this unit test and everything went well. It doesn't make sense that some feature extraction jobs pass the assertion while others don't, since we download the weights at the very beginning and don't overwrite the file afterwards...
@qiminchen : Just to loop back here. The error was due to multiple jobs running at the same time, with one downloading the model to a shared local disk while another tried to read it. I have addressed this in the AWS Batch setup.
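One common way to make that race safe at the application level is to serialize the download with a lock file and publish via an atomic rename, so a reader either sees no file or a complete one. A sketch only (not the actual fix in the AWS Batch setup; `download_fn` is a hypothetical fetch callable, and `fcntl` makes this Unix-only):

```python
import fcntl
import os
import tempfile

def ensure_weights(local_path, download_fn):
    """Download model weights at most once, even with concurrent jobs.

    download_fn(dest_path) is a hypothetical callable that fetches the
    weights to dest_path; the lock file serializes downloads across
    processes on the same host.
    """
    lock_path = local_path + ".lock"
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # block until we hold the lock
        if not os.path.exists(local_path):
            # Write to a temp file in the same directory, then atomically
            # rename, so readers never see a half-written weights file.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(local_path) or ".")
            os.close(fd)
            download_fn(tmp)
            os.replace(tmp, local_path)
    return local_path
```

On a shared volume across hosts, `flock` is not reliable; there the atomic-rename part still helps, but coordination has to happen elsewhere (as was done here, in the Batch setup).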
@StephenChan @qiminchen : this is ready for review.
@qiminchen : you mind taking a look at this one as well?
Am I right in assuming `batch_simple.py` would be run like this?

```
docker run -v </path/to/your/local/models>:/workspace/models -v ${PWD}:/workspace/spacer/ -it beijbom/pyspacer:v0.2.7 python3 scripts/aws/batch_simple.py
```
When I try that, I get `botocore.exceptions.NoRegionError: You must specify a region.`
I assume it's some config I'm missing, but I wasn't sure what. I'm trying this in a local VM, by the way, not on an EC2 instance.
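For what it's worth, `NoRegionError` just means boto3 found no region anywhere in its config chain (explicit argument, environment variables, `~/.aws/config`, instance metadata), and inside a fresh container or VM none of those typically exist. One workaround is to set the environment variable before any client is created (the region value below is an assumption):

```python
import os

# boto3 resolves the region from, in order: an explicit region_name
# argument, the AWS_REGION / AWS_DEFAULT_REGION environment variables,
# ~/.aws/config, and finally EC2 instance metadata. In a bare container
# with none of those, exporting the variable is the simplest fix.
os.environ.setdefault("AWS_DEFAULT_REGION", "us-west-2")  # assumed region

# Equivalently, pass it per client:
#   s3 = boto3.client("s3", region_name="us-west-2")
```

Passing `-e AWS_DEFAULT_REGION=...` to `docker run` achieves the same thing without code changes.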
All unit tests pass on my end though. Also, good to see an example of porting to boto3. That'll be useful for doing that port in CoralNet itself as well.
> Am I right in assuming `batch_simple.py` would be run like this?
I always run this outside Docker, in my IDE. Can you try that? I'm not sure why it complains about the region inside the Docker container.
@qiminchen : I merged the other branch into this one. No other changes. You mind approving again?
> I always run this outside docker, in my IDE. Can you try that?
Ah, running the script from a local clone setup worked. I was able to see the jobs in the AWS Batch dashboard as they ran. Very nice!
Just a couple of numpy warnings as it started up, but I'm not sure if they're important.
This PR doesn't actually do that much; it just looks like it does. ¯\\\_(ツ)\_/¯
Summary

Main changes are below. There are a lot of files changed since I cleaned up the logging system a bit. After this PR I still have one stylistic thing to do, namely to change things so that `TrainClassifierMsg` directly holds the train and val data. I did find some odd results on the `efficientNet` vs `vgg` runtimes -- I'll discuss those in a separate issue. Note that it is not related to this PR, nor to the issue I posted below last week.

How to test

Run the `scripts/aws/batch_simple.py` script. It will run 10 jobs in the cloud.

Details