Closed — beijbom closed this 3 years ago
@qiminchen : I noticed that a small fraction of jobs didn't complete, so I added some more logging and this was the error I found. I checked a few and it's the same error. This is really confusing to me. Why would some fraction of jobs (98% or so) pass this assert while some don't? Do you recall seeing anything about this type of issue when you wrote this?
It's even more confusing since the model weights were downloaded before this job started, to a shared volume.
| Timestamp | Message |
| --- | --- |
| 2020-09-28T22:49:45.849-07:00 | INFO:root:-> Received boto job for ENV {"key": "tmp/08bfc10v7t.png.2020-09-28_22:38:31.414840.feats.json.job_msg.json", "storage_type": "s3", "bucket_name": "spacer-test"}.
| 2020-09-28T22:49:45.849-07:00 | INFO:root:-> Deserializing job message location...
| 2020-09-28T22:49:45.850-07:00 | INFO:root:-> Done deserializing job message location.
| 2020-09-28T22:49:45.850-07:00 | INFO:root:-> Instantiating job message...
| 2020-09-28T22:49:45.971-07:00 | INFO:root:-> Done instantiating job message: {}.
| 2020-09-28T22:49:45.971-07:00 | INFO:root:-> Extracting features for job:regression_job.
| 2020-09-28T22:49:45.971-07:00 | INFO:root:-> Initializing EfficientNetExtractor
| 2020-09-28T22:49:46.093-07:00 | INFO:root:-> Extracting features for regression_job...
| 2020-09-28T22:49:46.094-07:00 | INFO:root:-> Cropping 1 patches...
| 2020-09-28T22:49:46.119-07:00 | INFO:root:-> Done dropping 1 patches.
| 2020-09-28T22:49:46.119-07:00 | INFO:root:-> Extracting features...
| 2020-09-28T22:49:46.464-07:00 | ERROR:root:Error executing job regression_job: Traceback (most recent call last):
| 2020-09-28T22:49:46.464-07:00 | File "/workspace/spacer/spacer/tasks.py", line 130, in process_job
| 2020-09-28T22:49:46.464-07:00 | results.append(run[job_msg.task_name](task))
| 2020-09-28T22:49:46.464-07:00 | File "/workspace/spacer/spacer/tasks.py", line 41, in extract_features
| 2020-09-28T22:49:46.464-07:00 | features, return_msg = extractor(img, msg.rowcols)
| 2020-09-28T22:49:46.464-07:00 | File "/workspace/spacer/spacer/extract_features.py", line 141, in __call__
| 2020-09-28T22:49:46.464-07:00 | feats = extract_feature(patch_list, torch_params)
| 2020-09-28T22:49:46.464-07:00 | File "/workspace/spacer/spacer/torch_utils.py", line 69, in extract_feature
| 2020-09-28T22:49:46.464-07:00 | net = load_weights(net, pyparams)
| 2020-09-28T22:49:46.464-07:00 | File "/workspace/spacer/spacer/torch_utils.py", line 43, in load_weights
| 2020-09-28T22:49:46.464-07:00 | assert sha256 == config.MODEL_WEIGHTS_SHA[pyparams['model_name']]
| 2020-09-28T22:49:46.464-07:00 | AssertionError
| 2020-09-28T22:49:46.468-07:00 | INFO:root:-> Done processing job.
| 2020-09-28T22:49:46.468-07:00 | INFO:root:-> Writing results to spacer_shakeout_results.
| 2020-09-28T22:49:46.545-07:00 | INFO:root:-> Done writing results to spacer_shakeout_results.
| 2020-09-28T22:49:46.546-07:00 | True
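For context, the failing assert compares a SHA-256 digest of the weights file on disk against a pinned value in the config. A minimal sketch of such a check (the helper name here is mine, not pyspacer's):

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Hash the file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# In load_weights, the check would then be roughly:
# assert file_sha256(weights_path) == config.MODEL_WEIGHTS_SHA[model_name]
```

A mismatch here means the bytes on disk differ from what was published, e.g. a truncated or still-in-progress download.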
Hmm, this is super weird, as I ran this unit test and everything went well. It doesn't make sense that some feature extraction jobs pass the assertion while others don't, since we download the weights at the very beginning and don't overwrite the file afterwards...
@qiminchen : Just to loop back here. The error was due to multiple jobs running at the same time, with one downloading the model to a shared local disk while another tried to read it. I have addressed this in the AWS Batch setup.
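One common way to make that race safe at the application level is to serialize the download with a lock file and publish via an atomic rename, so a reader either sees no file or a complete one. A sketch only (not the actual fix in the AWS Batch setup; `download_fn` is a hypothetical fetch callable, and `fcntl` makes this Unix-only):

```python
import fcntl
import os
import tempfile

def ensure_weights(local_path, download_fn):
    """Download model weights at most once, even with concurrent jobs.

    download_fn(dest_path) is a hypothetical callable that fetches the
    weights to dest_path; the lock file serializes downloads across
    processes on the same host.
    """
    lock_path = local_path + ".lock"
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # block until we hold the lock
        if not os.path.exists(local_path):
            # Write to a temp file in the same directory, then atomically
            # rename, so readers never see a half-written weights file.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(local_path) or ".")
            os.close(fd)
            download_fn(tmp)
            os.replace(tmp, local_path)
    return local_path
```

On a shared volume across hosts, `flock` is not reliable; there the atomic-rename part still helps, but coordination has to happen elsewhere (as was done here, in the Batch setup).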
@StephenChan @qiminchen : this is ready for review.
@qiminchen : you mind taking a look at this one as well?
Am I right in assuming `batch_simple.py` would be run like this?

```
docker run -v </path/to/your/local/models>:/workspace/models -v ${PWD}:/workspace/spacer/ -it beijbom/pyspacer:v0.2.7 python3 scripts/aws/batch_simple.py
```
When I try that, I get `botocore.exceptions.NoRegionError: You must specify a region.`
I assume it's some config I'm missing, but I wasn't sure what. I'm trying this in a local VM, by the way, not on an EC2 instance.
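For what it's worth, `NoRegionError` just means boto3 found no region anywhere in its config chain (explicit argument, environment variables, `~/.aws/config`, instance metadata), and inside a fresh container or VM none of those typically exist. One workaround is to set the environment variable before any client is created (the region value below is an assumption):

```python
import os

# boto3 resolves the region from, in order: an explicit region_name
# argument, the AWS_REGION / AWS_DEFAULT_REGION environment variables,
# ~/.aws/config, and finally EC2 instance metadata. In a bare container
# with none of those, exporting the variable is the simplest fix.
os.environ.setdefault("AWS_DEFAULT_REGION", "us-west-2")  # assumed region

# Equivalently, pass it per client:
#   s3 = boto3.client("s3", region_name="us-west-2")
```

Passing `-e AWS_DEFAULT_REGION=...` to `docker run` achieves the same thing without code changes.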
All unit tests pass on my end though. Also, good to see an example of porting to boto3. That'll be useful for doing that port in CoralNet itself as well.
> Am I right in assuming `batch_simple.py` would be run like this?
I always run this outside Docker, in my IDE. Can you try that? I'm not sure why it complains about the region inside the Docker container.
@qiminchen : I merged the other branch into this one. No other changes. You mind approving again?
> I always run this outside docker, in my IDE. Can you try that?
Ah, running the script from a local clone setup worked. I was able to see the jobs in the AWS Batch dashboard as they ran. Very nice!
Just a couple of numpy warnings as it started up, but I'm not sure if they're important.
This PR doesn't actually do that much; it just looks like it does. ¯\\\_(ツ)\_/¯
Summary

Main changes are below. There are a lot of files changed since I cleaned up the logging system a bit. After this PR I still have one stylistic thing to do, namely to change things so that `TrainClassifierMsg` directly holds the train and val data. I did find some odd results on the `efficientNet` vs `vgg` runtimes -- I'll discuss those in a separate issue. Note that it is not related to this PR, nor to the issue I posted below last week.

How to test

Run the `scripts/aws/batch_simple.py` script. It will run 10 jobs in the cloud.

Details