aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Example Request] #2905

Open MirelaGTulbure opened 2 years ago

MirelaGTulbure commented 2 years ago

**Describe the use case example you want to see**

I am using AWS SageMaker for an image classification task with a private workforce. I have 6252 images to label. However, I can't work out whether it is possible to label more than 1,000 images in one job, or whether I need to split the work into 7 jobs (6 x 1,000 images + 252). The setting that seems to dictate the maximum number of images I can label in one job is "MaxConcurrentTaskCount": 1000, which accepts a maximum value of 1000. Given the name of the variable, this seems counterintuitive to me. The variable that I can set to the number of images I have, "MaxHumanLabeledObjectCount": 6252, doesn't seem to make a difference.

Please let me know if it is possible to label more than 1000 images and if yes, what would be the settings to change. Can you provide me with an example? If it is not possible to label more than 1000 as part of one job, can you provide a code snippet that submits multiple jobs as part of several iterations?

Thank you very much!! I look forward to your response.

**How would this example be used? Please describe.**

I am labeling satellite images using SageMaker and a private workforce.

**Describe which SageMaker services are involved**

AWS SageMaker for an image classification task

**Describe what other services (other than SageMaker) are involved**

S3

jkroll-aws commented 2 years ago

MaxConcurrentTaskCount is defined in the HumanTaskConfig API. Note that this is specifically for concurrent tasks, not the total number of tasks in the job.

MaxConcurrentTaskCount: Defines the maximum number of data objects that can be labeled by human workers at the same time. Also referred to as batch size. Each object may have more than one worker at one time. The default value is 1000 objects.
Type: Integer
Valid Range: Minimum value of 1. Maximum value of 1000.
Required: No

MaxHumanLabeledObjectCount is defined in the LabelingJobStoppingConditions API:

MaxHumanLabeledObjectCount: The maximum number of objects that can be labeled by human workers.
Type: Integer
Valid Range: Minimum value of 1.
Required: No

The maximum number of items in an image classification Ground Truth labeling job is 100,000 items, so you can label all 6252 of your images in a single job. See Input Data Quotas.
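As a rough illustration of how the two settings relate, here is a minimal Python sketch. The helper names `labeling_job_limits` and `batches_needed` are hypothetical and not part of any AWS SDK; the limits are simply the values quoted in this thread. The returned dictionary contains only the two fragments discussed here, so it would still need to be merged into a complete `create_labeling_job` request with all the other required fields.

```python
import math

# Limits as quoted in this thread (hard-coded here, not fetched from AWS).
MAX_CONCURRENT_TASK_COUNT = 1000   # upper bound for MaxConcurrentTaskCount
MAX_OBJECTS_PER_JOB = 100_000      # image classification quota cited above


def labeling_job_limits(total_objects, batch_size=MAX_CONCURRENT_TASK_COUNT):
    """Hypothetical helper: build only the two request fragments discussed here."""
    if not 1 <= batch_size <= MAX_CONCURRENT_TASK_COUNT:
        raise ValueError("MaxConcurrentTaskCount must be between 1 and 1000")
    if not 1 <= total_objects <= MAX_OBJECTS_PER_JOB:
        raise ValueError("dataset size is outside the per-job quota cited above")
    return {
        # Concurrency (batch size): how many objects are out for labeling at once.
        "HumanTaskConfig": {"MaxConcurrentTaskCount": batch_size},
        # Total cap: how many objects humans may label over the whole job.
        "StoppingConditions": {"MaxHumanLabeledObjectCount": total_objects},
    }


def batches_needed(total_objects, batch_size=MAX_CONCURRENT_TASK_COUNT):
    """Minimum number of batches if at most batch_size objects go out at a time."""
    return math.ceil(total_objects / batch_size)


limits = labeling_job_limits(6252)
print(limits)
print(batches_needed(6252))
```

Under these assumptions, the 6252 images fit in one job: the concurrency setting only bounds how many objects are in flight at a time, so the job would work through at least 7 batches rather than requiring 7 separate jobs.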

For further reference, check out this example notebook which uses the MaxConcurrentTaskCount in an image classification use case: From Unlabeled Data to a Deployed Machine Learning Model: A SageMaker Ground Truth Demonstration for Image Classification.

MirelaGTulbure commented 2 years ago

Thank you so much for this. After reading what you sent me, I was able to run a single job with all 6252 of my images by removing the MaxConcurrentTaskCount setting. My settings file is available here.

However, I ran into two other issues:

  1. My job status was "Complete with labeling errors", with only 5853 of the 6252 dataset objects labeled. I labeled the objects over 4 days, during which 3 expired, so I expected 3 dataset objects to be missing, not 399. What could the reason for this be?

  2. When I look at the output labels, quite a few of them are clearly wrong. Since I labeled them all myself, this should be impossible (I may have mislabeled 1 or 2, but not hundreds). Could this be related to the issue above? What else could have gone wrong?

Thanks so much!!