aws-samples / amazon-comprehend-semi-structured-documents-annotation-tools


Is my dataset randomly split every time I train a new NER model? #16

Closed jetsonearth closed 2 years ago

jetsonearth commented 2 years ago

Hi @dnlen, the dataset is split automatically, but it seems like the same set of data is being allocated to the training/validation/test sets, because after I trained a new model I got the same model performance. How can I use different data for training and perform cross-validation? I was exploring splitting the data with SageMaker Data Wrangler, but would it work on the annotation files and the manifest file produced by this Comprehend annotation tool?

[Screenshot: Screen Shot 2022-07-18 at 6.02.50 PM]
dnlen commented 2 years ago

For more information on your trained Comprehend model, you can open the model in the console and go to the Application Integration tab, where you can see details such as the NumberOfTrainMentions for each entity. Could you check whether those numbers are the same for each entity across both trained models? This will help verify whether the data split really is non-random between training jobs.
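
If you prefer to check this programmatically instead of in the console, the same per-entity counts are returned by the DescribeEntityRecognizer API. A minimal boto3 sketch (the region and recognizer ARN below are placeholders):

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Placeholder ARN -- replace with the ARN of your trained recognizer
recognizer_arn = "arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/my-recognizer"

response = comprehend.describe_entity_recognizer(EntityRecognizerArn=recognizer_arn)
metadata = response["EntityRecognizerProperties"]["RecognizerMetadata"]

# Print NumberOfTrainMentions per entity type; compare these across two training jobs
for entity in metadata["EntityTypes"]:
    print(entity["Type"], entity["NumberOfTrainMentions"])
```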

The annotation files and manifest file from the Comprehend annotation tool were not designed to be used with SageMaker Data Wrangler. It may be doable, but some data manipulation would likely be required; see the sketch below for one alternative.
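
For example, if you want your own cross-validation folds, one option is to split the augmented manifest yourself before training, since it is a JSON Lines file with one document per line. A rough sketch, assuming the output manifest has been downloaded locally as output.manifest (file names and the 80/20 ratio are illustrative):

```python
import json
import random

# Assumed local copy of the augmented manifest produced by the annotation job
with open("output.manifest") as f:
    records = [json.loads(line) for line in f if line.strip()]

random.seed(42)          # change the seed for a different fold
random.shuffle(records)

split_point = int(0.8 * len(records))
train, test = records[:split_point], records[split_point:]

# Write two manifests that can be uploaded to S3 and referenced separately
for name, subset in (("train.manifest", train), ("test.manifest", test)):
    with open(name, "w") as out:
        for record in subset:
            out.write(json.dumps(record) + "\n")
```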

jetsonearth commented 2 years ago

Hi @dnlen - no, the numbers are not the same.

[Screenshot: Screen Shot 2022-07-20 at 4.17.56 PM]
dnlen commented 2 years ago

Since the "Number of train mentions" values differ between the two models, this confirms that the data allocated to train/test/validation is not the same in every training job.

We do not offer the option to split the data into three sets (train/validation/test) yourself, but you can split it into two, one manifest used for training/validation and one for test, using our Split attribute: https://docs.aws.amazon.com/comprehend/latest/dg/API_AugmentedManifestsListItem.html#comprehend-Type-AugmentedManifestsListItem-Split
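
A hedged sketch of what that can look like when starting a training job with boto3, passing two augmented manifests, one marked TRAIN and one marked TEST (the recognizer name, role ARN, bucket, entity types, and attribute names are all placeholders; for semi-structured/PDF annotations each manifest item can also take DocumentType, AnnotationDataS3Uri, and SourceDocumentsS3Uri as described in the linked API page):

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

comprehend.create_entity_recognizer(
    RecognizerName="my-recognizer",  # placeholder name
    LanguageCode="en",
    # Placeholder role that Comprehend assumes to read the S3 data
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    InputDataConfig={
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [{"Type": "ENTITY_A"}, {"Type": "ENTITY_B"}],  # placeholder types
        "AugmentedManifests": [
            {
                "S3Uri": "s3://my-bucket/train.manifest",  # placeholder: training/validation data
                "AttributeNames": ["my-labeling-job"],     # placeholder attribute name
                "Split": "TRAIN",
            },
            {
                "S3Uri": "s3://my-bucket/test.manifest",   # placeholder: held-out test data
                "AttributeNames": ["my-labeling-job"],
                "Split": "TEST",
            },
        ],
    },
)
```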