PiSchool / enterprise-document-classification

MIT License

how to prepare my own dataset #1

Open wanghaisheng opened 6 years ago

wanghaisheng commented 6 years ago

Let's say I want to classify domain documents into about 48 categories. Should I create a dataset like the RVL-CDIP dataset, which has 400,000 grayscale images in 16 classes, with 25,000 images per class? What is the proper DPI for the document images? Should I convert them to grayscale?


robical commented 6 years ago

Hi,

If you want to test your methodology first, the RVL-CDIP dataset is the easier option, since all 400K documents have already been manually classified into 16 classes; if you want to further extend the classification granularity of RVL-CDIP to 48 classes, you have several options (listed below).

If you want to classify documents from a different, specific domain (say, English literature documents), you can still start from the CNN weights trained on the RVL-CDIP dataset and retrain the model on your own classes, which you would have to label manually first. The number of images per class you need depends strongly on the depth of your network; it is not mandatory to use the same architecture as the RVL-CDIP article (a variation of AlexNet, quite deep). It is, however, important to keep the number of training examples per class balanced, in order to avoid introducing bias during training: if the number of training samples is not approximately the same for all classes, the risk is that the model becomes biased toward the classes that are over-represented in the training set. A minimal fine-tuning sketch is included after the list below.

Instead, if your purpose is to increase the level of categorization detail of the RVL-CDIP dataset itself, here are a few options (not exhaustive, of course):

1) Use the OCR part of the RVL-CDIP dataset and apply BoW or semantic quantization (e.g. word2vec) plus clustering, in order to obtain subclasses (a minimal sketch of this follows the list).
2) Apply clustering techniques to various features of each single class and use a sparse visualization technique to check whether there is any other obvious additional category.
3) Use 1) with doc2vec and see if there is any way to rebuild the dataset per topic; that would be extremely useful in real-life scenarios.
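Here is a minimal sketch of option 1, assuming the OCR text of the documents belonging to one RVL-CDIP class has already been loaded into a list of strings; the placeholder texts, the number of clusters, and the variable names are illustrative assumptions, not code from this repository:

```python
# Sketch of option 1: bag-of-words (TF-IDF) features on the OCR text of a single
# RVL-CDIP class, clustered with k-means to propose candidate subclasses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy placeholder strings: replace with the real OCR output of one class.
ocr_texts = [
    "invoice total amount due payment terms net 30",
    "invoice number purchase order billing address",
    "quarterly earnings report revenue operating income",
    "memorandum to all staff regarding the holiday schedule",
]

# TF-IDF (weighted bag-of-words) representation of each document's OCR text.
vectorizer = TfidfVectorizer(max_features=20000, stop_words="english")
X = vectorizer.fit_transform(ocr_texts)

# Cluster into candidate subclasses; 3 subclusters per class would take the
# 16 original classes to ~48 categories, but the number must be validated manually.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
subclass_ids = kmeans.fit_predict(X)

# Inspect the highest-weighted terms of each cluster to judge whether it
# corresponds to a meaningful subclass.
terms = vectorizer.get_feature_names_out()
for c in range(kmeans.n_clusters):
    top = kmeans.cluster_centers_[c].argsort()[::-1][:5]
    print(f"cluster {c}:", [terms[i] for i in top])
```

The same skeleton works for the word2vec/doc2vec variants mentioned in options 1) and 3): only the TF-IDF step is swapped for document embeddings.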
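And here is a minimal sketch of the fine-tuning approach mentioned above, written with Keras; the weights file name, directory layout, input size and hyper-parameters are hypothetical placeholders, not the actual code of this repository:

```python
# Sketch: fine-tune a CNN pretrained on RVL-CDIP for 48 new, manually labelled classes.
from tensorflow import keras

NUM_CLASSES = 48  # target number of categories from the question

# Load the pretrained network (hypothetical saved file) and drop its 16-way head.
base = keras.models.load_model("rvlcdip_pretrained.h5")
features = base.layers[-2].output  # activations just before the old classifier
outputs = keras.layers.Dense(NUM_CLASSES, activation="softmax")(features)
model = keras.Model(inputs=base.input, outputs=outputs)

# Freeze everything except the new head (optionally unfreeze the last few layers later).
for layer in model.layers[:-1]:
    layer.trainable = False

model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Manually labelled grayscale document images, one folder per class, with a
# roughly equal number of images per class to avoid the bias discussed above.
# The image size must match the input expected by the pretrained network.
train_ds = keras.utils.image_dataset_from_directory(
    "my_documents/train",
    image_size=(227, 227),
    color_mode="grayscale",
    batch_size=32,
)

model.fit(train_ds, epochs=5)
```

The key points are re-using the pretrained convolutional weights and training a new 48-way classification head on a balanced, manually labelled dataset; the same idea carries over to any other framework.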

Hope this is somewhat useful. Roberto

wanghaisheng commented 6 years ago

Really helpful, thanks!

neerajbhat98 commented 4 years ago

Hi, can you tell me how many GPUs were required for training?