dhlab-epfl / dhSegment

Generic framework for historical document processing
https://dhlab-epfl.github.io/dhSegment
GNU General Public License v3.0

Help Wanted - Training Datasets Questions #24

Open Jim-Salmons opened 5 years ago

Jim-Salmons commented 5 years ago

I am developing #MAGAZINEgts, a ground-truth storage format based on an ontological "stack" of #cidocCRM/FRBRoo/PRESSoo. This format uses a metamodel subgraph design pattern of fine-grained PRESSoo Issuing Rules to prescribe complex document structures. For example, the Advertising Model describes the size, shape, and position of an ad, plus other features such as the page grid of the containing page, the number of colors in the ad, and whether the ad has margin 'bleed'. The reference implementation of #MAGAZINEgts is being evolved on the 48-issue collection of Softalk magazine currently available at the Internet Archive.

Using our prototype metadata discovery and curation tools -- the PrintPageNumber-to-ImageID mapper and the "Ad Ferret" -- we have curated a detailed model of all 7,000+ ads in Softalk magazine. We are integrating these two tools into an expanded application called the FactMiners Toolkit. The impetus for this new version of our #MAGAZINEgts-compatible tools is our interest in incorporating the generation of #dhSegment training datasets. An interesting feature of this page/labeled-mask image generation workflow will be our ability to use the metamodel subgraph of complex document structures to generate synthetic training page/mask images for under-sampled cases/labels.

While we will undoubtedly have additional questions about guidelines for generating #dhSegment training datasets, I'd like your insights on three basic questions I need answered to continue development of our toolkit:

  1. Should a training dataset for a #dhSegment model to be trained to recognize magazine ads include the 'no ad' case/label? (i.e. a label image that is all background color w/ no class/label color-coded mask bounding box for a page that does not have an ad... the 'anti'-case IOW)

  2. If an ad is full-page w/ margin bleed, would its training image mask be all case/label color with no background color visible?

  3. Magazine ads vary by size and shape, constrained by page grid columns and allowable positions. Can a training dataset of multiple classes/color-assignments be composed of individual mask images where only one of the many "watched for" cases/labels is found per training image/mask instance? For example, can 'red' be the color for 1/4-page vertical ads and 'blue' the color for 1/2-page horizontal ads, so that the trained model learns to distinguish the size- and shape-based granularity of the document structure model and not just "Yes, there is an ad of some kind on this page", which would be the case if all ad sizes/shapes were masked by the same color/class?

In closing... a more generic 'help wanted' ask here would be for any pointers to papers, datasets, or other on-line resources to better understand the assessment and handling of training dataset balancing, particularly as you have faced this issue in your own #dhSegment experiments.

This current activity is the subject of my proposed #DATeCH2019 submission, "#MAGAZINEgts and #dhSegment: Using a Metamodel Subgraph to Generate Synthetic Data of Under-Sampled Complex Document Structures for Machine-Learning." However, it does not appear that the January 20th deadline for full-paper submissions will be extended. So my research is full-speed-ahead although I am increasingly aware that this paper will have to find another venue for sharing and possible publication.

Thank you for your timely reply and keep up the GREAT work. #dhSegment is not only a great contribution to the historic document text- and data-mining domain, but it will undoubtedly be a significant technology resource for the Time Machine FET Flagship project.

Happy-Healthy Vibes from Colorado USA, -- Jim --

P.S. Here is a screenshot of the Ad Ferret in use and a pivot-table of the counts of the various ad sizes/shapes in Softalk magazine.

[Screenshot: factminers_ad_ferret_1]

[Screenshot: magazinegts_softalk_adspec_pivotables]

solivr commented 5 years ago

Hello @Jim-Salmons,

Happy to see your project going forward! To answer your questions:

  1. Should a training dataset for a #dhSegment model to be trained to recognize magazine ads include the 'no ad' case/label? (i.e. a label image that is all background color w/ no class/label color-coded mask bounding box for a page that does not have an ad... the 'anti'-case IOW)

Unless all the magazine pages contain ads, you should indeed include the 'no-ad' label and have samples of 'no-ad' pages in your training data (as well as in your evaluation and testing data). The data you use to train a model should, as much as possible, be representative of the data you will process, so if there are 'no-ad' pages, the system must see some examples of these during the training.

  2. If an ad is full-page w/ margin bleed, would its training image mask be all case/label color with no background color visible?

dhSegment is a pixel-wise segmenter, so if all the pixels of your image belong to the class 'ad', then yes, its image mask should be assigned the 'ad' label color. It shouldn't be a problem if there is no 'background' color in the annotated mask.
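For instance, the label images for the two extreme cases (a full-page ad with bleed and a 'no-ad' page) could be generated with something as simple as this small sketch using Pillow; the page size and colours below are arbitrary examples:

```python
from PIL import Image

# Arbitrary example values: rescaled page size in pixels and class colours
page_size = (800, 1100)
ad_color = (255, 0, 0)          # colour assigned to the 'ad' class
background = (255, 255, 255)    # colour assigned to the 'background' class

# Full-page ad with bleed: every pixel belongs to 'ad'
full_page_ad_mask = Image.new("RGB", page_size, ad_color)

# 'No-ad' page: every pixel belongs to 'background'
no_ad_mask = Image.new("RGB", page_size, background)
```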

  3. Magazine ads vary by size and shape, constrained by page grid columns and allowable positions. Can a training dataset of multiple classes/color-assignments be composed of individual mask images where only one of the many "watched for" cases/labels is found per training image/mask instance? For example, can 'red' be the color for 1/4-page vertical ads and 'blue' the color for 1/2-page horizontal ads, so that the trained model learns to distinguish the size- and shape-based granularity of the document structure model and not just "Yes, there is an ad of some kind on this page", which would be the case if all ad sizes/shapes were masked by the same color/class?

I don't have a definite answer to this question, but I see two strategies:

  1. First, have a model that detects ads (no matter what shape/size they have), and then have another classification method that separates ads depending on their shape/size.
  2. Train a model that has 2 classes, vertical and horizontal ads (independently of their size), and then have another method separating ads of different sizes within each of the two classes (see the sketch below). I wouldn't mix classes for shape/orientation and size of ads.

You may have to run a small experiment with both approaches to see which one performs better...
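For the second option, the class file for the training data would then list one colour per class, roughly along these lines (a quick sketch, assuming the usual dhSegment classes.txt layout of one "R G B" line per class; the colour values are arbitrary):

```python
# Quick sketch: write a classes.txt with one "R G B" line per class.
# The colour values are arbitrary; they only have to match the colours
# used in your label images.
classes = {
    "background":    (0, 0, 0),
    "ad_vertical":   (255, 0, 0),
    "ad_horizontal": (0, 0, 255),
}

with open("classes.txt", "w") as f:
    for r, g, b in classes.values():
        f.write(f"{r} {g} {b}\n")
```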

In closing... a more generic 'help wanted' ask here would be for any pointers to papers, datasets, or other on-line resources to better understand the assessment and handling of training dataset balancing, particularly as you have faced this issue in your own #dhSegment experiments.

I can't think of a particular reference on this topic, but I found this paper which may be an entry point: H. He and E. Garcia, "Learning from Imbalanced Data". From my experience, you should have your training/testing data be representative of the whole data you want to work with. Sometimes it may also happen that your model has trouble segmenting a particular case; usually, annotating a few examples of the problematic case improves the segmentation.

Hope this helps and good luck for your DATeCH submission!

Jim-Salmons commented 5 years ago

Hello Sofia @solivr! :-) And by extension Frederick, Benoit @SeguinBe, Dario, Isabel, Maude, etc. -- hello from Colorado USA! :-)

Thank you, Sofia, for your timely and thoughtful reply. You have given me helpful guidance to move the FactMiners and Softalk Apple Project forward. By way of an update, let me give you Good Folks some additional information about my current efforts, which focus on bringing #dhSegment into the programmatic and ground-truth workflows of our research...

Here is an animated GIF (no video player required) that shows the extension of the FactMiners "Ad Ferret" into the FactMiners Toolkit as we broaden our Python-based metadata discovery and validation tools that implement our vision for the #MAGAZINEgts ground-truth storage format for magazines, newspapers, and other forms of serial publications.

[Animated GIF: fmtk_ml_tim_demo]

As you watch this looping GIF, the Toolkit opens in the 'Ad Ferret' task/workflow profile. When I switch to the 'Machine Learning Training Image Maker' task profile, the currently viewed page image is replaced by an image of that page scaled to the "max-pixels" setting for the current ML Dataset being worked on. In this case, max-pixels is set to 1 million; the rescaled image is displayed, and we scroll to the position of the ad on that page. The resizable rubberband rectangle around the ad is the "predicted" bounding box for the size, shape, and position of such an ad based on the PRESSoo Issuing Rules in the #MAGAZINEgts metamodel subgraph that describes the Advertising Model for Softalk magazine.
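For what it's worth, the "max-pixels" rescaling itself is conceptually simple. It boils down to something like this rough Python sketch (just the idea, not the Toolkit's actual code):

```python
import math
from PIL import Image

def rescale_to_max_pixels(img, max_pixels=1_000_000):
    """Scale a page image down so its total pixel count stays at or below
    max_pixels, preserving the aspect ratio."""
    w, h = img.size
    scale = math.sqrt(max_pixels / (w * h))
    if scale >= 1.0:
        return img  # the page is already small enough
    return img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
```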

Once we've entered the Training Image Maker task profile, we continue to work our way through the dataset of ground-truthed ads to refine the bounding boxes for their actual, not predicted, position on the page. As you see in the GIF, saving and moving on to the next page automatically generates the max-pixel-sized page image and its label/mask image in their respective #dhSegment-suggested model-training directories. In addition to generating these training images, we write entries into the appropriate ML Dataset in the Metamodel partition of the #MAGAZINEgts file. The two main sections in our ML Dataset are highlighted in this screenshot:

[Screenshot: fmtk_ml_tim_maggts_1]

As expected, an important element of this dataset is the Label_map (highlight 1 above). You will see the essential information needed by #dhSegment in the ML_label element. This information includes the label name and the color to be assigned to the bounding-box rectangles in the label images to be generated. What is likely unexpected is the Issuing Rule subelement, which includes an XPath location that links the to-be-trained label case to the specific "prior knowledge" (more on this soon) about the document structures that make up, in this case, Softalk magazine's Advertising Model. As the most basic "mono-label" dataset, the image above uses the special value 'all' to indicate that all cases -- in this case sixteen -- at this XPath level of the metamodel subgraph are to be included in this single label map of the Softalk ads.

By way of example to further explain the Label_map entries in #MAGAZINEgts, here is a screenshot of a Label_map that could be used to train a #dhSegment model to "spot" 1/2-page ads:

[Screenshot: fmtk_ml_tim_maggts_3]
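To make that XPath linkage a bit more tangible, here is a rough Python sketch of how the Toolkit resolves a Label_map entry against the metamodel subgraph. The subelement names below are simplified stand-ins, not the exact #MAGAZINEgts element names (the real file is linked in the P.S.):

```python
from lxml import etree

# Rough sketch only: 'Name' and 'Issuing_rule_xpath' are simplified
# stand-ins for the real #MAGAZINEgts subelement names.
tree = etree.parse("softalkapple_publication.xml")

for ml_label in tree.iter("ML_label"):
    label_name = ml_label.findtext("Name")
    rule_xpath = ml_label.findtext("Issuing_rule_xpath")
    if rule_xpath:
        # Follow the link to the PRESSoo Issuing Rules ("prior knowledge")
        rules = tree.xpath(rule_xpath)
        print(label_name, "->", len(rules), "issuing rule(s)")
```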

With an understanding of the linkage between the ML_dataset's Label_map and the #MAGAZINEgts metamodel subgraph describing this publication's document structures, we can turn our attention to the second highlighted subelement in the ML_dataset screenshot above. For each ad entry in Softalk, we write an ML_training_img_spec entry which includes the filename and extension used for both the page and its labeled mask image in their respective training subdirectories. The subelements within this training image spec include the max-pixel-based dimensions of the page and label images together with two bounding-box rectangle entries: one for the "predicted" rectangle based on the idealized Advertising Model in the #MAGAZINEgts metamodel subgraph, and one for the "actual" rectangle, a ground-truth measurement made via the FactMiners Toolkit workflow showcased in the GIF-based demo above. We track the predicted and actual values so we can further refine the to-be-trained #dhSegment model to be better informed when predicting ad locations once the model eventually does ad-spotting without the benefit of a ground-truth dataset that pre-identifies the size, shape, and relative location of ads in a publication.
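To make the generation step concrete, producing the label/mask image from the "actual" rectangle boils down to something like this simplified sketch; the colour comes from the Label_map, the rectangle and page size from the ML_training_img_spec, and the literal values below are only illustrative:

```python
from PIL import Image, ImageDraw

def make_label_image(page_size, actual_rect, label_color, background=(255, 255, 255)):
    """Paint the ad's ground-truthed rectangle in its class colour on a
    background-coloured canvas matching the rescaled page image."""
    mask = Image.new("RGB", page_size, background)
    ImageDraw.Draw(mask).rectangle(actual_rect, fill=label_color)
    return mask

# Illustrative values only: a 1/4-page vertical ad masked in red
mask = make_label_image((800, 1100), (400, 550, 600, 1050), (255, 0, 0))
mask.save("page_0042_label.png")
```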

To better understand how we can use the #MAGAZINEgts ground-truth storage format and the FactMiners Toolkit to address #dhSegment training dataset imbalance issues, it is time to revisit my mention above of how the #MAGAZINEgts metamodel subgraph can serve as a kind of "prior knowledge" that may help #dhSegment models be even more accurate and easier to train. To make the general case for the use of prior knowledge in ML model training, I defer to FAU's brilliant Andreas Maier et al. and their recent #ICPR2018 paper, "Precision Learning: Towards Use of Known Operators in Neural Networks" (https://arxiv.org/abs/1712.00374). In this paper and in Andreas' guest post on MarketTechWatch, "Does deep learning always have to reinvent the wheel?" (https://goo.gl/w7uUDT), Andreas and his fellow researchers explore how the incorporation of prior knowledge can reduce the number of free parameters and the maximal training error bound when training neural networks.

In Andreas' blog post, meant especially for non-technical business leaders, he uses a "hello, world" simple-case example to explain the #PrecisionLearning approach to incorporating prior knowledge. Why not, for example, incorporate a fast Fourier transform operator into a neural network rather than require the network to learn such a transform operation from scratch? While this example is intuitive and compact, Andreas leaves it up to the motivated reader to imagine more real-world applications in other domains of interest.

I believe that the #MAGAZINEgts design -- in this case, the relationship between its ML_dataset metadata and the Issuing Rules for document structures in its Metamodel subgraph -- provides precisely the kind of "hook" for bringing #PrecisionLearning prior knowledge into the OCR and OLR of serial publications with complex document structures. The XPath-based link from the ML_dataset's Label_map points directly to the kind of prior knowledge that a neural net trying to learn about magazine and newspaper document structures can use. This linkage provides a "breadcrumb hint" that the neural net can use to narrow its discovery and refinement of the constraints on the free-parameter configurations of the problem space. That XPath-based Label_map link to the metamodel's PRESSoo Issuing Rules introduces a wealth of insights to the neural net's learning effort. Here is a screenshot dip into the Issuing Rules for the Advertising Model of Softalk magazine:

[Screenshot: fmtk_ml_tim_maggts_2]

The more we can help a neural net like #dhSegment to make the connection between its learning effort and the prior knowledge of a specific domain, e.g. historic document text- and data-mining, the "smarter" and more efficient those networks can be when investigating a new instance/problem in that domain.

As I approach the close of this detailed project update, I believe there is an equally important but previously unstated advantage of the #MAGAZINEgts approach to incorporating prior knowledge into the pipeline of a neural net's Deep Learning. In his blog post distillation of #PrecisionLearning, Andreas states, "First, the introduction of a known operation into a neural network always results in a lower or equal maximal training error bound. Second, the number of free parameters in the model is reduced and therewith also the size of the required training data is reduced."

Implicit in Andreas' insight about the incorporation of prior knowledge into a #PrecisionLearning network design is the corollary that if training datasets can be reduced in size, they become more susceptible to dataset imbalance issues. Under-sampled parameter-space configurations become far more critical when trying to train a #dhSegment or similar #CNN model with prior knowledge and smaller training datasets. Fortunately, as is the case with the design of the #MAGAZINEgts format, the codification of that prior knowledge used by the network can also be the source of the solution to this training dataset imbalance issue.

So now, in closing, here is a second animated GIF that shows how the #MAGAZINEgts format provides all the information necessary to generate synthetic training data instances to address under-sampled label classes in #dhSegment and similar neural net training datasets:

[Animated GIF: fmtk_syndata_pgs]
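Under the hood, generating one such synthetic instance amounts to something like the following simplified sketch. The slot values here are made-up placeholders; in the Toolkit they come from the PRESSoo Issuing Rules in the metamodel subgraph:

```python
import random
from PIL import Image, ImageDraw

# Made-up placeholder slots for a 1/4-page vertical ad, expressed as
# (width, height, x, y) fractions of the page. The real values come from
# the PRESSoo Issuing Rules in the #MAGAZINEgts metamodel subgraph.
QUARTER_PAGE_VERTICAL_SLOTS = [
    (0.25, 0.5, 0.00, 0.00),
    (0.25, 0.5, 0.75, 0.50),
]

def make_synthetic_sample(no_ad_page, ad_crop, label_color=(255, 0, 0)):
    """Paste an ad crop into a grid-allowed slot of a 'no-ad' page and
    return the synthetic page image plus its label/mask image."""
    W, H = no_ad_page.size
    fw, fh, fx, fy = random.choice(QUARTER_PAGE_VERTICAL_SLOTS)
    box = (int(fx * W), int(fy * H), int((fx + fw) * W), int((fy + fh) * H))
    page = no_ad_page.copy()
    page.paste(ad_crop.resize((box[2] - box[0], box[3] - box[1])), box[:2])
    mask = Image.new("RGB", (W, H), (255, 255, 255))
    ImageDraw.Draw(mask).rectangle(box, fill=label_color)
    return page, mask
```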

I hope this detailed project update is insightful and provocative enough to encourage you Good Folks doing #dhSegment at (Twitter) @DHI_EPFL and at Andreas' Pattern Recognition Lab (Twitter) @FAU_Germany to collaborate more actively with indie #CitizenScientists like me and my fellow cancer-surviving wife and project partner (Twitter) @TimlynnBabitsky. The (Twitter) @TimeMachineEU FET Flagship project is awesome and so important. We are so tired and disappointed to be on the "wrong side of The Pond" during this moment of such great #DigitalHumanities and #TDM innovation. If #DARIAHbeyondEurope is more than a slogan, please find a way to help us be a part of your exciting and important work.

Happy-Healthy Vibes from Colorado USA, -: Jim Salmons :-

P.S. The currently published #MAGAZINEgts file under development by FactMiners and The Softalk Apple Project is linked on the About page of the Softalk Apple collection at the Internet Archive. The XML file itself is here: https://archive.org/download/softalkapple/softalkapple_publication.xml.

Jim-Salmons commented 5 years ago

Hello Sofia @solivr! :-) And by extension Frederick, Benoit @SeguinBe, Dario, Isabel, Maude, etc. -- hello again from Colorado USA! :-)

I am in the final days of pulling together our #DATeCH2019 poster, which will feature #MAGAZINEgts handling of unbalanced #dhSegment training datasets, and I have a couple of quick, relatively minor questions:

A further naive assumption on my part is that the "no ad" instances would simply be page images in the eval/test subsets and that there would be no label/mask images for these images as the "no ad" case is, I am assuming, not included in the training subset.

If, on the other hand, the "no ad" case needs to be included in the training subset, it must follow -- in the case of red being used for the label/mask bounding boxes -- that the "no ad" instance would have a label image sized to max_pixels for that page, and that label image would be all white (the background color) with no red pixels in it. This is essentially the opposite case of a full-page ad, which you Good Folks have suggested should be represented by a label/mask image that is, in this case, all red with no white, as there is no page content "outside" of the bounding-box of the ad.

Any quick advice you folks can offer in this regard would be greatly appreciated. The deadline for submission of #DATeCH2019 posters is April 20th, so I am in scramble mode and look forward to your reply.

BTW, I have just finished integrating the Open Source BaseX native XML database into the Python-based FactMiners Toolkit (fmtk) for direct editing of the #MAGAZINEgts ground-truth file and it is AWESOME!!! This has collapsed the workflow of metadata discovery and curation significantly. I was writing intermediary files in JSON, rounding them up and converting to XML before copy-pasting the new data into the master #MAGAZINEgts file. That is now all going away thanks to direct-editing of a local copy of the full #MAGAZINEgts file in BaseX! :-)
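For anyone curious, the round trip from Python is as simple as something along these lines (a rough sketch using the BaseX Python client; the database name and the XQuery are just illustrative):

```python
from BaseXClient import BaseXClient

# Illustrative only: the database name and query are placeholders.
session = BaseXClient.Session("localhost", 1984, "admin", "admin")
try:
    session.execute("open softalkapple")
    query = session.query("count(//ML_training_img_spec)")
    print("training image specs:", query.execute())
    query.close()
finally:
    session.close()
```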

Happy-Healthy Vibes from Colorado USA, -- Jim --