Adds the U-Blad corpus. The corpus is already indexed and ready to use on the test server! Review should not take much work, since this adds no new functionality. One potentially interesting note: in this corpus definition, `soup_transform_func` is used not just to extract text from a node but to format the strings inside the node. This could later be expanded with styling features, since hOCR contains stylistic classes such as bold and italic, as well as font size.
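For reviewers unfamiliar with the idea, here is a minimal sketch of what "formatting the strings inside the node" can look like. It is not the code in this PR: the function name `format_page`, the sample markup, and the assumption that the transform returns a plain string are all hypothetical. The class names `ocr_page`, `ocr_line` and `ocrx_word` are common hOCR classes, but the U-Blad files may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical hOCR fragment for illustration; real U-Blad pages will differ.
HOCR = """
<div class="ocr_page">
  <p class="ocr_par">
    <span class="ocr_line"><span class="ocrx_word">Eerste</span> <span class="ocrx_word">regel</span></span>
    <span class="ocr_line"><span class="ocrx_word">Tweede</span> <span class="ocrx_word">regel</span></span>
  </p>
</div>
"""


def format_page(node):
    """Hypothetical transform: rebuild the text of a node line by line.

    Words are joined with single spaces and lines with newlines, instead of
    returning the raw concatenated text of the node.
    """
    lines = []
    for line in node.find_all(class_="ocr_line"):
        # Collect the stripped text of every word element on this line.
        words = [w.get_text(strip=True) for w in line.find_all(class_="ocrx_word")]
        lines.append(" ".join(w for w in words if w))
    return "\n".join(lines)


soup = BeautifulSoup(HOCR, "html.parser")
page = soup.find(class_="ocr_page")
text = format_page(page)
print(text)
```

Because the function walks the word elements one by one, it would also be the natural place to inspect per-word attributes if styling support is added later.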