UUDigitalHumanitieslab / I-analyzer

The great textmining tool that obviates all others
https://ianalyzer.hum.uu.nl
MIT License

U-Blad Corpus #1590

Closed Meesch closed 2 weeks ago

Meesch commented 1 month ago

Adds the U-Blad corpus. The corpus is already indexed and ready to use on the test server! Reviewing this should not be a lot of work, since it does not add any new functionality. One potentially interesting note: in this corpus definition, soup_transform_func is not used just to extract text from a node, but to format the strings inside the node. This could later be expanded with styling features, since hOCR contains stylistic information such as bold/italic classes and font size.
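To illustrate what "formatting the strings inside the node" could look like, here is a minimal sketch. It is not the code from this PR: the corpus definition uses a soup_transform_func, whereas this stand-in uses Python's standard-library html.parser so it runs without dependencies. The names HocrTextFormatter and format_hocr are hypothetical, and the sketch assumes the input uses the standard hOCR classes ocr_par, ocr_line and ocrx_word.

```python
from html.parser import HTMLParser

# hOCR classes this sketch cares about (assumed to be present in the input).
HOCR_CLASSES = ("ocr_par", "ocr_line", "ocrx_word")


class HocrTextFormatter(HTMLParser):
    """Hypothetical helper: join hOCR words into lines, lines into paragraphs."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # one list of line strings per paragraph
        self._line = None     # words of the line being read, or None
        self._stack = []      # hOCR class (or "") of each open <p>/<span>

    def handle_starttag(self, tag, attrs):
        if tag not in ("p", "span"):
            return  # styling tags such as <strong> are ignored for now
        classes = (dict(attrs).get("class") or "").split()
        kind = next((c for c in classes if c in HOCR_CLASSES), "")
        self._stack.append(kind)
        if kind == "ocr_par":
            self.paragraphs.append([])
        elif kind == "ocr_line":
            self._line = []

    def handle_endtag(self, tag):
        if tag not in ("p", "span") or not self._stack:
            return
        kind = self._stack.pop()
        if kind == "ocr_line":
            if self._line:
                if not self.paragraphs:  # line outside any paragraph
                    self.paragraphs.append([])
                self.paragraphs[-1].append(" ".join(self._line))
            self._line = None

    def handle_data(self, data):
        # Keep text only when it sits inside a word that belongs to a line.
        inside_word = bool(self._stack) and self._stack[-1] == "ocrx_word"
        if inside_word and self._line is not None and data.strip():
            self._line.append(data.strip())


def format_hocr(markup: str) -> str:
    """Return text with one OCR line per row and a blank line between paragraphs."""
    parser = HocrTextFormatter()
    parser.feed(markup)
    parser.close()
    return "\n\n".join("\n".join(lines) for lines in parser.paragraphs if lines)
```

A styling-aware version could hook into handle_starttag at the point where tags like `<strong>` are currently skipped, which is roughly the kind of extension mentioned above.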

Meesch commented 1 month ago

I think I have addressed all your comments; let me know if you agree with my solutions. No worries if this does not make it into the next release!