INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

Documents with no tokens cause issues #478

Closed by jan-niestadt 11 months ago

jan-niestadt commented 1 year ago

If the data set contains documents with 0 tokens, problems can occur when calculating group stats or getting document statistics. This should be handled gracefully.

(https://portal.clarin.inl.nl/corpus-frontend-chn/chn-extern/search/docs?filter=languageVariant%3A%28%22BN%22+%22NN%22%29+AND+medium%3A%28%22newspaper%22%29+AND+%28witnessYear_from%3A%5B2013+TO+2013%5D+AND+witnessYear_to%3A%5B2013+TO+2013%5D%29&first=0&group=field%3AtitleLevel2%3Ai&number=20&interface=%7B%22form%22%3A%22search%22%2C%22patternMode%22%3A%22extended%22%7D&groupDisplayMode=table)
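To make "handled gracefully" concrete, here is a minimal Java sketch of one plausible failure mode: dividing by a token count of zero when computing per-group or per-document statistics. This thread does not confirm that this was the actual cause. The class `ZeroTokenStats` and its methods are hypothetical names for illustration and are not part of the BlackLab API.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of the BlackLab API.
final class ZeroTokenStats {
    private ZeroTokenStats() {}

    /** Hits per token. Returns 0.0 instead of NaN/Infinity when there are no tokens. */
    static double relativeFrequency(long hits, long tokens) {
        if (tokens <= 0) {
            return 0.0; // assumption: an empty document or group contributes a frequency of 0
        }
        return (double) hits / tokens;
    }

    /** Average token count per document in a group. Returns 0.0 for an empty group. */
    static double averageDocLength(List<Integer> tokenCounts) {
        if (tokenCounts.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (int n : tokenCounts) {
            total += Math.max(n, 0); // treat negative (invalid) counts as 0 tokens
        }
        return (double) total / tokenCounts.size();
    }

    public static void main(String[] args) {
        // A group whose documents all have 0 tokens no longer produces NaN.
        System.out.println(relativeFrequency(3, 0));
        // Zero-token documents still count as documents in the average.
        System.out.println(averageDocLength(List.of(0, 0, 12)));
    }
}
```

Returning 0.0 is only one possible convention. Depending on how the front end displays group stats, skipping zero-token documents or reporting the value as "not available" may be more appropriate.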

jan-niestadt commented 11 months ago

This appears to be a problem only with older BL versions; it works properly with newer versions (both integrated and external indexes), e.g. chn-intern.