INL / chaining-search

Library and web interface to easily combine exploration of linguistic resources

CorpusQuery.xml() returns invalid XML if multiple requests were made #12

Open AntheSevenants opened 2 years ago

AntheSevenants commented 2 years ago

If a corpus query returns a large number of results, additional requests are made, each starting from the end index of the previous one. However, the xml() method simply returns a concatenation of all the BlackLab XML responses:

https://github.com/INL/chaining-search/blob/ff005f075c4ffdc6c93df0346f0af32375daec8f/chaininglib/search/CorpusQuery.py#L278

The problem is that this joins multiple standalone XML documents into a single string. No XML parser will accept that string, since it contains multiple XML declarations and multiple root elements.
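To make the failure concrete, here is a minimal sketch using only the standard library. The two strings stand in for separate BlackLab responses; their element names are made up for illustration and are not taken from real BlackLab output.

```python
import xml.etree.ElementTree as ET

# Two standalone documents standing in for separate responses.
# Element names are invented for this example.
page_1 = '<?xml version="1.0"?>\n<response><hit>first</hit></response>'
page_2 = '<?xml version="1.0"?>\n<response><hit>second</hit></response>'


def parses(xml_text: str) -> bool:
    """Return True if xml_text is a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Mimics what a plain concatenation of responses produces.
concatenated = page_1 + page_2

print(parses(page_1))        # each page parses on its own
print(parses(concatenated))  # the joined string does not
```

Each page is fine in isolation; only the joined string is rejected, which is why handing the responses back separately would sidestep the problem.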

Unfortunately, I don't see how the xml() method in itself can be improved. There doesn't seem to be an elegant way to combine the information from multiple responses, but I think returning broken XML isn't a viable option either.

Some other options:

  1. Always return a list of all XML responses, regardless of how many requests were made
  2. Make the xml() method index-based, so it returns the XML response of that index. This implies that there should be a way to find out how many requests were made in the first place.
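Both options could be sketched roughly as follows. This is only an illustration: `PagedXmlResponses`, `add`, `request_count` and `xml_list` are hypothetical names that do not exist in chaininglib, and how CorpusQuery actually stores its responses is not assumed here.

```python
from typing import List


class PagedXmlResponses:
    """Hypothetical holder for one XML string per request (not chaininglib API)."""

    def __init__(self) -> None:
        self._pages: List[str] = []

    def add(self, xml_text: str) -> None:
        """Store the raw XML of one request, in request order."""
        self._pages.append(xml_text)

    def request_count(self) -> int:
        """Option 2 needs this: how many requests were made."""
        return len(self._pages)

    def xml_list(self) -> List[str]:
        """Option 1: always return every response, even if there is only one."""
        return list(self._pages)

    def xml(self, index: int = 0) -> str:
        """Option 2: return the response of a single request by index."""
        if not 0 <= index < len(self._pages):
            raise IndexError(f"no response with index {index}")
        return self._pages[index]


responses = PagedXmlResponses()
responses.add('<?xml version="1.0"?><response><hit>first</hit></response>')
responses.add('<?xml version="1.0"?><response><hit>second</hit></response>')

# Every string handed back is a complete document that a parser can accept.
for i in range(responses.request_count()):
    print(i, responses.xml(i))
```

Either way, the caller never receives a string that mixes several documents. Option 1 is simpler to use, while option 2 keeps the existing single-string return type.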

Of course, I'm just thinking out loud here. A workaround for me currently is to use CorpusQuery._response, which also contains the different requests separately (which is what I need).