PhonologicalCorpusTools / CorpusTools

Phonological CorpusTools
http://phonologicalcorpustools.github.io/CorpusTools/
GNU General Public License v3.0

phonological search #746

Closed kchall closed 3 years ago

kchall commented 4 years ago

On the phono_search_improvements branch (but not on the master):

  1. If you do a search, look at results, and toggle to the individual results, the toggle between individual and summary results is suddenly reversed (i.e., you can in fact see both sets of results, but you're looking at the individual ones when it says "show individual results" and looking at the summary ones when it says "show summary results").

  2. Choosing "add to current results table" causes PCT to crash.

@stannam I'm assigning these to you because they seem up your general alley, but there's no rush!

stannam commented 4 years ago

re: 1 I couldn't replicate this. Possibly the same OS-dependent problem as the flip icon issue (#740).

re: 2

      File "D:\PycharmProjects\CorpusTools\corpustools\gui\models.py", line 577, in _summarize
        filters = self.rows[0][7], self.rows[0][8], self.rows[0][9], self.rows[0][10], self.rows[0][11], self.rows[0][12]
    IndexError: list index out of range

Seems easy.
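The traceback shows only that indexing `self.rows[0][7]` through `self.rows[0][12]` failed; it does not say whether `self.rows` was empty or the first row had fewer than 13 columns. A minimal sketch of a guard covering both cases, assuming a hypothetical helper `extract_filters` (not part of PCT) that takes the row list directly:

```python
def extract_filters(rows, first=7, last=12):
    """Return columns first..last of the first row as a tuple.

    Hypothetical helper for illustration only. Returns None when there
    is no first row or when that row is too short, instead of raising
    IndexError.
    """
    if not rows:
        return None          # no results yet, nothing to summarize
    head = rows[0]
    if len(head) <= last:
        return None          # row lacks the expected filter columns
    return tuple(head[first:last + 1])
```

Whether returning `None` is the right fallback depends on how `_summarize` uses the filters, which the thread does not show.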

stannam commented 4 years ago

gif

kchall commented 3 years ago

I think this is all fixed except that the 'negative search' results window doesn't contain the target / environment, and there's no indication that it's a positive vs. negative search.

kchall commented 3 years ago

I can confirm that I do now see the result type and that the target and environment show up in both positive and negative searches (usually!). But, I did unfortunately come up with quite a few other issues:

  1. In the results window, the way that the environment displays for the positive and negative searches is different (see screenshot). The negative search encloses the environment in {}. This particular example was done while just re-opening the function dialog and switching the result type, so there was no other difference in how the environment was actually originally specified. image

  2. When doing a search based on features, the individual target sounds are used instead of the category of the feature. E.g. if you search for #[+nasal], you get individual results for #m and #n words, rather than a single result for #[+nasal] words. That has pretty much always been true, and while I don’t love it, I can live with it…but I just noticed that it leads to something that I think is not okay. If you search for #[+nasal], and then re-open the dialog box and change the environment to be #m and add the results to the current results window, it doesn’t give you a new row — it literally adds the type and token frequencies to the existing row, doubling the counts. And if you re-run the search, it adds the frequencies again, etc., etc. I think that ‘adding to current results’ should always simply add a new row to the results table, never change the information in existing rows. So in this case, that would mean repeating the rows multiple times, even if they are identical.

  3. Hmm, actually, it looks like there is slightly different ‘category’ behaviour for the positive and negative searches, too. If you search for #[+nasal] with a positive search, you get separate entries for #m and #n, but if you re-open and do a negative search, you get a combined entry of #{m,n} (see screenshot). (I also don’t know why it inserts the new result between the other two; maybe it’s sorting alphabetically? But I think it would be better to append to the end!) Regardless, though, the way of grouping the results should be parallel (either separating m and n, or grouping them together).

image
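The "always add a new row" behaviour requested in item 2 can be sketched as follows. This is an illustration only: the function name `add_to_results`, the dictionary keys, and the counts are all hypothetical and are not taken from PCT's results model.

```python
def add_to_results(table, new_rows):
    """Append copies of new_rows to table without touching existing rows.

    Hypothetical sketch: identical rows are simply repeated rather than
    having their frequencies summed into an existing row.
    """
    table.extend(dict(row) for row in new_rows)
    return table


# Made-up counts, for illustration only.
results = [{"environment": "#m", "type_freq": 4, "token_freq": 10}]
add_to_results(results, [{"environment": "#m", "type_freq": 4, "token_freq": 10}])
```

After the call, `results` holds two identical rows, and the counts in the first row are unchanged.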

  4. When using the syllabified example corpus, the positive searches seem to be fine, but the negative search is again missing the target and environment information in the results window, and also doesn’t seem to show anything for the ‘summary’ window. (I specifically tried searching for a syllable with a [+nasal] onset and a [+syllabic] nucleus, in word-initial position. I’m attaching screenshots of the negative search results.)

image

  5. Cosmetically, I think it would be better to make the choices for “Search mode” and “Result type” radio buttons instead of check boxes, since only one can be selected at a time.

  6. When we update the documentation, we should make it clear how the result type interacts with the filters. What I think is happening is as follows (do you agree?). The filter(s) are applied to the corpus first and reduce which words are even being considered. E.g., if the minimum word frequency is 3, then all words with a frequency less than 3 are removed from consideration. The remaining words are essentially sorted into those that match the environment and those that don’t match. E.g., if I search for #m words, then words with a frequency of at least 3 are sorted into ones that begin with [m] and those that do not. If “positive” results are requested, it returns the ones that match (in this case, words that begin with [m] and have a frequency of at least 3); if “negative” results are requested, it returns the ones that don’t match (in this case, words that don’t begin with [m] and have a frequency of at least 3).

Crucially, this is different from including the filters as part of the ‘matching’ process. E.g., I might have expected that searching for #m with a minimum frequency of 3 would return “words that begin with [m] and have a frequency of at least 3” if positive results were selected (as above), but “all words that don’t begin with [m] along with all words that don’t have a frequency of at least 3” if negative results were selected. That is, it would be reasonable to do the full search first, dividing the whole corpus into those that match all criteria and those that don’t, and then to return either the match or mis-match words accordingly.

I think it’s fine to have it operate either way, as long as we are clear in what it is doing!
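The two interpretations described above can be compared on a toy example. This is a sketch under stated assumptions: the corpus, its frequencies, and the function names are invented for illustration and do not come from PCT.

```python
# Made-up (word, frequency) pairs, for illustration only.
corpus = [("ma", 5), ("mi", 1), ("ta", 4), ("to", 2)]


def matches(word):
    """Stand-in for the #m environment: the word begins with 'm'."""
    return word.startswith("m")


def filter_then_split(corpus, negative=False, min_freq=3):
    """Apply the frequency filter first, then return matches or non-matches."""
    kept = [(w, f) for w, f in corpus if f >= min_freq]
    return [w for w, f in kept if matches(w) != negative]


def split_on_all_criteria(corpus, negative=False, min_freq=3):
    """Treat the filter as part of the match, then return hits or misses."""
    return [w for w, f in corpus if (matches(w) and f >= min_freq) != negative]


print(filter_then_split(corpus, negative=True))
print(split_on_all_criteria(corpus, negative=True))
```

Positive results are the same under both readings (`["ma"]`). Negative results differ: filtering first gives `["ta"]`, while matching on all criteria gives `["mi", "ta", "to"]`, because low-frequency words count as mismatches instead of being dropped.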

kchall commented 3 years ago

I think this is all fixed and the documentation updated. Thanks!