Support pinyin for searching, and search across history/bookmarks by default

YesHub commented 5 years ago

I think QuickNavigator could serve as a reference here ;-)

(Saving original title here: "Can it search history and bookmarks together by default? Ideally it would also support pinyin.")

fwextensions commented 5 years ago

Assuming Google Translate is correct, the answer is that you have to type /h to search history items or /b to search bookmarks. There isn't currently an option to search across open tabs, history and bookmarks at the same time.
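As a rough illustration of the prefix convention described above, a query could be split into a search mode and the text to match. This is only a sketch and not QuicKey's actual code: the function name, the mode names, and the assumption that an unprefixed query searches open tabs are all hypothetical.

```javascript
// Hypothetical sketch: split a raw query into a search mode and the text to match.
// The "/h" and "/b" prefixes come from the thread; the names below are assumptions.
const MODE_PREFIXES = new Map([
    ["/h", "history"],
    ["/b", "bookmarks"],
]);

function parseQuery(rawQuery) {
    const prefix = rawQuery.slice(0, 2);

    if (MODE_PREFIXES.has(prefix)) {
        return { mode: MODE_PREFIXES.get(prefix), text: rawQuery.slice(2).trim() };
    }

    // No recognized prefix: assume the query applies to open tabs.
    return { mode: "tabs", text: rawQuery.trim() };
}
```

With this shape, each query is routed to exactly one source, which matches the limitation that tabs, history and bookmarks cannot be searched together.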

Are there any good fuzzy pinyin search libraries out there? A quick search didn't find much. I see QuickNavigator's code here, but it's quite old: https://github.com/qianlifeng/QuickNavigator

zgqq commented 5 years ago

@fwextensions Pinyin is written with Latin letters, so you can first convert the Chinese characters to pinyin and then search against the result. This library (https://github.com/hotoo/pinyin/blob/master/README-us_EN.md) is easy to use.

fwextensions commented 5 years ago

@zgqq is there a specific style in which you'd expect to enter the pinyin? The library has a number of options.

zgqq commented 5 years ago

@fwextensions Thank you for your response. Every Chinese character has a corresponding pinyin; for example, the pinyin for "我爱家人" is "wo ai jia ren". The feature I'd like is that typing "wo ai jia ren" finds results containing "我爱家人". The library's role is to convert the Chinese characters to pinyin.
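The convert-then-search idea can be sketched as follows. The four-entry dictionary is a hypothetical stand-in for what a pinyin library would provide, and the plain substring check is an assumption made for brevity; it is not how QuicKey scores matches.

```javascript
// Tiny stand-in dictionary (assumption); a real one would come from a pinyin library.
const SAMPLE_PINYIN = new Map([
    ["我", "wo"],
    ["爱", "ai"],
    ["家", "jia"],
    ["人", "ren"],
]);

// Convert known characters to toneless pinyin, leaving all other characters unchanged.
function toPinyin(text) {
    return Array.from(text, (ch) =>
        SAMPLE_PINYIN.has(ch) ? SAMPLE_PINYIN.get(ch) + " " : ch
    ).join("").trim();
}

// Match the query against both the original title and its pinyin form,
// ignoring case and whitespace.
function matchesTitle(query, title) {
    const squash = (s) => s.toLowerCase().replace(/\s+/g, "");
    const q = squash(query);

    return squash(title).includes(q) || squash(toPinyin(title)).includes(q);
}
```

Here `toPinyin("我爱家人")` returns `"wo ai jia ren"`, so typing that pinyin matches a title containing those characters.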

fwextensions commented 5 years ago

Sounds like you'd be fine with the "normal" style mode in that library, which doesn't take into account different tones. The default style uses accented characters to distinguish different tones, like pīn yīn, but I don't know how hard those characters are to type, or how important it is to sort results based on how well they match the tones.
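To show the difference between toned and toneless pinyin, here is a minimal sketch that strips tone marks using standard Unicode normalization. It is not how the library implements its normal style, and the function name `stripTones` is hypothetical.

```javascript
// Remove tone marks by decomposing accented letters (NFD) and dropping
// the combining diacritical marks in the U+0300 to U+036F range.
function stripTones(tonedPinyin) {
    return tonedPinyin.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}
```

For example, `stripTones("pīn yīn")` returns `"pin yin"`, which can be typed on a plain ASCII keyboard.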

zgqq commented 5 years ago

@fwextensions Yes, no need to consider different tones, just use

pinyin('我爱家人', {
    style: pinyin.STYLE_NORMAL
})

fwextensions commented 4 years ago

@zgqq, I finally had some time to try working on this feature. I have an early version of pinyin support in this branch.

If you wanted to try it out, you can download the zip archive of the branch, and unzip it. In Chrome, go to chrome://extensions/ and toggle on Developer mode in the top right. Then click the Load unpacked button and select the src/ folder inside the unzipped archive. This should load the pinyin version of QuicKey and add its icon to the toolbar. You can then try searching with pinyin to match any tabs with Chinese characters in the title.

I had a few questions about using pinyin for searching that maybe you could help with:

The library will sometimes return two or more pinyin strings if the character is a heteronym. How important is it to enable matching against both strings? In the branch I currently simply pick the first string and ignore the rest, which I assume isn't ideal.
There are segmentation and heteronym modes in the pinyin library. Do you happen to know if they should be turned on? The readme says the segmentation mode will make conversions run slower.
The branch is using the web version of the library, but there's also a node version that supports segmentation and traditional Chinese. I think I can make that work in an extension, but I don't know if the increased size is worth it.

I would appreciate any help you can offer.

zgqq commented 4 years ago

@fwextensions Thank you for your hard work. It is best to turn on both segmentation and heteronym modes. When the library encounters a word with multiple possible pinyin readings, it can then automatically choose the correct one and will not return the others. Traditional Chinese support would also be useful, since Taiwan and Hong Kong use traditional characters.

fwextensions commented 4 years ago

Okay, I'll see about getting the node version working in the extension.

fwextensions commented 4 years ago

Another question is how important it is to highlight the characters that match what you've typed. With ASCII characters, it's obviously straightforward to show where the query matches each string, but it would be difficult to highlight the specific character that matches each pinyin string in the query.
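One possible way to map a pinyin match back to the original characters is to record, for every pinyin letter, which title character produced it. This is only a sketch of the idea and nothing in the thread confirms it was implemented. The dictionary and function names are hypothetical, and it assumes a contiguous substring match rather than fuzzy scoring.

```javascript
// Tiny stand-in dictionary (assumption); a real one would come from a pinyin library.
const HIGHLIGHT_DICT = new Map([
    ["我", "wo"],
    ["爱", "ai"],
    ["家", "jia"],
    ["人", "ren"],
]);

// Build the pinyin string and, for each pinyin letter, the index of the
// title character (counted by code point) that it came from.
function buildPinyinIndex(title) {
    let pinyin = "";
    const sourceIndex = [];

    Array.from(title).forEach((ch, i) => {
        const converted = HIGHLIGHT_DICT.get(ch) || ch;

        for (const letter of converted) {
            pinyin += letter;
            sourceIndex.push(i);
        }
    });

    return { pinyin, sourceIndex };
}

// Return the indexes of the title characters covered by a substring match.
function highlightedChars(query, title) {
    const { pinyin, sourceIndex } = buildPinyinIndex(title);
    const q = query.replace(/\s+/g, "");
    const start = pinyin.indexOf(q);

    if (start < 0) {
        return [];
    }

    return [...new Set(sourceIndex.slice(start, start + q.length))];
}
```

Typing "ai jia" against "我爱家人" yields `[1, 2]`, meaning the second and third characters would be highlighted. Heteronyms and non-contiguous fuzzy matches would make this bookkeeping noticeably harder.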

zgqq commented 4 years ago

@fwextensions Although I've seen some software implement this feature, it can be a challenge. Personally, I'm fine without it.

fwextensions commented 4 years ago

@zgqq I've discovered that segmentation won't work, as it depends on a module called nodejieba which is written in C++. That works in node, but not the browser. So I'm not sure if the extra 3MB that comes with the node version of the pinyin lib will be worth it. (Also, it seems to bring babel to its knees when building the extension, and the popup menu takes 2-3 times longer to open.)

I've enabled heteronyms in the library. Since there's no segmentation support, it can return multiple pinyin strings for a single character. I'm just including all of those, separated by spaces, in the pinyin string. So if a user expects to be able to type one of them, it'll be in the string to match against, but the extra heteronyms aren't visible anywhere.
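The approach of keeping every heteronym in the searchable string can be sketched like this. The dictionary is a hypothetical stand-in limited to readings mentioned in this thread, and the subsequence check is a simplified substitute for real fuzzy scoring, not QuicKey's matcher.

```javascript
// Stand-in dictionary (assumption) where a character may have several readings.
const READINGS = new Map([
    ["我", ["wo"]],
    ["爱", ["ai"]],
    ["家", ["jia", "jie"]],
    ["人", ["ren"]],
    ["音", ["yin"]],
    ["乐", ["le", "yue"]],
]);

// Include every reading of every character, separated by spaces.
function toSearchablePinyin(title) {
    return Array.from(title)
        .map((ch) => (READINGS.get(ch) || [ch]).join(" "))
        .join(" ");
}

// Simplified fuzzy match: the query letters must appear in order in the target.
function isSubsequence(query, target) {
    const q = query.replace(/\s+/g, "");
    let matched = 0;

    for (const ch of target) {
        if (matched < q.length && ch === q[matched]) {
            matched++;
        }
    }

    return matched === q.length;
}
```

With this sketch, `toSearchablePinyin("音乐")` gives `"yin le yue"`, and a query of "yin yue" still matches because the unused reading "le" is simply skipped over.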

I've pushed the heteronym changes to the feature/pinyin branch. I've also pushed the full node lib to the feature/pinyin-node branch. If you could download those and compare their behavior, that would be really useful. I'm leaning toward not using the node version, as it's so large and I can't use the segmentation support anyway.

zgqq commented 4 years ago

@fwextensions Instead, use the web version, but if the title contains a polyphonic word, it generates multiple pinyin strings. Then when searching for the title, matching in the pinyin strings.

fwextensions commented 4 years ago

@zgqq when you said:

"if the title contains a polyphonic word, it generates multiple pinyin strings. Then when searching for the title, matching in the pinyin strings."

I think that's what the feature/pinyin branch is doing. It returns all the heteronyms, separated by spaces, so if you just type the one you think it should be, at least the pinyin string will be there to get matched.

Here are a few examples I found on the web. Each line shows the characters, then what they currently get converted to, then what the conversion would be if segmentation were working:

1. 电邮弹回来: dian you dan tan hui lai, should be diàn yóu tán huí lái
2. 调养身体: tiao diao zhou yang shen ti, should be tiáo yǎng shēn tǐ
3. 音乐: yin le yue yao lao, should be yīn yuè

Note that for the third example, the output shown is what the node version returns, since it has a bigger dictionary. The web version returns just yin le yue. Is that a big difference?

I think I could convert the node dictionary into the format the web version uses. It might still be a lot larger, but it would cover more characters.

zgqq commented 4 years ago

@fwextensions Normally the web version is sufficient, since it contains the most common pinyin readings.

fwextensions commented 4 years ago

@zgqq have you had a chance to try the feature/pinyin branch? This page has some instructions on loading an unpacked extension, though in the case of QuicKey, you'd load the src/ folder after unzipping the archive.

The pinyin matching seems to work, as far as I can test it. I'm just not sure how it feels for a native speaker. Also, I just noticed that the example string you gave above, "我爱家人", is converted to "wo ai jia jie ren" by the library, not "wo ai jia ren", due to the heteronym.

zgqq commented 4 years ago

@fwextensions I just tried it out, and it works fine. Thank you for your efforts.

fwextensions commented 4 years ago

@zgqq great, thanks for checking! I'll publish it to the store soon.

fwextensions / QuicKey

Support pinyin for searching, and search across history/bookmarks by default #18