OpenUserJS / OpenUserJS.org

The home of FOSS user scripts.
https://openuserjs.org/
GNU General Public License v3.0

Non latin search NG ? #133

Open jesus2099 opened 10 years ago

jesus2099 commented 10 years ago

Hello, it seems that non-Latin script search is not working. Searching for 歌詞 should find "kasi. PLAIN TEXT LYRICS 歌詞コピー 純文本歌詞" (same result if I manually encode the URL with %E6%AD%8C%E8%A9%9E).

sizzlemctwizzle commented 10 years ago

It probably has to do with the use of a route param instead of a query parameter. This is being fixed with #103.

Martii commented 10 years ago

Not working in current production with 歌詞 pasted into the search box. Landing on https://openuserjs.org/?q=%E6%AD%8C%E8%A9%9E ... returns no results

jesus2099 commented 10 years ago

Just in case, I tried with 歌詞コピー (but a real CJK search should also find 歌詞 as words are not separated by spaces) and it doesn’t work either.

Martii commented 10 years ago

I'll see what I can do. USO doesn't support this either so flipping label to feature/enhancement and assigning myself for the time being... help is always appreciated though. This is something we should strive to support. Probably just some UTF-8 flag somewhere.

Zren commented 10 years ago

We're using regex to look things up, so it should be possible. Just means we need to change these regexes a bit.

https://github.com/OpenUserJs/OpenUserJS.org/blob/master/libs/modelQuery.js#L46

We're using \b

So we'll need to make our own \b for unicode.
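To illustrate the point about \b, here is a minimal sketch of why the native word boundary fails on CJK terms, plus one possible emulation. The helper name `unicodePrefixRegex` is hypothetical and is not taken from modelQuery.js. The `\p{P}` property escape needs the `u` flag and an ES2018-capable engine, which was not available when this issue was opened.

```javascript
// \b is defined in terms of \w, which only covers [A-Za-z0-9_].
// A CJK character is "non-word" on both sides, so \b never fires next to it.
const asciiBoundary = /\bkasi/i;
const cjkWithB = /\b歌詞/;

console.log(asciiBoundary.test('kasi. PLAIN TEXT LYRICS')); // true
console.log(cjkWithB.test('歌詞コピー')); // false

// Escape regex metacharacters in user input.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical emulation: treat start of string, whitespace or
// punctuation as the left-hand "boundary". This is an approximation,
// not an exact equivalent of \b.
function unicodePrefixRegex(term) {
  return new RegExp('(?:^|[\\s\\p{P}])' + escapeRegExp(term), 'iu');
}

const title = 'kasi. PLAIN TEXT LYRICS 歌詞コピー 純文本歌詞';
console.log(unicodePrefixRegex('歌詞').test(title)); // true
console.log(unicodePrefixRegex('コピー').test(title)); // false
```

Note that the emulation still only matches at the start of a space-delimited run, so コピー in the middle of 歌詞コピー is not found.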

Misc links:

Martii commented 10 years ago

I think Johan touched on some of what you are referencing on MDN and I used it in one of my scripts for a different purpose... I'll see if I can dig up that link and post momentarily...


Here we go... That way we could possibly search on that kind of modified content... although it may use up quite a bit of drone memory/cycles if unicode is found in a search string... What do you think @Zren?


A native node.js package would be preferred if these fail to meet expectations.

See also:

If we do end up using any route like these we should try to only exec this if unicode is detected and default to native V8 for everything else.
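A rough sketch of that "only when needed" idea: check the search word for non-ASCII characters first and take the heavier path only then. The dispatcher `buildPattern` and the two builder callbacks are hypothetical stand-ins, and the example words are assumed to contain no regex metacharacters.

```javascript
// Matches any UTF-16 code unit outside the 7-bit ASCII range.
const NON_ASCII = /[^\x00-\x7F]/;

// Hypothetical dispatcher: `nativeBuild` and `unicodeBuild` stand in for
// whatever builds the plain V8 regex and the heavier Unicode-aware one.
function buildPattern(word, nativeBuild, unicodeBuild) {
  return NON_ASCII.test(word) ? unicodeBuild(word) : nativeBuild(word);
}

// Record which path each word takes.
const calls = [];
const nativeBuild = (w) => { calls.push('native'); return new RegExp('\\b' + w, 'i'); };
const unicodeBuild = (w) => { calls.push('unicode'); return new RegExp(w, 'i'); };

buildPattern('lyrics', nativeBuild, unicodeBuild);
buildPattern('歌詞', nativeBuild, unicodeBuild);
console.log(calls); // [ 'native', 'unicode' ]
```

The check itself is a single native regex test, so pure ASCII searches pay almost nothing extra.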

jesus2099 commented 10 years ago

Thanks for looking into this ! :)

Martii commented 10 years ago

So tried out the XRegExp package and couldn't get it to work well under the current architectural project design... but did figure out how to emulate (this means not exact btw) the word boundary for Unicode enabled strings... mixed this in with a bit of installWith code (GPL v3+ btw) and it appears it is good to go for intense dev testing.

Works with:

I am not going to submit a pr until this is cross-confirmed as functionally equivalent plus for the RFE in this issue by multiple devs... but it can be checked out from this named issue branch number since my master/branch is currently sync'd with upstream/master if my GH repo is added as a remote.

sizzlemctwizzle commented 10 years ago

@Martii I'd argue we could test this better in production (not everyone knows how to build the site) since it doesn't affect ASCII searches and at worst doesn't work on non-ASCII (current behavior). Can you submit a PR (after you take a look at my comment of course)?

Martii commented 10 years ago

> Can you submit a PR

I need a break... been reading and testing on this all day today... but I'll see if your line note change meets with my basic testing results after I get some food. :)

Martii commented 10 years ago

Alright ready for final check, merge and deploy @sizzlemctwizzle

@jesus2099 I will need you to test this out fully with your system when sizzle deploys it with your @name and @description and your user-content (the script description via Edit Script Info) and see if this meets your needs... if not please report back. I'll close this issue in about 3 days if I don't see/hear any problems.

Martii commented 10 years ago

Leaving open during "needs testing" phase.


Currently deployed.

jesus2099 commented 10 years ago

Thanks Marti, here is what I tested. 音楽 and 音楽の森 show that it works for name and description (summary). But in fact it seems it does not work every time: 直接のリンク and サービス should have found JASRACへの直リンク; even 直リンク (a word that is part of the name) does not find it.

Martii commented 10 years ago

MAJOR REEDIT:

hmmm may have found a solution...

Martii commented 10 years ago

I'm out of ideas, i.e. stuck... it was working on dev before I went to sleep but now it's not. The only time this currently works 100% consistently is when prefixStr and fullStr are identical, i.e. no word boundary when non-ASCII terms are encountered. We might have to drop the "starting with" search feature and just use full searches when non-ASCII is detected, then reimplement it when regex has full Unicode support (whatever year that is). So it's either dropping that specific feature in that use case, or another yet-to-be-presented solution... this needs discussion, a vote, and then either implementation with a possible override, or tabling. In other words: what to do next?
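For discussion, a minimal sketch of that fallback. Only the names prefixStr and fullStr come from the comment above; the helper `buildSearchStrings` and the exact shape of the patterns are assumptions and may not match what modelQuery.js really builds.

```javascript
// Escape regex metacharacters in user input.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper returning regex sources for the two search modes.
function buildSearchStrings(word) {
  const escaped = escapeRegExp(word);
  if (/[^\x00-\x7F]/.test(word)) {
    // Non-ASCII: \b is unreliable, so both patterns degrade to a
    // plain substring match with no word boundary at all.
    return { prefixStr: escaped, fullStr: escaped };
  }
  // ASCII: keep "starts with" and "whole word" matching.
  return { prefixStr: '\\b' + escaped, fullStr: '\\b' + escaped + '\\b' };
}

const cjk = buildSearchStrings('直リンク');
console.log(cjk.prefixStr === cjk.fullStr); // true
console.log(new RegExp(cjk.prefixStr, 'i').test('JASRACへの直リンク')); // true

const ascii = buildSearchStrings('hel');
console.log(new RegExp(ascii.prefixStr, 'i').test('Hello World')); // true
console.log(new RegExp(ascii.fullStr, 'i').test('Hello World')); // false
```

The trade-off is that non-ASCII terms match anywhere in the field, which is looser than the ASCII behaviour but at least finds 直リンク inside a longer name.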

I also noticed that the prop values for fullSearchFields cover different data than what is used in prefixSearchFields. This may be a bug elsewhere... but I don't see where these are initially set.

Some helpful reference material evaluated for a lot of this:

Martii commented 9 years ago

Had some additional transient thoughts on this for the next assignee:

Martii commented 8 years ago

@jesus2099

> But in fact it seems it does not work every time: 直接のリンク and サービス should have found JASRACへの直リンク; even 直リンク (a word that is part of the name) does not find it.

Exactly which scripts of yours are each of the quoted queries supposed to find again?

EDITED: Would you simplify the search down to the smallest query? e.g. preferably one and two characters only please.

Found one... but some of those characters don't seem to exist as whole words. Thanks.

Martii commented 8 years ago

Just a FYI https://openuserjs.org/?q=ello+world doesn't pull up my related unit test scripts... I don't think it does mid-word to end-word searches... just beginning to mid. I'll reverify that in a few days in the code itself. (after this holiday)

Martii commented 8 years ago

Temporarily forked your script here with adding a space and performed this query... so most likely it is from beginning of word to mid searching.

Refs:

jesus2099 commented 8 years ago

The particularity of CJK text is that you should not expect space characters anywhere: just characters, punctuation and line breaks. But maybe the search engine we have simply cannot cope with it. Tell me if you still have questions. :)

Martii commented 8 years ago

Well, I think the answer I linked above in the refs covers what you are looking for. Everything I've skimmed over in my latest research says it matches whole words, and every example begins at the start of a word, not in the middle of it... like I said, I'll double check our code to see if we are doing something different... but I doubt we are going to sub-index words (assuming I'm using their terminology correctly)... I don't believe we have the CPU/memory power for that, and it could get seriously expensive if we tried.

From what I've read on MongoDB it's majorly used at v2.x and v3.x is on our development but searching does the same thing on local pro vs development... so I don't think you are going to get mid-word to end search capability. If I can't find an answer I'll have to add the "tracking upstream" label... it may appear somewhere along code migrations but it could be a loooooooooooong time (as we've already experienced). I'm not ready to dive into another DB system either as I'm just entering figuring out MongoDB... plus I don't think sizzle would like that large of a change at this time.

jesus2099 commented 8 years ago

OK, but just for the record, you speak about words but in the sense of unspaced characters, not words, here. サービス is a word, 直接のリンク are 3 words (sort of): 直接, の and リンク. 音楽 is a word, 音楽の森 are 3: 音楽, の and 森. We can’t expect Japanese people to search for separate words. It is completely impossible to imagine spaced-out text like 音楽 の 森 の 直接 の リンク (in script descriptions in particular, but in any text in general). So MongoDB is not CJK friendly. :) Thanks for your time, as always.

jesus2099 commented 8 years ago

I agree it is difficult. I think Yahoo, Google, etc. (for both the text that the user types and for indexing pages) try each character as if there were spaces, and also try sets of consecutive characters against their dictionary.
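As a rough illustration of the "each character as if there were spaces" idea, here is a generic character n-gram tokenizer. It is only a sketch: it is not a description of how any particular search engine works, the function name `ngrams` is made up for this example, and the dictionary lookup step is left out entirely.

```javascript
// Split a string into overlapping n-character tokens. Array.from counts
// by code point, so surrogate pairs are kept intact.
function ngrams(text, n) {
  const chars = Array.from(text);
  const tokens = [];
  for (let i = 0; i + n <= chars.length; i++) {
    tokens.push(chars.slice(i, i + n).join(''));
  }
  return tokens;
}

console.log(ngrams('音楽の森', 1)); // [ '音', '楽', 'の', '森' ]
console.log(ngrams('音楽の森', 2)); // [ '音楽', '楽の', 'の森' ]
```

Indexing every such token is what makes mid-text matches possible, and it is also why the index grows quickly with text length.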

Martii commented 8 years ago

> Google, etc. (for both the text that the user types and for indexing pages) try each character as if there were spaces, and also try sets of consecutive characters against their dictionary.

Well funny that you should mention that... I just dealt with trespasserW on this in a script issue... there's a limit on the number of characters they will process... which this exact issue may explain why. OUJS doesn't have as much processing power as a search engine does. Attempting to do that would be a vain and very expensive effort and probably would affect how OUJS is presented. Currently we have no Ads but I guarantee that there would be if we even tried to compete with the processing power of a search engine. Privacy would be a thing of the past trying to purchase server clouds, internet backbones, etc... unless everyone involved has a huge pocketbook to contribute I think it may have to be as is.

> is not CJK friendly.

I don't think computer languages in general are to that type of language. The space is an important delimiter. I don't mean to sound rude or insensitive but CJK should adopt some sort of space (as a breather at the very least) because no person is able to speak or think without pauses and you'll eventually run out of parchment paper... so there has to be "breaks" with words somehow. ;) Human brains have always been more adaptable than most computers too... which is why they haven't taken over. ;) ... alas I'm drifting off topic here.

> サービス is a word

In most native C/C++ code (which is what JavaScript engines are written in) that would be a string delimited by a terminating null... and creating a clause requires spaces as part of the grammar.

When I was trying translating some of your text I noticed that one character meant one thing and then the next character changed its meaning too... at least according to google...

1) straight 1 + 2) direct

... and so on.

Anyhow... if you find something in those refs that solves the situation, even if it's at a later date, by all means please let everyone know. Part of the experience on OUJS _(and node)_ is contributing with whatever capabilities one has. :)

Martii commented 8 years ago

Here's a thought... since you know CJK way better than I do, would you be willing to run some tests with all the HTML entities (Unicode versions though, and I'm still not sure if we transform from UTF-16 yet) for the different types of spaces? The non-breaking space is the first one that comes to mind.

jesus2099 commented 8 years ago

I guess there is a zero-width space in the Unicode table; it seems to be what you are looking for. :) But you can’t ask millions of people to use spaces when they have never done so. ;) Your remark is good: some words are compound, like “step father” in English. In English you are free to join compound words or to leave them separated by a space. The fact that the space before and after a compound word is the same as the space inside it is no problem for the reader, who understands from context that it is a compound word. It’s the same in CJK (in Japanese at least, which I know). You don’t need anything extra to signal that something is a compound word; spaces are not necessary to distinguish single-character words from compound words, the context is enough. You can see how small the space bar is on Japanese keyboards; it’s merely used to select the next word in predictive typing, and much less often than by us Latin alphabet users. When a monosyllabic language like Vietnamese, which previously used Chinese characters with no spaces, started using the Latin alphabet, it had to use spaces. :)
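For anyone running the space tests suggested above, a small sketch of one thing to watch for: in JavaScript the zero-width space (U+200B) is not matched by `\s`, so a splitter has to name it explicitly. The sample string is constructed here purely for illustration.

```javascript
const ZWSP = '\u200B'; // ZERO WIDTH SPACE

// Visually identical to 音楽の森, but with invisible separators inside.
const spaced = ['音楽', 'の', '森'].join(ZWSP);

console.log(spaced.length); // 6 (4 visible characters + 2 separators)
console.log(/\s/.test(ZWSP)); // false
console.log(spaced.split(/\s+/).length); // 1 (nothing was split)
console.log(spaced.split(/[\s\u200B]+/)); // [ '音楽', 'の', '森' ]
```

The non-breaking space (U+00A0), by contrast, is covered by `\s` in JavaScript.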

Martii commented 8 years ago

> But you can’t ask millions of people to use spaces when they have never done

Sure I can! :wink: :smile:

iamgoingtotrysomethinghereandseeifyoucanreaditinenglish.thiswillbeignoringallcapitalizationbutkeepingpunctuation.itseemstomethatifihadanovelwrittenlikethisitwouldnotbesearchablebyanymeans. ;)

> zero width space in the unicode table

Ahh thanks.. that's the other one I couldn't think of.

jesus2099 commented 8 years ago

I have just sent a private message in your sourceforge account (it seems impossible here). :wink:

sizzlemctwizzle commented 8 years ago

Just to jump in and explain how the search currently works:

The search term is broken up using spaces to extract "words" (more like ordered character groups). Multiple fields are searched for the presence of all words from the search term. So if the search term contains two "words", both must be present in the script title for that field to match. Some fields are searched for exact matches on "words", and others only care if the beginning of a word matches a search word.
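The description above can be sketched roughly as follows. This is a hypothetical in-memory version for a single field: the function name `matchesField` and the `mode` values are made up, and the real site builds MongoDB regex queries instead.

```javascript
// Escape regex metacharacters in user input.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Every space-separated "word" of the term must match the field value.
// mode 'exact' requires a whole word, mode 'prefix' only the beginning of one.
function matchesField(value, term, mode) {
  const words = term.split(/\s+/).filter(Boolean);
  return words.every((word) => {
    const start = '\\b' + escapeRegExp(word);
    const source = mode === 'exact' ? start + '\\b' : start;
    return new RegExp(source, 'i').test(value);
  });
}

console.log(matchesField('Hello World test', 'hel wor', 'prefix')); // true
console.log(matchesField('Hello World test', 'hel wor', 'exact')); // false
console.log(matchesField('Hello World test', 'ello world', 'prefix')); // false
```

The last call shows the behaviour reported earlier in the thread: a term that starts in the middle of a word does not match in this model.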

Not exactly the best search algorithm, but I'd have to quit my job and use NLP to build a really good search engine.

Martii commented 8 years ago

@jesus2099 Yah it's impossible now... they (GH) removed that a few years ago.

Btw https://github.com/OpenUserJs/OpenUserJS.org/search?utf8=%E2%9C%93&q=novel doesn't find my post above in this issue either... so GH (seems to) require spaces as well.

@sizzlemctwizzle Awesome! Always good to have the genius creator hop in to explain things. :)

jesus2099 commented 8 years ago

> iamgoingtotrysomethinghereandseeifyoucanreaditinenglish.thiswillbeignoringallcapitalizationbutkeepingpunctuation.itseemstomethatifihadanovelwrittenlikethisitwouldnotbesearchablebyanymeans. ;)

I know it’s just fun sarcasm because I already explained, but — just in case — no-space in Japanese is nothing but natural to read, it’s not a challenge as it is with Latin alphabet, where it is nearly impossible. Adding spaces would not help at all and would just end up looking :alien: awkward.

Martii commented 8 years ago

Well it was also a test for GH as I just replied.

Martii commented 8 years ago

@jesus2099

> ... use spaces ...

In all seriousness... a space, at the very least between CJK and en-US text, is usually a good thing. With @name being an "undefined" language, and with most of the multilingual contributors on OUJS using only that key, I think an en space would be a good thing between the different Unicode terms.

jesus2099 commented 8 years ago

It works in GH when you specify issue type (default is code) : thatwordyoutested and even 純文本 (which is not separated with spaces). :blush:

But I am absolutely not underestimating the complexity of making up such a search. I don’t even know anything about it. :beginner:

Martii commented 8 years ago

@jesus2099 Might have been a delay in parsing this issue's content... which would possibly indicate they have more background processing power as well, probably on a low clock cycle with the instances/threads. It's showing up now with my query. They probably have some sort of dictionary caching that MongoDB doesn't... they probably "love me" at GH with all my additions... good test data.

> I don’t even know anything about it.

Well, we can all learn together if everyone is willing and, of course, has the time available. It takes me longer than sizzle to digest what has been said, but I usually start at a slower pace and then move to a faster one once I understand things better.

Martii commented 8 years ago

> private message in your sourceforge account

Don't see it there but I'll keep looking. SF has its own issues too. To allay any possible misunderstandings I appreciate your contributions and queries here.