jesus2099 opened this issue 10 years ago
It probably has to do with the use of a route param instead of a query parameter. This is being fixed with #103.
Not working in current production with 歌詞 pasted into the search box. Landing on https://openuserjs.org/?q=%E6%AD%8C%E8%A9%9E returns no results.
Just in case, I tried with 歌詞コピー (but a real CJK search should also find 歌詞, as words are not separated by spaces) and it doesn't work either.
I'll see what I can do. USO doesn't support this either so flipping label to feature/enhancement and assigning myself for the time being... help is always appreciated though. This is something we should strive to support. Probably just some UTF-8 flag somewhere.
We're using regex to look things up, so it should be possible. Just means we need to change these regexes a bit.
https://github.com/OpenUserJs/OpenUserJS.org/blob/master/libs/modelQuery.js#L46
We're using `\b`:

- `\b`: assert position at a word boundary (`(^\w|\w$|\W\w|\w\W)`)
- `\w`: match any word character (`[a-zA-Z0-9_]`)

So we'll need to make our own `\b` for Unicode.
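A quick sketch of why this fails for CJK (runnable in Node; the 歌詞 term is from the report above):

```javascript
// \b only anchors where \w ([A-Za-z0-9_]) meets non-\w, so a CJK
// character is always "non-word" and \b never fires around it.
const asciiRe = /\bhello\b/;
const cjkRe = /\b歌詞\b/;

console.log(asciiRe.test('say hello world')); // true
console.log(cjkRe.test('歌詞コピー'));          // false -- no word boundary exists
```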
Misc links:
I think Johan touched on some of what you are referencing on MDN and I used it in one of my scripts for a different purpose... I'll see if I can dig up that link and post momentarily...
Here we go... That way we could possibly search on that kind of modified content... although it may use up quite a bit of drone memory/cycles if unicode is found in a search string... What do you think @Zren?
A native node.js package would be preferred if these fail to meet expectations.
See also:

- `xregexp-all.js`... but doesn't install using `$ npm install xregexp-all.js`
- the npm homepage... does install with `$ npm install xregexp`
If we do end up using any route like these, we should try to only exec this if Unicode is detected, and default to native V8 for everything else.
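That gating could look something like this sketch (`hasNonASCII` is a hypothetical helper name, not code from the repo):

```javascript
// Only take the slower Unicode-aware path when the query actually
// contains a non-ASCII character; plain queries keep the native V8 path.
function hasNonASCII(str) {
  return /[^\u0000-\u007F]/.test(str);
}

console.log(hasNonASCII('hello world')); // false
console.log(hasNonASCII('歌詞コピー'));   // true
```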
Thanks for looking into this! :)
So I tried out the XRegExp package and couldn't get it to work well under the current project architecture... but I did figure out how to emulate (this means not exact, btw) the word boundary for Unicode-enabled strings. Mixed this in with a bit of installWith code (GPL v3+, btw) and it appears to be good to go for intense dev testing.
Works with:

- `/?q=歌詞` returning currently one result on dev... expected (one by me, pseudo-forked from production)
- `/?q=歌詞+text` returning currently one result on dev... expected (one by me, pseudo-forked from production)
- `/?q=hel+pa` returning currently two results on dev... expected (one by sizzle and one by me)

I am not going to submit a PR until this is cross-confirmed as functionally equivalent, plus for the RFE in this issue, by multiple devs... but it can be checked out from this named issue branch number, since my master branch is currently sync'd with upstream/master, if my GH repo is added as a remote.
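For reference, that kind of emulation can be sketched roughly like this (illustrative only, not the exact branch code; the whitespace anchor and escaping are assumptions):

```javascript
// Approximate \b for non-ASCII terms by anchoring on start-of-string
// or whitespace instead of \w/\W transitions (an emulation, not exact).
function unicodeBoundary(term) {
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // escape regex metacharacters
  return new RegExp('(?:^|\\s)' + escaped);
}

console.log(unicodeBoundary('歌詞').test('純文本 歌詞')); // true
console.log(unicodeBoundary('歌詞').test('コピー歌詞')); // false -- mid-"word", like \b
```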
@Martii I'd argue we could test this better in production (not everyone knows how to build the site), since it doesn't affect ASCII searches and at worst doesn't work on non-ASCII (the current behavior). Can you submit a PR (after you take a look at my comment, of course)?
> Can you submit a PR
I need a break... been reading and testing on this all day today... but I'll see if your line note change meets with my basic testing results after I get some food. :)
Alright ready for final check, merge and deploy @sizzlemctwizzle
@jesus2099 I will need you to test this out fully with your system when sizzle deploys it, with your `@name` and `@description` and your user-content (the script description via Edit Script Info), and see if this meets your needs... if not, please report back. I'll close this issue in about 3 days if I don't see/hear any problems.
Leaving open during "needs testing" phase.
Currently deployed.
MAJOR REEDIT: hmmm, may have found a solution...

I'm out of ideas, e.g. hung... it was working on dev before I went to sleep but now it's not... the only time this currently works 100% consistently is when `prefixStr` and `fullStr` are identical, e.g. no word boundary when non-ASCII terms are encountered. We might have to drop the "starting with" search feature and just use full searches when non-ASCII is detected, then reimplement it when re has full Unicode support (whatever year that is). So: drop that specific feature in that use case, or find another yet-to-be-presented solution... this needs discussion, a vote, and implementation with a possible override, or tabling. E.g. what to do next?
I also noticed the `prop` values for `fullSearchFields` cover different data than what is used in `prefixSearchFields`. This may be a bug elsewhere... but I don't see where these are initially set.
Some helpful reference material evaluated for a lot of this:

- `\P{L}` broke things, plus it doesn't cover numbers with the negation (which implies V8 doesn't work properly in this arena)

Had some additional transient thoughts on this for the next assignee:

- `escape()` converts to ASCII percent-encoded.
- `$regex` may use a different regular expression engine (PCRE)... e.g. we might be able to handle Unicode better with Perl's implementation, assuming that is picked and tested against.

@jesus2099
But in fact it seems it does not work at any time: 直接のリンク and サービス should have found JASRACへの直リンク, and even 直リンク (a word that is part of the name) does not find it.
Exactly which scripts of yours are each of the quoted queries supposed to find again?
EDITED: Would you simplify the search down to the smallest query? e.g. preferably one and two characters only please.
Found one... but some of those characters don't seem to exist as whole words. Thanks.
Just an FYI: https://openuserjs.org/?q=ello+world doesn't pull up my related unit test scripts... I don't think it does mid-word to end-word searches... just beginning to mid. I'll reverify that in a few days in the code itself (after this holiday).
Temporarily forked your script here with adding a space and performed this query... so most likely it is from beginning of word to mid searching.
Refs:
The specificity of CJK texts is that you should not expect any space characters anywhere. Just characters, punctuation and line breaks. But maybe the search engine we have simply cannot cope with it. Tell me if you still have some questions. :)
Well, I think this answer covers what you are looking for, which I linked above in the refs. Everything I've skimmed in my latest research says it's whole words, and every example begins at the start of the word, not in the middle... like I said, I'll double-check our code to see if we are doing something different... but I doubt we are going to sub-index words (assuming I'm using their terminology correctly)... I don't believe we have the CPU/memory power for that, and it could get seriously expensive if we tried.
From what I've read, MongoDB is mostly used at v2.x, and v3.x is on our development, but searching behaves the same on local pro vs. development... so I don't think you are going to get mid-word-to-end search capability. If I can't find an answer I'll have to add the "tracking upstream" label... it may appear somewhere along code migrations, but it could be a loooooooooooong time (as we've already experienced). I'm not ready to dive into another DB system either, as I'm just getting into figuring out MongoDB... plus I don't think sizzle would like that large of a change at this time.
OK, but just for the record: you speak about words, but in the sense of unspaced characters, not words here. サービス is a word; 直接のリンク are 3 words (sort of): 直接, の and リンク. 音楽 is a word; 音楽の森 are 3: 音楽, の and 森. We can't expect Japanese to look for separate words. It is completely impossible to imagine things like spaced-out texts: 音楽 の 森 の 直接 の リンク (in script descriptions in particular, but in any texts in general). So MongoDB is not CJK friendly. :) Thanks for your time, as always.
I agree it is difficult. I think Yahoo, Google, etc., for both the text that the user types and for indexing pages, try each character as if there were spaces and also try sets of consecutive characters against their dictionaries.
> google, etc. — for both the text that user types and for indexing pages — tries each characters as if there were spaces and also tries sets of consecutive characters against its dictionary.
Well, funny that you should mention that... I just dealt with trespasserW on this in a script issue... there's a limit on the number of characters they will process... which this exact issue may explain. OUJS doesn't have as much processing power as a search engine does. Attempting that would be a vain and very expensive effort, and it would probably affect how OUJS is presented. Currently we have no ads, but I guarantee there would be if we even tried to compete with the processing power of a search engine. Privacy would be a thing of the past trying to purchase server clouds, internet backbones, etc... unless everyone involved has a huge pocketbook to contribute, I think it may have to stay as is.
> is not CJK friendly.
I don't think computer languages in general are friendly to that type of language. The space is an important delimiter. I don't mean to sound rude or insensitive, but CJK should adopt some sort of space (as a breather at the very least), because no person is able to speak or think without pauses, and you'll eventually run out of parchment paper... so there have to be "breaks" between words somehow. ;) Human brains have always been more adaptable than most computers too... which is why they haven't taken over. ;) ... alas, I'm drifting off topic here.
> サービス is a word
In all computer languages that would be delimited by a terminating string null, usually in C/C++ native apps, which includes JavaScript... and creating a clause requires spaces as part of grammar.
When I was trying to translate some of your text, I noticed that one character meant one thing and then the next character changed its meaning too... at least according to Google...
1) straight 1 + 2) direct
... an so on.
Anyhow... if you find something in those refs that solves the situation, even if it's at a later date, by all means please let everyone know. Part of the experience on OUJS _(and node)_ is contributing with whatever capabilities one has. :)
Here's a thought... since you know CJK way better than I do... Would you be willing to run some tests with all the HTML entities (the Unicode versions, though, and I'm still not sure if we transform from UTF-16 yet) with the different types of spaces? The non-breaking space is the one that comes to mind first.
I guess there is a zero-width space in the Unicode table; it seems to be what you are looking for. :) But you can't ask millions of people to use spaces when they have never done so. ;) Your remark is good: some words are compound, like in English "step father". In English you are free to join compound words or to leave them separated by a space. The fact that there is a space before and after a compound word, the same space as inside it, is no problem for the reader to understand that these form a compound word, thanks to context. It's the same in CJK (I know Japanese at least): you don't need something supplemental to say "warning, this is a compound word"; spaces are not necessary to distinguish single-"character" words from compound words, the context is enough. You can see how small the space bar is on Japanese keyboards; it's merely used to select the next word in predictive typing, much less often than by us Latin-alphabet users. When a monosyllabic language like Vietnamese, which beforehand used Chinese characters with no spaces, started using the Latin alphabet, it had to use spaces. :)
> But you can’t ask millions of people to use spaces when they have never done
Sure I can! :wink: :smile:
iamgoingtotrysomethinghereandseeifyoucanreaditinenglish.thiswillbeignoringallcapitalizationbutkeepingpunctuation.itseemstomethatifihadanovelwrittenlikethisitwouldnotbesearchablebyanymeans. ;)
> zero width space in the unicode table
Ahh thanks.. that's the other one I couldn't think of.
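For what it's worth, JavaScript's `\s` treats those two spaces differently, which matters for a space-splitting search:

```javascript
// NBSP (U+00A0) counts as whitespace in JavaScript regexes,
// but the zero-width space (U+200B) does not (category Cf, not Zs).
console.log(/\s/.test('\u00A0')); // true  -- non-breaking space
console.log(/\s/.test('\u200B')); // false -- zero-width space
```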
I have just sent a private message to your SourceForge account (it seems impossible here). :wink:
Just to jump in and explain how the search currently works:
The search term is broken up using spaces to extract "words" (more like ordered character groups). Multiple fields are searched for the presence of all words from the search term. So if the search term contains two "words", both must be present in the script title for that field to match. Some fields are searched for exact matches on "words", and others only care whether the beginning of a word matches a search word.
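That matching logic can be sketched roughly like this (illustrative only; `fieldMatches` is a made-up name, not the actual modelQuery.js code):

```javascript
// Split the search term on spaces and require every "word" to be
// present; here each word must match at the start of a word (prefix).
function fieldMatches(searchTerm, fieldValue) {
  return searchTerm.split(/\s+/).every(function (word) {
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    return new RegExp('\\b' + escaped).test(fieldValue);
  });
}

console.log(fieldMatches('hel pa', 'hello paste helper')); // true
console.log(fieldMatches('ello world', 'hello world'));    // false -- mid-word start
```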
Not exactly the best search algorithm, but I'd have to quit my job and use NLP to build a really good search engine. On Dec 23, 2015 1:39 AM, "Marti Martz" notifications@github.com wrote:
Just a FYI https://openuserjs.org/?q=ello+world doesn't pull up my unit test scripts... I don't think it does mid-word to end searches... just beginning to mid. I'll reverify that in a few days.
— Reply to this email directly or view it on GitHub https://github.com/OpenUserJs/OpenUserJS.org/issues/133#issuecomment-166826776 .
@jesus2099 Yah it's impossible now... they (GH) removed that a few years ago.
Btw https://github.com/OpenUserJs/OpenUserJS.org/search?utf8=%E2%9C%93&q=novel doesn't find my post above in this issue either... so GH seems to require spaces as well.
@sizzlemctwizzle Awesome! Always good to have the genius creator hop in to explain things. :)
> iamgoingtotrysomethinghereandseeifyoucanreaditinenglish.thiswillbeignoringallcapitalizationbutkeepingpunctuation.itseemstomethatifihadanovelwrittenlikethisitwouldnotbesearchablebyanymeans. ;)
I know it’s just fun sarcasm because I already explained, but — just in case — no-space in Japanese is nothing but natural to read, it’s not a challenge as it is with Latin alphabet, where it is nearly impossible. Adding spaces would not help at all and would just end up looking :alien: awkward.
Well it was also a test for GH as I just replied.
@jesus2099

> ... use spaces ...

In all seriousness... at the very least between CJK and en-US it is usually a good thing. With `@name` being an "undefined" language, and most of the multilingual contributors on OUJS using only it, I think an en-SPACE would be a good thing between the different Unicode terms.
It works in GH when you specify the issue type (default is code): thatwordyoutested and even 純文本 (which is not separated with spaces). :blush:
But I am absolutely not underestimating the complexity of making up such a search. I don’t even know anything about it. :beginner:
@jesus2099 Might have been a delay in parsing this issue's content... which would possibly indicate they have more background processing power as well, probably on a low clock cycle with the instances/threads. It's showing up now with my query. They probably have some sort of dictionary caching that MongoDB doesn't... they probably "love me" at GH with all my additions... good test data.
> I don’t even know anything about it.
Well, we can all learn together if everyone is willing and of course has available time. It takes me longer to digest what has been said than sizzle, comparatively, but I usually start at a slower pace and move to a faster one when I understand things better.
> private message in your sourceforge account
Don't see it there, but I'll keep looking. SF has its own issues too. To allay any possible misunderstandings: I appreciate your contributions and queries here.
Hello, it seems that non-Latin-script search is not functioning. 歌詞 should find kasi. PLAIN TEXT LYRICS 歌詞コピー 純文本歌詞 (same if I try to manually encode the URL with `%E6%AD%8C%E8%A9%9E`).