ben4808 / block-quarry-ui

MIT License

Query syntax suggestions/improvements #1

Open frozenpandaman opened 2 years ago

frozenpandaman commented 2 years ago

Hi! This is a wonderful tool – I particularly love the unique datasets like Wheel of Fortune, podcasts, etc. I almost wonder if the 'newspaper' one could be included to pull in the NOW corpus which is used along with COCA and others in linguistics quite a bit… this would provide access to incredibly recent, fresh words and phrases.

One thing I'd love to see is the query language supporting elementary regex or some more custom characters. A huge improvement would be the ability to use @ or #, for example, to search for only vowels or consonants in that position – instead of . which looks for all letters. Also, being able to write [^s] to exclude s from occurring in that position.
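
To make the request concrete, here is a minimal sketch of how the proposed symbols could be translated into an ordinary regex. It is not code from this repo: the function names are hypothetical, the choice to treat `y` as a consonant is an assumption, and `.` is taken to mean "any letter" as described above.

```python
import re

VOWELS = "aeiou"
# Assumption: "y" is counted as a consonant; a real implementation might differ.
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate the proposed query syntax into a compiled regex (hypothetical helper)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == ".":
            parts.append("[a-z]")              # any letter
        elif ch == "@":
            parts.append(f"[{VOWELS}]")        # any vowel
        elif ch == "#":
            parts.append(f"[{CONSONANTS}]")    # any consonant
        elif ch == "[":
            end = pattern.find("]", i)
            if end == -1:
                raise ValueError("unclosed '[' in pattern")
            body = pattern[i + 1:end]
            negated = body.startswith("^")
            letters = (body[1:] if negated else body).lower()
            if not (letters.isascii() and letters.isalpha()):
                raise ValueError(f"bad character class: [{body}]")
            if negated:
                # Any letter except the listed ones.
                parts.append(f"(?![{letters}])[a-z]")
            else:
                parts.append(f"[{letters}]")
            i = end                            # skip past the closing bracket
        elif ch.isascii() and ch.isalpha():
            parts.append(ch.lower())           # literal letter
        else:
            raise ValueError(f"unsupported character: {ch!r}")
        i += 1
    return re.compile("".join(parts))


def matches(pattern: str, word: str) -> bool:
    """True if the whole word fits the pattern, one symbol per letter."""
    return pattern_to_regex(pattern).fullmatch(word.lower()) is not None
```

With this sketch, `matches("c@t", "cat")` holds because `a` is a vowel, while `matches("[^s]at", "sat")` does not because `s` is excluded from the first position.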

I'm not super familiar with JS/TS and not sure how the codebase is structured, but if you're able to point me to the file where the "query language" is implemented, I might be able to help out as well and work toward adding in even some basic functionality in this vein.

(I'm also using the tool a bit differently than intended – not to build my wordlist, but rather as a wordlist itself!)

Looking forward to the future of the project!

bzoon commented 2 years ago

I'm glad you're enjoying the tool!

First, to get the full word list, go ahead and hit https://blockquarry.net/?getAllExplored and wait a few seconds for it to download. This endpoint is not advertised because I may consider charging for it someday, but the site isn't to that point yet. Once the full list is downloaded, you can run regex queries on it in any text editor.
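
If a text editor is not convenient, the same kind of regex query can be scripted. The sketch below is only an illustration: it assumes the download is plain text with one entry per line, which this thread does not confirm, and `grep_entries` and the file name in the usage note are hypothetical.

```python
import re
from typing import Iterable, Iterator


def grep_entries(lines: Iterable[str], regex: str) -> Iterator[str]:
    """Yield each non-empty entry that the regex matches in full."""
    compiled = re.compile(regex, re.IGNORECASE)
    for line in lines:
        entry = line.strip()
        # fullmatch means the regex must cover the entire entry,
        # so r"[^s].a.e" only returns five-letter results.
        if entry and compiled.fullmatch(entry):
            yield entry


# Small in-memory stand-in for the downloaded list.
sample = ["stone\n", "crane\n", "\n", "Brave\n"]
hits = list(grep_entries(sample, r"[^s].a.e"))
```

To run it against a real file, pass an open file handle instead of `sample`, for example `open("explored.txt", encoding="utf-8")`, where the file name is a placeholder for wherever the download was saved.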

Second, if you have better datasets downloaded, feel free to email them to me and I can see about getting them into the app.

Finally, if you want to dive into the code itself, that would be great! With this repo and block-quarry-api and a local PostgreSQL installation, it shouldn't be TOO hard to get running on your machine. The database schema can be loaded using https://github.com/ben4808/block-quarry-api/blob/main/schema2_postgres.sql and https://github.com/ben4808/block-quarry-api/blob/main/sp2_postgres.sql. Converting the text query into SQL is done in the explored_query and frontier_query_indexed stored procedures. I haven't implemented regex queries myself because I have a per-letter index scheme that is super optimized for this specific type of query. A regex search would probably perform well on the explored list, but not on huge datasets like Podcasts.
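
To illustrate why a per-letter index suits fixed-letter queries better than a regex scan, here is a rough in-memory analogy. It is not the scheme in the stored procedures, which live in PostgreSQL and are not reproduced here; the `LetterIndex` class and its layout are assumptions made for the example.

```python
from collections import defaultdict
from typing import Iterable, List


class LetterIndex:
    """Toy index keyed by (word length, position, letter). Hypothetical, for illustration."""

    def __init__(self, words: Iterable[str]) -> None:
        self.by_length = defaultdict(set)
        self.by_slot = defaultdict(set)
        for word in words:
            word = word.lower()
            self.by_length[len(word)].add(word)
            for pos, letter in enumerate(word):
                self.by_slot[(len(word), pos, letter)].add(word)

    def query(self, pattern: str) -> List[str]:
        """Answer a pattern where '.' is a wildcard and letters are fixed."""
        pattern = pattern.lower()
        n = len(pattern)
        # One candidate set per fixed letter; wildcards add no constraint.
        candidate_sets = [
            self.by_slot.get((n, pos, ch), set())
            for pos, ch in enumerate(pattern)
            if ch != "."
        ]
        if not candidate_sets:
            return sorted(self.by_length.get(n, set()))
        # Intersect starting from the smallest set to keep the work low.
        candidate_sets.sort(key=len)
        result = set(candidate_sets[0])
        for other in candidate_sets[1:]:
            result &= other
        return sorted(result)


index = LetterIndex(["cat", "cot", "cut", "dog", "cart"])
```

Here `index.query("c.t")` only touches the entries filed under "length 3, position 0, c" and "length 3, position 2, t", so the cost depends on how many words share those letters rather than on the size of the whole dataset. A general regex offers no such shortcut and has to be tested against every entry.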

frozenpandaman commented 2 years ago

@bzoon Awesome, thank you!! (also wait why do you have two github accounts lol)

So that's just like a massive wordlist then, haha. How often does it get updated, and is the process automatic?

I understand potentially charging for it, though I'm an open-source developer and feel like the crosswords/cruciverbalist world desperately needs more high-quality, free tools and resources. So much exclusive, paywalled stuff… it's pretty shocking as someone new to the space! Maybe consider asking for donations? I think if a tool really helps someone, they often will donate a bit if they have the means! (As long as it's not subscription model I'll survive, though… and I know this is all a ways off, anyway!)

I think it would be very cool to include one of these large corpora some day – like iWeb or NOW – though the full data would cost me $375 as a student (and more when I graduate… 😅) so maybe a bit out of scope! But it'd be very cool to somehow get new phrases from news articles, say, added into the datasets… almost reminds me of what @MaxBittker's @NYT_first_said twitterbot is doing, though it'd have to be a bit more curated!

bzoon commented 2 years ago

The idea was for it to be crowdsourced: it helps crossword constructors while they are filling their grids, and their input improves the data for everybody. The biggest problem to be solved is how to put a number on the quality of an entry, and currently I don't have a better technology for this than the pre-trained neural nets that are people's brains. This interface just makes it as physically easy as possible for people to download their brain info into the database.

If you find or develop any good data sources for new phrases, let me know! It's harder than it looks. For example, that NYT twitter account is pretty much all stuff that I would never use in a crossword, simply because no one would ever get the answer! I think the next step is to dive into more niche terms that can be tagged with topics to allow people to build tailored crosswords for certain interests. I also want to allow users to upload their own word lists to query alongside the existing ones.

I sympathize with your sentiment about too many people charging for word lists. My word list is, I believe, the best free word list out there. Right now the app isn't really in a marketable state so the only reason I'd think about monetizing it is if people started using it enough to generate larger AWS bills, but for now it's basically just me that uses it for my puzzles.

Finally, it is nice to see someone else get excited about this lol. I posted it once to the Crosshare Discord, and zero people found it useful. Together with Crosshatch, it makes constructing crosswords so much better! I don't even use Crossfire anymore. That being said, I'm germinating a new idea that I think will be even cooler, and as such I haven't been as motivated to work on this stuff lately. I will likely come back to it at some point though.