I would argue that you do have a language model, but one that is created on the fly.
I have completed my project. The code is at https://github.com/verachell/Algorithm-pretending-to-be-AI-NaNoGenMo-2024
The "novels" are here, they differ in the keywords that were input by the user:
The stories are not coherent (not even as coherent as I'd hoped, which wasn't a high bar), but the different keywords did yield different results. Here are some examples to compare:
Uranus, Neptune--each of bird calls and steam."
"We know I'd been stopped at feet, and equally significant changes its prey, going too.
York sidereal time spectroscopic binaries known, including how I'd want me away until you're too sorry now.
Allan thinks so funny in Diana's voice indicated her so."
"Well now--if it needn't prey any garden rakes?" stammered Matthew.
"Very well! suppose E magnifying is vastly more being we thus cause it needn't prey any anchorage; and oh, Marilla, there really thought crossed two spectra furnished them after you could assign, were given directly in appearance simultaneously as follows: 2h. 44.5m. First Contact was continually shifted relative state is shown from flowing into pendants, and, although Saturn a prey to them," replied Conseil, "or the curious bird. Conseil rose must contain some hours."
Uranus we might recite it needn't prey any question the curious bird. Conseil appeared.
What good only exclusively marine quadruped. This correction ought only ties which now seventy-four years wherever science led. Never once told her hard Macadam, water running diagonally across water rapidly when they're thought they'd look too cold the ice; a cold, gray December evening, and perpetual cold, mitigated by astronomers, but your place are middling—a sorry because nobody ever built, let each group covers an ice had heard moving. My eyes, bearing upon No.
"I'd let any relation of cold."
From Fig. determine numerically their cold, fleshless hands is broken. You'll feel bad?"
"I'm real natural curiosities of "ice blink." However thick is based lies above sea level--i. e., followed these ice at White Sands, and cold, sinkaway feelings as cold when an ice-field, the ice-field, crushing it wrong after nightfall. But his body." "My ankle," gasped Anne.
The program checks whether the keywords appear in the text sources; if none of them do, it informs the user and halts. Here is an example of what is printed on screen for the user input keywords `blog` and `influencer` (words that you can reasonably assume would not be present in Project Gutenberg books):
```
Getting text...
Enter your desired keywords separated by spaces: blog influencer
Sorting out words...
* I don't have any information in my data involving "blog" and "influencer". Therefore "blog" and "influencer" won't appear in the final output
Sorry, I can't continue. I don't have information in my data about any of the keywords you entered
```
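To make that behaviour concrete, here is a minimal sketch in Python of such a keyword check. This is my own illustration based on the description above, not the project's actual code; in particular, `SOURCE_FILE` is a hypothetical filename standing in for however the program actually stores its source text.

```python
# Minimal illustrative sketch of the keyword check, not the project's actual
# code. SOURCE_FILE is a hypothetical filename standing in for however the
# program stores its Project Gutenberg text.
SOURCE_FILE = "gutenberg_sources.txt"

with open(SOURCE_FILE, encoding="utf-8") as f:
    source_words = set(f.read().lower().split())

keywords = input("Enter your desired keywords separated by spaces: ").split()
missing = [kw for kw in keywords if kw.lower() not in source_words]

if missing:
    quoted = " and ".join(f'"{kw}"' for kw in missing)
    print(f"* I don't have any information in my data involving {quoted}. "
          f"Therefore {quoted} won't appear in the final output")

if len(missing) == len(keywords):
    # None of the keywords occur anywhere in the sources, so give up.
    print("Sorry, I can't continue. I don't have information in my data "
          "about any of the keywords you entered")
    raise SystemExit(1)
```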
If there are already AIs pretending to be people, and people pretending to be AIs... we may as well have an algorithm that pretends to be an AI.
Because why not?!
This algorithm will generate 50,000 words using text from Project Gutenberg books as source material, together with keywords supplied by the user.
So you can think of it as a tiny language model, but in this case it's purely algorithmic.
A fellow participant is also working on a very small language model (see https://github.com/NaNoGenMo/2024/issues/4), but the two projects are distinct and implemented differently.
## How the algorithm works
My algorithm will be a meld of Markov chaining and the cut-up method, mashed together into a single algorithm. Cut-up is a technique in which a text is segmented and re-arranged to form new text.
Briefly, I start by segmenting the text at its most commonly used words. These are defined as the most common words in the text sources used (not necessarily in the English language in general). Segmentation is done such that each resulting fragment begins and ends with one of these commonly used words.
As a brief example, take the text

```
a cat was sitting on a mat and a hamster was running on a wheel and playing
```

There aren't many words here, so let's just assume our most commonly used words are `a`, `on`, `and`, and `was`. The fragments would become:

```
a cat was
was sitting on
on a
a mat and
and a
a hamster was
was running on
on a
a wheel and
```
The trailing `and playing` is omitted because it doesn't end with a commonly used word.
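To illustrate the segmentation step, here is a minimal sketch in Python. It is my own reconstruction from the description above, not the project's actual code, and details such as how ties between equally common words are broken are assumptions. Running it on the example sentence reproduces the fragment list above.

```python
from collections import Counter

def most_common_words(words, n):
    """The n most frequent words in the source (ties broken arbitrarily)."""
    return {w for w, _ in Counter(words).most_common(n)}

def segment(text, common):
    """Cut the text so each fragment begins and ends with a common word."""
    fragments, current = [], None
    for word in text.split():
        if word in common:
            if current is not None:
                current.append(word)      # close the open fragment on this word
                fragments.append(current)
            current = [word]              # and start the next fragment with it
        elif current is not None:
            current.append(word)
    # Any trailing words that never reach another common word are dropped,
    # e.g. "and playing" in the example above.
    return fragments

text = "a cat was sitting on a mat and a hamster was running on a wheel and playing"
common = most_common_words(text.split(), 4)   # here: {"a", "on", "and", "was"}
for frag in segment(text, common):
    print(" ".join(frag))
```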
Those fragments then form the basis of a Markov chain lookup table. Suppose I randomly start with the fragment `a hamster was`. The next fragment must start with `was`, so the algorithm narrows the choice down to `was sitting on` and `was running on`; it picks one of these options at random, and so on. However, if one of the options contains one of the user keywords, it picks from among the options containing that keyword.

This is a bit different from regular Markov chaining in that the lengths of the fragments in my table vary. For example, if the original text contains a phrase with several uncommon words in a row, they all wind up together in one fragment, because the text is only cut at the most common words. By contrast, Markov chaining is usually implemented with fragments of constant length, typically 2 or 3 words.
I'm hoping that this method might generate slightly more coherent text than standard Markov chaining, but it's not clear whether this would really be the case in practice.