I would argue that you do have a language model, but one that is created on the fly.
I have completed my project. The code is at https://github.com/verachell/Algorithm-pretending-to-be-AI-NaNoGenMo-2024
The "novels" are here, they differ in the keywords that were input by the user:
The stories are not coherent (not even as coherent as I'd hoped, which wasn't a high bar), but the different keywords did yield different results. Here are some examples to compare:
Uranus, Neptune--each of bird calls and steam."
"We know I'd been stopped at feet, and equally significant changes its prey, going too.
York sidereal time spectroscopic binaries known, including how I'd want me away until you're too sorry now.
Allan thinks so funny in Diana's voice indicated her so."
"Well now--if it needn't prey any garden rakes?" stammered Matthew.
"Very well! suppose E magnifying is vastly more being we thus cause it needn't prey any anchorage; and oh, Marilla, there really thought crossed two spectra furnished them after you could assign, were given directly in appearance simultaneously as follows: 2h. 44.5m. First Contact was continually shifted relative state is shown from flowing into pendants, and, although Saturn a prey to them," replied Conseil, "or the curious bird. Conseil rose must contain some hours."
Uranus we might recite it needn't prey any question the curious bird. Conseil appeared.
What good only exclusively marine quadruped. This correction ought only ties which now seventy-four years wherever science led. Never once told her hard Macadam, water running diagonally across water rapidly when they're thought they'd look too cold the ice; a cold, gray December evening, and perpetual cold, mitigated by astronomers, but your place are middling—a sorry because nobody ever built, let each group covers an ice had heard moving. My eyes, bearing upon No.
"I'd let any relation of cold."
From Fig. determine numerically their cold, fleshless hands is broken. You'll feel bad?"
"I'm real natural curiosities of "ice blink." However thick is based lies above sea level--i. e., followed these ice at White Sands, and cold, sinkaway feelings as cold when an ice-field, the ice-field, crushing it wrong after nightfall. But his body." "My ankle," gasped Anne.
The program checks whether the keywords appear in the text sources; if none of them do, it informs the user and halts. Here is an example of what is printed on screen for the user input keywords `blog` and `influencer` (words that you can reasonably assume would not be present in Project Gutenberg books):
```
Getting text...
Enter your desired keywords separated by spaces: blog influencer
Sorting out words...
* I don't have any information in my data involving "blog" and "influencer". Therefore "blog" and "influencer" won't appear in the final output
Sorry, I can't continue. I don't have information in my data about any of the keywords you entered
```
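To make that behaviour concrete, here is a minimal sketch in Python of such a keyword check. This is my own illustration based on the description above, not the project's actual code; in particular, `SOURCE_FILE` is a hypothetical filename standing in for however the program actually stores its source text.

```python
# Minimal illustrative sketch of the keyword check, not the project's actual
# code. SOURCE_FILE is a hypothetical filename standing in for however the
# program stores its Project Gutenberg text.
SOURCE_FILE = "gutenberg_sources.txt"

with open(SOURCE_FILE, encoding="utf-8") as f:
    source_words = set(f.read().lower().split())

keywords = input("Enter your desired keywords separated by spaces: ").split()
missing = [kw for kw in keywords if kw.lower() not in source_words]

if missing:
    quoted = " and ".join(f'"{kw}"' for kw in missing)
    print(f"* I don't have any information in my data involving {quoted}. "
          f"Therefore {quoted} won't appear in the final output")

if len(missing) == len(keywords):
    # None of the keywords occur anywhere in the sources, so give up.
    print("Sorry, I can't continue. I don't have information in my data "
          "about any of the keywords you entered")
    raise SystemExit(1)
```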
If there are already AIs pretending to be people, and people pretending to be AIs... we may as well have an algorithm that pretends to be an AI.
Because why not?!
This algorithm will generate 50,000 words using text from Project Gutenberg books as source material, together with keywords supplied by the user.
So you can think of it as a tiny language model, but in this case it's purely algorithmic.
A fellow participant is also working on a very small language model (see https://github.com/NaNoGenMo/2024/issues/4), but the two projects are distinct and implemented differently.
## How the algorithm works
My algorithm will be a meld of Markov chaining and the cut-up method, mashed together into a single algorithm. Cut-up is a technique in which a text is segmented and re-arranged to form new text.
Briefly, I start by segmenting the text at its most commonly used words. These are defined as the most common words in the text sources used (not necessarily in the English language in general). Segmentation is done such that each resulting fragment begins and ends with one of these commonly used words.
As a brief example, take the text

```
a cat was sitting on a mat and a hamster was running on a wheel and playing
```

There aren't many words here, so let's just assume our most commonly used words are `a`, `on`, `and`, and `was`. The fragments would become:

```
a cat was
was sitting on
on a
a mat and
and a
a hamster was
was running on
on a
a wheel and
```
The trailing `and playing` is omitted because it doesn't end with a commonly used word.
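To illustrate the segmentation step, here is a minimal sketch in Python. It is my own reconstruction from the description above, not the project's actual code, and details such as how ties between equally common words are broken are assumptions. Running it on the example sentence reproduces the fragment list above.

```python
from collections import Counter

def most_common_words(words, n):
    """The n most frequent words in the source (ties broken arbitrarily)."""
    return {w for w, _ in Counter(words).most_common(n)}

def segment(text, common):
    """Cut the text so each fragment begins and ends with a common word."""
    fragments, current = [], None
    for word in text.split():
        if word in common:
            if current is not None:
                current.append(word)      # close the open fragment on this word
                fragments.append(current)
            current = [word]              # and start the next fragment with it
        elif current is not None:
            current.append(word)
    # Any trailing words that never reach another common word are dropped,
    # e.g. "and playing" in the example above.
    return fragments

text = "a cat was sitting on a mat and a hamster was running on a wheel and playing"
common = most_common_words(text.split(), 4)   # here: {"a", "on", "and", "was"}
for frag in segment(text, common):
    print(" ".join(frag))
```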
Those fragments then form the basis of a Markov chain lookup table. Suppose I randomly start with the fragment `a hamster was`. The next fragment must start with `was`, so the algorithm narrows the choice down to `was sitting on` and `was running on`; it picks one of these options at random, and so on. However, if one of the options contains one of the user keywords, it picks from among the options containing that keyword.

This is a bit different from regular Markov chaining in that the lengths of the fragments in my table vary. For example, if the original text contains a phrase with several uncommon words in a row, they all wind up together in one fragment, because the text is only cut at the most common words. By contrast, Markov chaining is usually implemented with fragments of constant length, typically 2 or 3 words.
I'm hoping that this method might generate slightly more coherent text than standard Markov chaining, but it's not clear whether this would really be the case in practice.