Learning works by reading the tweets of the parent account and adding them to the corpus text file. Learning will happen whenever the command line tool is run, assuming that a parent account has been set in the config file. Learning can be disabled with the `--no-learn` option.
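Roughly, the relevant part of the config looks like this sketch (only the `corpus` and `parent` keys come from this thread; the exact file layout may differ):

```yaml
# Sketch of the relevant config keys; exact layout may differ.
corpus: corpus.txt        # text file that learned tweets are appended to
parent: your_screen_name  # account to read tweets from on each run
```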
@fitnr Thanks! A couple of questions about learning, then:
1) How deep does it "go" into the parent account every time the command is run? Like, does it grab the last 10 tweets? The last 20?
2) Also, I reckon specifying the bot's own account as the "parent" would cause it to suck its own tweets into the corpus, resulting in a slow degradation of quality (with a possible increase of fun). Would that be correct?
One option is `twurl`. You could do a daily cron task like `twurl '/1.1/statuses/user_timeline.json?count=N' | jq -r '.[].text' >> corpus.txt`, where `N` is the number of tweets the markov account makes a day, and `jq` is a command line JSON parser.
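In crontab form, that would be something like this (20 is just an example value for `N`):

```sh
# Daily at midnight: append the parent account's recent tweets to the corpus.
# 20 is an example value for N; adjust to your bot's daily tweet count.
0 0 * * * twurl '/1.1/statuses/user_timeline.json?count=20' | jq -r '.[].text' >> corpus.txt
```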
@fitnr Thanks!
Two more questions (hopefully not very dumb :) )
1) When a config has >1 corpus specified, does it (from the point of view of the Markov chain formation process) merge them into a single one, as if it were one file?
2) Does the order of lines in the file practically affect the way the bot treats them? (As in, would changing the order in which lines appear in the corpus(es) affect the probability of it coming up with a particular phrase, given the same state size?)
If `corpus` is a list, the bot will read from all of the files listed.

@Reapette Correction: you can specify multiple texts, which will create multiple models, and you can choose to create texts from any of them (perhaps randomly). To create a combined corpus, just create a file that combines all the texts.
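To illustrate the first point, a config like this sketch (file names made up) builds two separate models, one per file:

```yaml
# Sketch: a corpus list creates one model per file, not one merged model.
corpus:
  - main.txt
  - additions.txt
```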
@fitnr Thanks for explaining.
So, specifying two texts instead of one will not cause it to create one model for one big corpus that is "text1.txt + text2.txt", but will create two models for two separate corpuses?
If that's so, a "treat all text files as one giant corpus, create one big model" mode would be a great enhancement.
It would make it possible to create a bot with one huge "main" corpus that is updated incrementally by adding material to a small separate file, which is IMHO more manageable than appending to an already huge file.
I'm not sure that keeping track of many text files is clearly easier than one text file. And since there are many, many ways to create one file on the fly, it doesn't seem like a pressing need. Why not just have a daily or hourly cron job that `cat`s all the source files into a `mega-corpus.txt`? You can read from that with `--no-learn`, then run the script to learn (but not tweet) into the smaller file.
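That job could be as simple as this sketch (paths made up):

```sh
# Hourly: rebuild the combined corpus from all source files.
0 * * * * cat /path/to/sources/*.txt > /path/to/mega-corpus.txt
```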
One change that I've already added to HEAD is to allow the bot to read from any file-like object, so using the Python API would let you write a script that reads from an arbitrary source.
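For instance, here's a rough sketch of that idea using markovify (the chain library twitter_markov builds on) directly; the file names are illustrative, and this isn't the bot's exact API:

```python
# Sketch: build one model from several sources via a file-like object.
# Uses markovify directly; file names here are illustrative.
import io

import markovify

parts = []
for path in ('main.txt', 'additions.txt'):
    with open(path, encoding='utf-8') as f:
        parts.append(f.read())

# Any file-like object works as a source, e.g. an in-memory StringIO.
combined = io.StringIO('\n'.join(parts))

model = markovify.Text(combined.read(), state_size=2)
# make_short_sentence may return None if no valid sentence is found.
print(model.make_short_sentence(140))
```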
Hi! Two questions:
1) When I want to manually update the corpus, will plain old adding lines of text work fine "out of the box", or do I need to run some re-learn/update process?
2) There is a feature that lets the bot keep learning from another Twitter account.
The documentation states:

```yaml
# If you want your bot to continue to learn, include this
parent: your_screen_name
```
How does the learning "work"? Do tweets from that account get gradually added to the txt file specified as the corpus in the config, or do they live someplace else? How many of the parent's tweets are added per cron run?