yes. all of this. agreed.
Let's just scrape @goth600's account
Hey @Bradley-Butcher, this is what I was thinking:
Whatcha think?
Sounds good. Wondering if some larger-scale pretraining on just bangers would be advantageous before the smaller-scale synthetic not-banger → banger conversion.
Concerned about the potential bias the GPT un-banging would inject, but it might not be an issue.
Could sidestep the synthetic not-banger bias issue by having an intermediate banger format, maybe yaml, something like:
topic: LK-99, UFOs
edgy-magnitude: 8
mention: sam altman
banger: true
... other stuff
Use an existing LM to extract the YAMLs, constraining generation according to the schema.
Then train a schema -> tweet LM.
cba to design the schema properly, idk what should be in it
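Roughly what I mean, as a sketch only (field names are placeholders and the extraction call is hand-waved, since the real schema is still TBD):

```python
# Rough sketch of the intermediate "banger schema" idea.
# Field names are placeholders -- the real schema is still TBD.
from dataclasses import dataclass, asdict
from typing import Optional
import yaml  # pip install pyyaml


@dataclass
class BangerSchema:
    topic: str
    edgy_magnitude: int          # say 1-10
    mention: Optional[str]
    banger: bool


def to_training_pair(schema: BangerSchema, tweet: str) -> dict:
    """One (schema -> tweet) example for the schema-conditioned LM."""
    return {
        "prompt": yaml.safe_dump(asdict(schema), sort_keys=False),
        "completion": tweet,
    }


# The extraction step (existing LM, generation constrained to the schema)
# is hand-waved here; assume some extract_schema(tweet) -> BangerSchema exists.
example = BangerSchema(topic="LK-99, UFOs", edgy_magnitude=8,
                       mention="sam altman", banger=True)
print(to_training_pair(example, "sam altman knows the superconductors are real"))
```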
This website appears to contain tweet information: https://nitter.net/
This Java application appears to grab tweet information from Nitter: https://github.com/yamin8000/TwitterScrapper
Does not appear to get following/followers; what kind of data do we need to collect? @codethazine
i think the other obvious thing to think about is defining a 'banger' in the first place.
something like (log likes) / (log followers) = banger index
if a small account gets a huge amount of likes, it has to be a banger; if a huge account gets a small amount of likes, it's trash
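For concreteness, a minimal sketch of that index, assuming natural log with a small floor so tiny accounts don't blow it up:

```python
import math


def banger_index(likes: int, followers: int) -> float:
    """(log likes) / (log followers): higher = more banger per follower."""
    # +1 and the floor of 2 keep both logs positive for tiny/new accounts.
    return math.log(likes + 1) / math.log(max(followers, 2))


print(banger_index(likes=40_000, followers=50))    # ~2.7  (small account, huge likes)
print(banger_index(likes=200, followers=500_000))  # ~0.4  (huge account, few likes)
```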
we need a twitter API obviously, i don't think scraping or using a 3rd party will help. any thoughts?
Fair point, this is something we've discussed in #programming on the discord. This is the current state of the twitter API: https://twitter.com/codethazine/status/1687966569127682049?s=46 Problem is, the API isn't working currently.
Fair point on the definition of a 'banger', codethazine mentioned it last night. Think it should be determined by the likes/followers ratio.
I think Nitter can be a useful source for scraping (uses the twitter backend, javascript-free, no account required) if the API continues not to work. We'll probably have to write our own scraper to capture the number of likes on tweets though, as it appears the scraper I linked previously doesn't record likes.
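If we do roll our own, roughly this shape with requests + BeautifulSoup; note the CSS class names below are guesses at Nitter's markup and would need checking against the live pages:

```python
# Rough Nitter scraping sketch (requests + BeautifulSoup).
# The CSS classes below are guesses at Nitter's markup -- verify against the
# real HTML, and expect rate limits from public instances.
import requests
from bs4 import BeautifulSoup


def scrape_user_tweets(username: str, instance: str = "https://nitter.net"):
    html = requests.get(f"{instance}/{username}", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    tweets = []
    for item in soup.select(".timeline-item"):           # assumed class name
        content = item.select_one(".tweet-content")      # assumed class name
        stats = item.select(".tweet-stat")                # assumed class name
        if content is None:
            continue
        tweets.append({
            "text": content.get_text(" ", strip=True),
            "stats_raw": [s.get_text(strip=True) for s in stats],  # includes likes
        })
    return tweets
```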
I think the current point we're at is that there isn't an easy way to get a list of accounts a user follows, so we will probably need to curate a list of accounts to scrape for bangers.
if we go with (log likes) / (log followers) we'll soon realize some likes are better than others.
also we'll only get bangers from one-hit banger accounts (can't beat log 40k / log 50) and probably risk overfitting due to sparse data.
picking the highest banger-density accounts fixes the latter. But if we filter only on that, the same 40k followers giving 20k likes per tweet to "Be what you want to attract" accounts are the only thing our LM will be and attract.
So I'd just hand-pick accounts on the rise, already followed by all the cool kids, then apply Martin's formula.
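To make the banger-density point concrete, a toy sketch that ranks accounts by their median banger index rather than their single best tweet (numbers made up):

```python
import math
from statistics import median


def banger_density(tweet_stats):
    """Median (log likes / log followers) over recent tweets,
    so a single viral hit doesn't carry the whole account."""
    return median(math.log(l + 1) / math.log(max(f, 2)) for l, f in tweet_stats)


# Made-up (likes, followers) per recent tweet.
accounts = {
    "one_hit_wonder": [(40_000, 50)] + [(3, 50)] * 20,
    "consistent_poster": [(5_000, 8_000)] * 21,
}
print(sorted(accounts, key=lambda a: banger_density(accounts[a]), reverse=True))
# -> ['consistent_poster', 'one_hit_wonder']
```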
Ok, so I curated a small group of 42 banger accounts on PR #20 - would love some feedback/integrations on that. I'll then proceed to:

* get_num_followers.py
* get_last_100_tweets.py, cycling through the users on data/banger_accounts.csv and dumping them on data/last_100_tweets_from_bangerers.csv
* filter_bangers.py, dumping them on final_bangers.csv

If the number of banger_accounts stays at 42, I'm assuming we'll get around 1000 tweets on final_bangers.csv. More than that could be difficult, considering the 10K Twitter API limit on the basic plan. So banger data coming soon!
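Rough sketch of what filter_bangers.py could look like; the column names and the cutoff here are assumptions, not what's actually in the PR:

```python
# Sketch of filter_bangers.py. The column names (username, text, likes,
# followers) and the 0.9 cutoff are assumptions, not what's actually in the PR.
import math
import pandas as pd

THRESHOLD = 0.9  # banger-index cutoff, to be tuned

tweets = pd.read_csv("data/last_100_tweets_from_bangerers.csv")  # username, text, likes
accounts = pd.read_csv("data/banger_accounts.csv")               # username, followers

df = tweets.merge(accounts, on="username")
df["banger_index"] = (
    (df["likes"] + 1).apply(math.log) / df["followers"].clip(lower=2).apply(math.log)
)
df[df["banger_index"] >= THRESHOLD].to_csv("final_bangers.csv", index=False)
```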
Also, came across this paper detailing LoRA vs full fine-tune. Considering our relatively small dataset, it might be best to go for a full fine-tune approach.
i've secured an API key for this. can we make a twitter group chat? please DM me, and then maybe we can migrate that to discord.
Some finetuning results on OAI Curie: https://github.com/effectiveaccelerationism/text-to-banger/pull/24#issuecomment-1676015783
I'm pretty delighted with the results actually, whatcha think?
How can we scrape bangers for LM training? What sort of criteria are we going for?
My take would be an initial like/follower ratio filter, then topic filtering, with some sentiment analysis along the edgy / meme / up-to-date-commentary Pareto front.
Anyone got preemo twitter API access? Let's get Elon on board.
Prob easier to train a CLM to avoid having to obtain (not-banger, banger) pairs.
If we're training a CLM, we should probably also include negative examples, with special <banger> and <not_banger> prefixes to avoid LLM bs. Can use my tweets for the negative bangers.
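Something like this for building the prefixed training text (the exact control tokens are placeholders):

```python
# Building CLM training text with control prefixes.
# The exact tokens (<banger> / <not_banger>) are placeholders.
def format_example(text: str, is_banger: bool) -> str:
    prefix = "<banger>" if is_banger else "<not_banger>"
    return f"{prefix} {text}"


examples = [
    ("the GPU shortage ends the day someone ships a room-temp superconductor", True),
    ("good morning twitter dot com", False),  # negative example
]
with open("clm_train.txt", "w") as f:
    for text, is_banger in examples:
        f.write(format_example(text, is_banger) + "\n")
# At generation time, prompt with "<banger> " to steer toward bangers.
```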
Should probably engineer some semi-online LoRA setup to ensure the LM keeps up with live bangers.