altsoph opened this issue 4 years ago
Also, here is my related NanoNaNoGenMo submission: https://twitter.com/altsoph/status/1200815956420890626
Thanks for making this available; I was just reading through your code and your results, and it looks really cool. I don't know if you have the time, but it would be incredible if you could include a quick how-to guide for getting it up and running to experiment with a new corpus. I'm more than willing to help with this, because I think it would be really educational for a lot of people. Thanks again.
Thanks for your interest! I'm not sure whether you're asking about the NaNoGenMo entry or the NanoNaNoGenMo one. The former is more or less described here: https://github.com/altsoph/paranoid_transformer/blob/master/README.md and the latter here: https://medium.com/altsoph/123-bytes-perl-markov-chain-b80e1212f3b3. Feel free to ask any questions :)
Sorry for joining so late, but I still believe it's worth trying in this year's NaNoGenMo :) This month I tried to build a paranoiac-critical system based on two neural networks, the Paranoid Transformer.
The first network is a paranoiac-intrusive Generator; the second one, the Critic, works as a filtering subsystem that selects the best passages from the generated text flow.
Let me share some details:
Generator subsystem
The first network, the paranoiac-intrusive subsystem AKA Generator, uses the OpenAI GPT architecture and the implementation from huggingface. I took a publicly available model already pre-trained on the huge fiction BooksCorpus dataset of roughly 10K books and about 1B words.
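For anyone who wants to poke at this, loading that pre-trained model with the current huggingface `transformers` API looks roughly like the sketch below; the original code predates this API, so treat it as an approximation:

```python
# A minimal sketch: load the OpenAI GPT model pre-trained on BooksCorpus
# via the huggingface `transformers` library (an approximation of the
# original setup, which used an earlier huggingface package).
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
```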
Next, I finetuned it on several additional handcrafted text corpora (altogether ~50 MB of text):
During the finetuning phase, I used special labels to tell the model which type of text it was reading:
At last, in generation mode, I kindly asked the model to generate some text.
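A rough sketch of that sampling step, again with the current `transformers` API; the prompt, the corpus-type label, and the decoding hyperparameters below are illustrative, not the exact ones from the project:

```python
import torch

# Illustrative only: prefix the prompt with one of the corpus-type labels
# used during finetuning (the label token shown here is hypothetical).
prompt = "<DIARY> I am afraid that"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=120,   # illustrative decoding settings
        do_sample=True,
        top_k=50,
        temperature=1.0,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```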
Critic subsystem
The next big task was to filter the real gems out of this endless text flow.
At first, I made a script with some simple heuristic filters such as:
Applying this script cut the text flow into a sequence of valid chunks.
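The concrete rules live in the repository script; purely as an illustration of this kind of heuristic, a filter might look like:

```python
# Hypothetical examples of simple heuristic filters; the actual rules
# and thresholds live in the repository script.
def looks_valid(chunk: str) -> bool:
    words = chunk.split()
    if not 3 <= len(words) <= 60:          # drop chunks that are too short/long
        return False
    if chunk.count('"') % 2 != 0:          # drop chunks with unbalanced quotes
        return False
    if any(ch.isdigit() for ch in chunk):  # drop chunks containing digits
        return False
    return True

raw_flow = "the walls are breathing again.\npage 42 of 1337."
chunks = [c for c in raw_flow.split("\n") if looks_valid(c)]
print(chunks)  # ['the walls are breathing again.']
```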
Next, I manually labeled such chunks with two classes, GOOD/BAD. I took approximately 1K chunks, balanced between the classes (one half of them GOOD, the other half BAD).
At last, I trained the Critic subsystem. This neural network uses the BERT architecture, again as implemented by huggingface. Once more I took a publicly available pre-trained model and finetuned it on my labeled 1K-chunk dataset to predict the label of any given chunk.
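A minimal sketch of that finetuning step with the current huggingface API; the dataset handling and hyperparameters here are simplified placeholders:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

critic_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
critic = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # 0 = BAD, 1 = GOOD
optimizer = torch.optim.AdamW(critic.parameters(), lr=2e-5)

# Placeholder for the hand-labeled ~1K chunk dataset.
labeled_chunks = [("the night has a thousand eyes", 1),
                  ("uh uh uh uh uh uh uh", 0)]

critic.train()
for epoch in range(3):                   # epoch count is illustrative
    for text, label in labeled_chunks:
        inputs = critic_tokenizer(text, return_tensors="pt", truncation=True)
        loss = critic(**inputs, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```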
Finally, I made a pipeline that includes the Generator subsystem, the heuristic filters, and the Critic subsystem. Here is a short sample of the final results:
The huge blob of generated text can be found here: https://github.com/altsoph/paranoid_transformer/blob/master/NaNoGenMo_50K_words_sample.txt
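For completeness, chaining the three stages amounts to something like the sketch below, reusing the names from the sketches above; `generate_chunks` is a hypothetical stand-in for the Generator plus chunking step:

```python
import torch

def critic_says_good(chunk: str) -> bool:
    """Keep a chunk if the finetuned Critic assigns it class 1 (GOOD)."""
    inputs = critic_tokenizer(chunk, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = critic(**inputs).logits
    return logits.argmax(dim=-1).item() == 1

# Generator -> heuristic filters -> Critic, as described above.
# `generate_chunks()` is a hypothetical helper wrapping the sampling step.
critic.eval()
kept = [c for c in generate_chunks() if looks_valid(c) and critic_says_good(c)]
```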