Open MichaelPaulukonis opened 8 years ago
Both are in-progress, trying to get towards something useful from code I hadn't touched between April and yesterday. I had quite forgotten what on Earth I had done.
The tagspewer README has notes on expanding abbreviations - they're screwing up the tagging-as-templates-and-lexicon. Which is not what pos-tagging is for, so, that's my trouble.
I'm also wondering if the way I'm tagging the text is problematic -- I think I'm reading it line by line. But since pos-tagging relies on sentence context - lines will often be sentence fragments. I should look into that.
While POS tagging can be done from individual words (as many words are unambiguously only one part of speech), it's obviously much more accurate with a sentence to work from (because there are also an awful lot of words that are ambiguous). Which is, I think, why NLTK by default treats line ends in plain text as whitespace, and looks for sentences rather than structure. I believe, from my recent poking around, that under the hood it grabs paragraphs (separated by a blank line) and then breaks those down into sentences.
Yeah, it was something that didn't occur to me until I was writing those notes, above.
I'm generally processing Gutenberg texts, or whatever. But they've got the pre-formatted line-breaks, because of the old assumption that nobody will ever be able to build a piece of software to break-lines on the fly. Or something.
I just checked, and I'm reading line-by-line, and pos-tagging each line. So I'll have to de-line-break them. Somebody probably has that wheel invented, somewhere...
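A minimal sketch of the de-line-breaking step, assuming paragraphs are separated by blank lines (as in most Gutenberg plain-text files). `unwrapParagraphs` and `splitSentences` are hypothetical helper names, not part of tagspewer, and the sentence splitter is deliberately naive:

```javascript
// Hypothetical helper: rejoin hard-wrapped lines into one string per paragraph.
// Assumes one or more blank lines separate paragraphs.
function unwrapParagraphs(text) {
  return text
    .replace(/\r\n/g, '\n')            // normalize Windows line endings
    .split(/\n\s*\n/)                  // blank lines separate paragraphs
    .map(function (para) {
      return para
        .split('\n')
        .map(function (line) { return line.trim(); })
        .filter(function (line) { return line.length > 0; })
        .join(' ');                    // a line break inside a paragraph is just whitespace
    })
    .filter(function (para) { return para.length > 0; });
}

// Naive sentence splitter: breaks after . ! or ? when followed by whitespace
// and then an uppercase letter or a quote mark. Abbreviations such as
// "Mr. Smith" will be split wrongly; a real tokenizer handles those cases.
function splitSentences(paragraph) {
  return paragraph
    .replace(/([.!?])\s+(?=["'A-Z])/g, '$1\n')
    .split('\n');
}
```

Each sentence, rather than each physical line, would then go to the pos-tagger.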
Another example - the chance encounter of Neuromancer and Moby Dick.
Line-breaks are removed, hyphens-over-line-breaks are removed (always, uh...), and some contractions are expanded. Capitalization is wonky, and possessives and mid-sentence punctuation are bizarre, as are numbers and chapter headings, etc.
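For illustration, the hyphen and contraction cleanups could look roughly like this. This is a sketch under assumptions, not the code actually used: the function names are hypothetical, the contraction table is a tiny sample, and only straight apostrophes are handled.

```javascript
// Tiny illustrative sample; a real table would be much longer.
// Note that "it's" is ambiguous ("it is" / "it has"); this sketch always picks "it is".
const CONTRACTIONS = {
  "can't": 'cannot',
  "won't": 'will not',
  "don't": 'do not',
  "i'm": 'I am',
  "it's": 'it is'
};

// Join words split across a line break: "har-\npooner" -> "harpooner".
// The hyphen is always dropped, so a genuinely hyphenated word that happens
// to wrap ("sea-\ncaptain") wrongly becomes "seacaptain".
function dehyphenate(text) {
  return text.replace(/(\w)-\n\s*(\w)/g, '$1$2');
}

// Expand only the contractions listed in the table; possessives such as
// "Ahab's" are not in the table and are left untouched.
function expandContractions(text) {
  return text.replace(/\b[A-Za-z]+'[a-z]+\b/g, function (match) {
    const expanded = CONTRACTIONS[match.toLowerCase()];
    if (!expanded) { return match; }
    // Preserve an initial capital: "Don't" -> "Do not".
    return match[0] === match[0].toUpperCase()
      ? expanded[0].toUpperCase() + expanded.slice(1)
      : expanded;
  });
}
```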
Moby Dick converted to a pos-tag template, with pos-tag replacement from Neuromancer.
And Neuromancer, converted to a pos-tag template, with pos-tag replacement from Moby Dick.
thus:
Call me Ishmael
appears as Stop we Yonderboy.
And The sky above the port was the color of television, tuned to a dead channel.
becomes THE prostration in This spears was some Conversation of two, published to a true Ahab.
NOTE: unless the tag-bag is monochromatic, this is a stochastic process, so the above represents one possible example only.
IT IS WHAT IT IS
https://gist.github.com/MichaelPaulukonis/fe24abad01b3bf80f3a8
Punctuation is crap. The tagspewer needs some work, and tests. Plus the new sentence-tokenizer I added solves some problems, but re-opens some old wounds.
@hugovk - let's call it a month. Even though there's 23 minutes to go....
So, while I am less than impressed with my own output this year (in contrast, I was delighted with my progress last year, even if it still fell short of expectations), this project has come a long ways, and has some ways to go. I think I'll even be implementing some text-cleanups I've been envisioning for about 4 or 5 years, now.
A curious sub-project would be to recreate the famous opening sentence of Neuromancer multiple times with a given tag-bag lexicon per book, then on to the next tag-bag lexicon.
The sky above the port was the color of television, tuned to a dead channel.
=>
A sound that some magnitude flew the bomb on spasmodic, humped to a vast round.
THE prostration in This spears was some Conversation of two, published to a true Ahab.
As long as the sentences stay under 140 chars, that sounds like a bot-project....
Tagspewer is now public on npm: https://www.npmjs.com/package/tagspewer
The Neuromancer idea is in-progress as portskybot. Not complicated, but I was waiting on getting certain aspects of tagspewer working, and published.
Oh! I could really have used Tagspewer three years ago.
After considerable delay and a couple of intermediate projects, portskybot is live @ https://twitter.com/portskybot
Repo: https://github.com/MichaelPaulukonis/portskybot
var template = 'DT NN IN DT NN VBD DT NN IN NN , VBN TO DT JJ NN .';
=>
A machine-made of the explosion said a answer of child, laminated to the overhead eight.
The code of the screen rolled the Sense with splinter, known to the black sunlight.
The suit behind the hand was a ship of iron, hunted to the much flight.
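Roughly how a template string like the one above could be filled: each tag is swapped for a random word from that tag's bag, and punctuation tags pass through as themselves. This is a hedged sketch, not the tagspewer or portskybot API; `fillTemplate`, the `rng` parameter, and the toy lexicon are assumptions for illustration.

```javascript
// Hypothetical sketch. `lexicon` maps a pos-tag to an array of candidate
// words; `rng` returns a number in [0, 1) and defaults to Math.random.
function fillTemplate(template, lexicon, rng) {
  rng = rng || Math.random;
  const words = template.split(/\s+/).map(function (tag) {
    const bag = lexicon[tag];
    // Tags with no bag (here "," and ".") stand for themselves.
    if (!bag || bag.length === 0) { return tag; }
    return bag[Math.floor(rng() * bag.length)];
  });
  // Join with spaces, then pull punctuation back against the preceding word.
  const sentence = words.join(' ').replace(/\s+([,.;:!?])/g, '$1');
  return sentence.charAt(0).toUpperCase() + sentence.slice(1);
}

var template = 'DT NN IN DT NN VBD DT NN IN NN , VBN TO DT JJ NN .';

// Toy lexicon built by hand from the original sentence's own words.
var lexicon = {
  DT: ['the', 'a'],
  NN: ['sky', 'port', 'color', 'television', 'channel'],
  IN: ['above', 'of'],
  VBD: ['was'],
  VBN: ['tuned'],
  TO: ['to'],
  JJ: ['dead']
};

console.log(fillTemplate(template, lexicon));
```

Passing the random source in as a parameter keeps the function testable; with a real lexicon the output varies on every call.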
It looks like you have some spurious capitalization preserved. Is this to support proper nouns?
On Tue, Apr 5, 2016 at 9:39 AM Michael Paulukonis notifications@github.com wrote:
After considerable delay and a couple of intermediate projects, portskybot is live @ https://twitter.com/portskybot
Repo: https://github.com/MichaelPaulukonis/portskybot
var template = 'DT NN IN DT NN VBD DT NN IN NN , VBN TO DT JJ NN .';
=>
A machine-made of the explosion said a answer of child, laminated to the overhead eight. https://twitter.com/portskybot/status/717259800632037376
The code of the screen rolled the Sense with splinter, known to the black sunlight. https://twitter.com/portskybot/status/717204561560276992
The suit behind the hand was a ship of iron, hunted to the much flight. https://twitter.com/portskybot/status/717174330078208002
The technology up the solitaire was the throat up immortality, swiveled to some colorless corporation. https://twitter.com/portskybot/status/717340339464482818
— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/dariusk/NaNoGenMo-2015/issues/169#issuecomment-205809136
Or other parts that began a sentence. I haven't done any extra cleanup on that. Been thinking about it, and maybe fixing a/an issues. I would also like to be able to white-list known multi-part names, or other known entities, but... that's a larger issue.
But I went 2 months without a commit, so I decided to go live with what I had, and then think about further tweaks.
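One possible shape for that cleanup, as a sketch only (none of this exists in tagspewer, per the comment above): lowercase everything after the first word unless it is on a whitelist, then patch a/an from the next word's first letter. `tidySentence` and `keepCapitalized` are hypothetical names, and the whitelist only covers single words, not multi-part names.

```javascript
// Hypothetical post-processing pass. `keepCapitalized` is an assumed
// whitelist of single-word proper nouns that should keep their capitals.
function tidySentence(sentence, keepCapitalized) {
  const keep = new Set(keepCapitalized || []);
  const words = sentence.split(' ').map(function (word, i) {
    const bare = word.replace(/[^A-Za-z']/g, '');   // strip attached punctuation
    if (i === 0 || keep.has(bare) || bare === 'I') { return word; }
    return word.toLowerCase();
  });
  let result = words.join(' ');
  // Crude a/an repair based on spelling, not sound, so it gets
  // "an hour" and "a unicorn" wrong.
  result = result.replace(/\b(a|an)\s+([A-Za-z])/gi, function (m, article, next) {
    const fixed = /[aeiou]/i.test(next) ? 'an' : 'a';
    const cased = article[0] === 'A'
      ? fixed[0].toUpperCase() + fixed.slice(1)
      : fixed;
    return cased + ' ' + next;
  });
  return result.charAt(0).toUpperCase() + result.slice(1);
}
```

For example, `tidySentence('Stop we Yonderboy.', ['Yonderboy'])` leaves the name alone, while without the whitelist it would be lowercased.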
Are you planning to attend demo night tomorrow?
Don't think I can make it; it's been a rough month.
Pos-tagging replacement.
There's a couple of steps that I want to automate a bit more -- it was a development of Kazemi's spewer.
Basically, it takes a text, does pos-tagging, then creates a "lexicon" - a lookup table of tags => words. It also takes a text, does pos-tagging, and creates a template. Run the application on the template and the lexicon, and get a new text.
Results are funky for contractions and a number of other things, but it can be interesting. Sentence structure is maintained, and rough parts-of-speech should be the same, only replaced with other words that are the same p-o-s.
In theory.
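The steps above can be sketched as follows. This is an illustrative outline rather than the actual tagspewer code: the function names are hypothetical, and the `[word, tag]` pairs are hand-assigned stand-ins for the output of a real pos-tagger, which is assumed rather than implemented here.

```javascript
// Lexicon: a lookup table of tag => words, built from one tagged text.
function buildLexicon(tagged) {
  const lexicon = {};
  tagged.forEach(function (pair) {
    const word = pair[0], tag = pair[1];
    if (!lexicon[tag]) { lexicon[tag] = []; }
    lexicon[tag].push(word);   // duplicates kept, so frequent words are drawn more often
  });
  return lexicon;
}

// Template: the same kind of tagged text with the words thrown away.
function buildTemplate(tagged) {
  return tagged.map(function (pair) { return pair[1]; }).join(' ');
}

// Replace each tag in the template with a random word from that tag's bag.
// Tags missing from the lexicon are left in place as-is.
function spew(template, lexicon, rng) {
  rng = rng || Math.random;
  return template.split(' ').map(function (tag) {
    const bag = lexicon[tag];
    return bag ? bag[Math.floor(rng() * bag.length)] : tag;
  }).join(' ');
}

// Hand-tagged toy inputs (the tags are assumptions, not real tagger output).
const mobyDick = [['Call', 'VB'], ['me', 'PRP'], ['Ishmael', 'NNP']];
const neuromancer = [['Stop', 'VB'], ['we', 'PRP'], ['Yonderboy', 'NNP']];

// Each bag holds exactly one word, so this always prints "Stop we Yonderboy".
console.log(spew(buildTemplate(mobyDick), buildLexicon(neuromancer)));
```

With a full book as the lexicon source, each bag holds many words and the output differs on every run.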
I was playing with this back in April, but haven't touched it since then, so need to refresh myself.
I am also thinking of running the output through a mis-speller for some additional mis-direction (but that may be gilding an ugly lily).
This extract replaces the pos-tags of The Purple Cloud with the words from the pos-bank of Ginsberg's Howl.
and mis-spelled: