thethomaseffect closed this 10 years ago
Thanks for the detailed report! I'll check it out.
I have a fix for this almost ready. I should have it posted soon.
Awesome, thank you for the speedy response!
Give v0.4.0 a shot. It won't fix the Japanese characters, but it should fix the missing text.
New output:
$ curl -s "https://en.wikipedia.org/wiki/Now_and_Then,_Here_and_There" | unfluff | jq -r .text
Now and Then, Here and There (
) is a thirteen episode anime series directed by Akitaro Daichi and written by Hideyuki Kurata. The story was originally conceived by director Daichi. It premiered in Japan on the WOWOW television station on October 14, 1999 and ran until January 20, 2000. It was licensed for Region 1 DVD English language release by Central Park Media under the US Manga Corps banner. Following the 2009 bankruptcy and liquidation of Central Park Media, ADV Films picked up the series for a release on July 7, 2009. As of Sept. 1, 2009, the series is licensed by ADV's successor, AEsir Holdings, with distribution from Section23 Films.
Now and Then, Here and There follows a young boy named Shuzo "Shu" Matsutani who, in an attempt to save an unknown girl, is transported to another world which is possibly the Earth in the far future. The world is desolate and militarized, and water is a scarce commodity.
While walking home from a somewhat bad, but regular day of school, "Shu", the main protagonist, spots a girl on top of a smoke stack in an industrial park where he used to hang out as a young child. Shuzo tries numerous attempts to communicate with the young girl but she acts emotionless and quiet, and hardly acknowledges his presence. After decoding her name from her lips (Lala-Ru) the only other piece of information he finds out about her is her love of watching sunsets. There is a sudden explosion and time stops; Shu finds himself defending Lala-Ru from abductors in mechanized snakes. After attempting to defend the girl, he is caught in a transportation to the world from which the strangers hail, a wasteland devoid of water and dominated by a red giant star. Lala-Ru possesses a pendant containing a vast reservoir of water, and has the ability to control that water.
Shu is trapped in this new, harsh reality, and he is beaten and interrogated repeatedly inside the warship commanded by the ruthless, manic dictator, Hamdo. While locked in a cell he meets an abducted girl who introduces herself as Sara Ringwalt of America. Sara's reason for her capture was being mistaken for Lala-Ru by Hamdo's minions. Sara goes through extremely horrific experiences and eventually becomes emotionally scarred. After an assault by an unknown enemy landship, Shu is forced to join an army of child soldiers; children trained to for the looting of villages, in which they kidnap female villagers for breeding, and conscript orphaned male children into the ever dwindling ranks of Hamdo's army.
From the start, the series may seem lighthearted in nature, but this is far from the truth. Much of the series deals with serious moral issues relating to war, the consequences of war, slavery, and the exploitation of children.
Great, thanks! Is there anything you can do about the two newline characters added after the opening "("? I've worked around it easily enough by making one of my regexes match across multiple lines, so it's not a huge priority.
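For anyone hitting the same thing, here is a rough sketch of that kind of workaround in Python. The regex I actually use isn't shown in this thread, so the pattern, the `drop_empty_parens` name, and the sample string are all illustrative assumptions. It strips a parenthetical that contains nothing but whitespace and newlines, which is what is left once the Japanese title is dropped.

```python
import re

# Hypothetical pattern: optional spaces/tabs, "(", any whitespace
# (including newlines), then ")". \s matches newlines on its own,
# so no DOTALL or MULTILINE flag is needed.
_EMPTY_PARENS = re.compile(r"[ \t]*\(\s*\)")


def drop_empty_parens(text: str) -> str:
    """Remove parentheticals that contain only whitespace or newlines."""
    return _EMPTY_PARENS.sub("", text)


# Sample modelled on the v0.4.0 output above (two newlines after the "(").
sample = "Now and Then, Here and There (\n\n) is a thirteen episode anime series"
print(drop_empty_parens(sample))
# -> Now and Then, Here and There is a thirteen episode anime series
```

This only post-processes the text that unfluff returns; it does nothing about the missing Japanese title itself.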
I'm using unfluff as an easy way to grab the first few paragraphs of Wikipedia articles to describe media. When I print the text returned from https://en.wikipedia.org/wiki/Now_and_Then,_Here_and_There I get:
At the start where the actual article gives:
The problem is almost certainly with the 今 character. I understand that Asian text is known not to work very well. However, in this instance I'm losing a massive portion of English text. A simple fix for now would be to remove the offending character from the output or replace it with the Unicode replacement character (U+FFFD).
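To make the suggestion concrete, here is a minimal Python sketch of the replacement idea. It is not how unfluff works internally; the `replace_cjk` helper, the choice of Unicode ranges, and the sample string are my own assumptions, and it presumes the CJK characters really are what triggers the lost text.

```python
import re

# Assumed set of "offending" characters: Hiragana (U+3040-309F),
# Katakana (U+30A0-30FF) and CJK Unified Ideographs (U+4E00-9FFF).
_CJK = re.compile("[\u3040-\u30ff\u4e00-\u9fff]")

REPLACEMENT_CHAR = "\ufffd"  # Unicode replacement character


def replace_cjk(text: str, replacement: str = REPLACEMENT_CHAR) -> str:
    """Swap each matched CJK character for `replacement`.

    Pass replacement="" to remove the characters instead.
    """
    return _CJK.sub(replacement, text)


# Made-up sample containing the character named in this report.
sample = "Now and Then, Here and There (今) is a thirteen episode anime series"
print(replace_cjk(sample))
print(replace_cjk(sample, replacement=""))
```

Either variant keeps the surrounding English text intact, which is the part I care about here.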