GravityLabs / goose

Html Content / Article Extractor in Scala - open sourced from Gravity Labs
http://gravity.com
Apache License 2.0

nytimes.com extraction problems #1

Closed (mdorn closed this issue 13 years ago)

mdorn commented 13 years ago

I see that nytimes.com is on your list of sites that still need unit testing.

I've successfully installed Goose and have run the unit tests without trouble.

Here are two issues I found while trying to extract text from nytimes.com:

1) When I run the code as is on any nytimes.com article (try: http://www.nytimes.com/2010/12/20/opinion/20cohen.html), I get the following output:

INFO [main] (HtmlFetcher.java:203) - Initializing HttpClient
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
WARN [main] (HtmlFetcher.java:132) - Connection reset
INFO [main] (HtmlFetcher.java:159) - starting...
INFO [main] (HtmlFetcher.java:161) - HTMLRESULT is empty or null

I'm not a Java guy, but when I tried to do something similar in Python, I discovered that setting the User-agent header causes nytimes.com to respond with a 301 status and never reach a 200 status. If I comment out line 244 of HtmlFetcher.java in Goose so that the User-agent header is not set, and then run the code, it successfully gets a response with the article text. But then I see the 2nd problem:

2) the extracted content omits the first paragraph of the article.
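
The User-agent behavior described in item 1 can be checked outside Goose with a small script. The following is a minimal sketch using only the Python standard library, not the script used in this report: the names `NoRedirect` and `fetch_status` are hypothetical, and it assumes the server decides on the redirect from the User-agent header alone. It issues one GET without following redirects, so a 301 is reported rather than silently followed.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Hypothetical helper: keep urllib from following redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; urllib then raises
        # HTTPError for the 3xx response, which fetch_status catches.
        return None


def fetch_status(url, user_agent=None, timeout=10):
    """Return (status code, Location header) for a single GET request."""
    opener = urllib.request.build_opener(NoRedirect)
    # Remove urllib's default "Python-urllib/x.y" User-agent so that
    # user_agent=None really sends no User-agent header at all.
    opener.addheaders = []
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:
        # Unfollowed redirects and other non-2xx responses land here.
        return err.code, err.headers.get("Location")
```

Calling `fetch_status(url)` and then `fetch_status(url, user_agent="...")` on the same article URL shows whether the header alone changes the status code. What nytimes.com returns today may differ from what is reported here.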

jiminoc commented 13 years ago

Thanks for the issue submission, I'll take a look at it. NYTimes typically has a paywall that does some redirection and passes tokens around, so maybe there are some shenanigans there.

mdorn commented 13 years ago

OK great -- look forward to seeing what you come up with, but note that this also happens with articles that are publicly accessible without requiring username and password.

jiminoc commented 13 years ago

hi mdorn, so you were right on the line 244 thing, and that got me the HTML. However, the extraction looks OK on my box.

is this not what you're seeing?

INFO main - FINAL EXTRACTION TEXT: WHAT is the winter solstice, and why bother to celebrate it, as so many people around the world will tomorrow? The word “solstice” derives from the Latin sol (meaning sun) and statum (stand still), and reflects what we see on the first days of summer and winter when, at dawn for two or three days, the sun seems to linger for several minutes in its passage across the sky, before beginning to double back.

Indeed, “turnings of the sun” is an old phrase, used by both Hesiod and Homer. The novelist Alan Furst has one of his characters nicely observe, “the day the sun is said to pause. ... Pleasing, that idea. ... As though the universe stopped for a moment to reflect, took a day off from work. One could sense it, time slowing down.”

Virtually all cultures have their own way of acknowledging this moment. The Welsh word for solstice translates as “the point of roughness,” while the Talmud calls it “Tekufat Tevet,” first day of “the stripping time.” For the Chinese, winter’s beginning is “dongzhi,” when one tradition is making balls of glutinous rice, which symbolize family gathering. In Korea, these balls are mingled with a sweet red bean called pat jook. According to local lore, each winter solstice a ghost comes to haunt villagers. The red bean in the rice balls repels him.

In parts of Scandinavia, the locals smear their front doors with butter so that Beiwe, sun goddess of fertility, can lap it up before she continues on her journey. (One wonders who does all the mopping up afterward.) Later, young women don candle-embedded helmets, while families go to bed having placed their shoes all in a row, to ensure peace over the coming year.

Street processions are another common feature. In Japan, young men known as “sun devils,” their faces daubed to represent their imagined solar ancestry, still go among the farms to ensure the earth’s fertility (and their own stocking-up with alcohol). In Ireland, people called wren-boys take to the roads, wearing masks or straw suits. The practice used to involve the killing of a wren, and singing songs while carrying the corpse from house to house.

Sacrifice is a common thread. In areas of northern Pakistan, men have cold water poured over their heads in purification, and are forbidden to sit on any chair till the evening, when their heads will be sprinkled with goats’ blood. (Unhappy goats.) Purification is also the main object for the Zuni and Hopi tribes of North America, their attempt to recall the sun from its long winter slumber. It also marks the beginning of another turning of their “wheel of the year,” and kivas (sacred underground ritual chambers) are opened to mark the season.

Yet, for all these symbolisms, this time remains at heart an astronomical event, and quite a curious one. In summer, the sun is brighter and reaches higher into the sky, shortening the shadows that it casts; in winter it rises and sinks closer to the horizon, its light diffuses more and its shadows lengthen. As the winter hemisphere tilts steadily further away from the star, daylight becomes shorter and the sun arcs ever lower. Societies that were organized around agriculture intently studied the heavens, ensuring that the solstices were well charted.

Despite their best efforts, however, their priests and stargazers came to realize that it was exceptionally hard to pinpoint the moment of the sun’s turning by observation alone — even though they could define the successive seasons by the advancing and withdrawal of daylight and darkness.

The earth further complicates matters. Our globe tilts on its axis like a spinning top, going around the sun at an angle to its orbit of 23 and a half degrees. Yet the planet’s shape changes minutely and its axis wobbles, thus its orbit fluctuates. If its axis remained stable and if its orbit were a true circle, then the equinoxes and solstices would quarter the year into equal sections. As it is, the time between the spring and fall equinoxes in the Northern Hemisphere is slightly greater than that between fall and spring, the earth — being at that time closer to the sun — moving about 6 percent faster in January than in July.

mdorn commented 13 years ago

I am seeing that now, yes -- I was sure the first paragraph was missing from that article when I tried it yesterday, but now I can't be sure.

The extracted text for this URL, however, does have the first paragraph missing:

http://www.nytimes.com/2010/12/22/world/europe/22start.html

It begins with "The Senate voted 67 to 28 to end debate ..." instead of "WASHINGTON — An arms control treaty paring back American and Russian nuclear arsenals ..."

jiminoc commented 13 years ago

hey Matt, I got the fix done for this new article. I had to redo the algorithm I use to calculate sibling nodes. I just pushed the code, so try getting the latest and test that URL http://www.nytimes.com/2010/12/22/world/europe/22start.html again :)

thanks for taking the time to write the issue up.
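
The thread does not show the revised sibling-node algorithm, so the following is only an illustration of the general idea, not the code that was pushed. It assumes a stopword-count heuristic: once the best-scoring block of paragraphs is chosen, paragraphs in earlier sibling blocks are kept if they look enough like article text. The name `rescue_siblings`, the tiny stopword list, and the 0.3 ratio are all assumptions made for this sketch.

```python
# Illustrative stopword list; a real extractor would use a much larger one.
STOPWORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "is", "it",
     "that", "was", "for", "on", "with", "as", "by", "at"}
)


def stopword_count(text):
    """Count stopwords in a paragraph, ignoring surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return sum(1 for w in words if w in STOPWORDS)


def rescue_siblings(siblings, top_index, ratio=0.3):
    """Return paragraphs to keep, in document order.

    siblings  -- list of sibling blocks, each a list of paragraph strings
    top_index -- index of the block already chosen as the article body
    ratio     -- assumed fraction of the top block's average stopword
                 count that an earlier paragraph must exceed to be kept
    """
    top = siblings[top_index]
    counts = [stopword_count(p) for p in top]
    baseline = sum(counts) / len(counts) if counts else 0
    threshold = baseline * ratio
    rescued = []
    for block in siblings[:top_index]:
        # Keep earlier paragraphs that read like prose, drop navigation text.
        rescued.extend(p for p in block if stopword_count(p) > threshold)
    return rescued + list(top)
```

Under this heuristic, an opening paragraph that sits in its own block ahead of the main body is kept, while short navigation text with no stopwords is dropped.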