Closed GoogleCodeExporter closed 8 years ago
[deleted comment]
Finished adding:
http://groups.google.com/group/alt.politics.usa/topics?start=0&sa=N
Original comment by tokyotech
on 12 Apr 2009 at 10:40
Puneet asked for Twitter, so I'm in the process of adding
http://twitter.com/SarahPalin . But I'm not sure how your crawler will handle
Twitter: it has no threads, no replies, and "next page of threads" is AJAXed.
Which regexes should go where so your crawler doesn't crash?
Original comment by tokyotech
on 13 Apr 2009 at 5:25
The crawler won't crash. You just need to make sure that "threadURL" matches
the URL of the start page, because the site is pretty much just a page of
posts. You can add null regexes for anything that doesn't exist. You'll need
to test to make sure, but I think that's all you need to keep in mind.
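
Roughly what that could look like, as a minimal Python sketch and not the
crawler's actual config format. Only "threadURL" is a name from this thread;
the other keys, the helper functions, and the "entry-content" markup are
assumptions made up for illustration.

```python
import re

# Hypothetical site entry. Only "threadURL" is named in this thread;
# every other key and pattern here is an assumed placeholder.
TWITTER_SITE = {
    "threadURL": r"^http://twitter\.com/SarahPalin/?$",  # matches the start page
    "postRegex": r'<span class="entry-content">(.*?)</span>',  # assumed markup
    "replyRegex": None,     # Twitter has no replies
    "nextPageRegex": None,  # "next page" is AJAXed, so there is nothing to match
}


def compile_or_none(pattern):
    """Compile a regex, passing None through for fields that do not exist."""
    return re.compile(pattern, re.DOTALL) if pattern is not None else None


def is_thread_url(site, url):
    """Return True when the URL should be treated as a thread page."""
    regex = compile_or_none(site["threadURL"])
    return regex is not None and regex.search(url) is not None


def find_all(site, key, html):
    """Return all matches for a field, or an empty list for a null regex."""
    regex = compile_or_none(site.get(key))
    return regex.findall(html) if regex is not None else []
```

The point of the sketch is that a null regex just yields no matches instead
of raising an error, so missing features like replies are skipped quietly.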
Original comment by andrewps...@gmail.com
on 13 Apr 2009 at 9:23
Puneet wanted Yahoo Groups, but to read a group, you need to be a member of that
group. So I guess we can't crawl Yahoo Groups, right?
Original comment by tokyotech
on 20 Apr 2009 at 5:54
Added Twitter. All that's left is Yahoo Groups... (read last comment).
Original comment by tokyotech
on 21 Apr 2009 at 11:26
Decided not to do Yahoo Groups. All the lively groups are member-read-only.
Added Gizmodo instead. It's weird how the replies don't show up in the HTML
source, so I'm only scraping for the first post right now.
Original comment by tokyotech
on 24 Apr 2009 at 3:07
This is the site I said Puneet had mentioned:
http://www.mail-archive.com/flexcoders@yahoogroups.com/
Original comment by andrewps...@gmail.com
on 24 Apr 2009 at 5:01
Seems like this weird reply structure won't work, since the replies have to be
on the same page as the first post, right?
Original comment by tokyotech
on 24 Apr 2009 at 5:18
Yes, you're right. Our regular expressions don't support this.
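
To illustrate the limitation, here is a small Python sketch. It assumes the
regexes are applied to the HTML of a single fetched page; the patterns, the
markup, and the scrape_thread helper are all hypothetical, not the crawler's
real code.

```python
import re

# Assumed patterns for illustration only; not the crawler's real regexes.
FIRST_POST = re.compile(r'<div class="post">(.*?)</div>', re.DOTALL)
REPLY = re.compile(r'<div class="reply">(.*?)</div>', re.DOTALL)


def scrape_thread(html):
    """Extract the first post and any replies from one page of HTML."""
    first = FIRST_POST.search(html)
    return {
        "first_post": first.group(1) if first else None,
        "replies": REPLY.findall(html),
    }


# Replies on the same page as the first post are found.
same_page = (
    '<div class="post">Q</div>'
    '<div class="reply">A1</div><div class="reply">A2</div>'
)

# Replies that live on separate pages only appear as links here,
# so the reply regex has nothing to match.
split_page = '<div class="post">Q</div><a href="/msg2.html">Re: Q</a>'
```

With same_page the replies come back alongside the first post; with
split_page only the first post is found, which is the problem with the
mail-archive layout.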
Original comment by andrewps...@gmail.com
on 24 Apr 2009 at 4:03
Fuck Yahoo. They can't get anything right.
Original comment by tokyotech
on 24 Apr 2009 at 7:52
Original issue reported on code.google.com by tokyotech
on 12 Apr 2009 at 6:55