PiRSquared17 / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

Suggested feature: More configurable logging w/ log4net #107

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
This is great - I've been toying around with it for a few hours and I'm really 
loving how fast it is to get up and running and to configure. 

1) I need to crawl a local copy of a site that has 5000+ pages, and I only 
really care about the non-200 responses (404, 500, etc). As far as I could 
find, Abot is not currently configurable in this way; it will "log 
everything". I'd love to see one configuration option named 
"LogInformationalMessages" (bool) and another named 
"LogAllNon200HttpStatusResponses" (bool). I've modified my copy of the source 
to add those and, when using log4net to write to SQL, it is far easier to 
view 100 rows (the 404/500/etc responses) than to write complex queries that 
sift through 5000+ rows each time.
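
The two switches described in item 1 could behave roughly as sketched below. This is an illustration in Python's standard logging module, not Abot or log4net code: the class CrawlLogFilter, the helper build_logger, the in-memory ListHandler and the http_status record attribute are all hypothetical names, and treating a 200 response as "informational" is an assumption about the intended semantics.

```python
import logging


class CrawlLogFilter(logging.Filter):
    """Hypothetical filter mirroring the two proposed boolean options."""

    def __init__(self, log_informational_messages=True,
                 log_all_non_200_http_status_responses=True):
        super().__init__()
        self.log_informational_messages = log_informational_messages
        self.log_all_non_200 = log_all_non_200_http_status_responses

    def filter(self, record):
        # Records carry the status via logger.info(..., extra={"http_status": n}).
        status = getattr(record, "http_status", None)
        if status is not None and status != 200:
            return self.log_all_non_200
        # Assumption: 200 responses and records without a status are "informational".
        return self.log_informational_messages


class ListHandler(logging.Handler):
    """Collects records in memory; stands in for a database appender."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def build_logger(name, crawl_filter):
    """Returns a logger and the handler that receives the filtered records."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = ListHandler()
    handler.addFilter(crawl_filter)
    logger.addHandler(handler)
    return logger, handler
```

With log_informational_messages=False, only records whose http_status is something other than 200 reach the handler, which is the "100 rows instead of 5000+" case described above.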

2) My other request would be to atomize the logging so we can have separate 
columns for each piece of information. Currently a 404 looks like this in the 
log:

Page crawl complete, Status:[404] 
Url:[http://localhost:1000/MyFolder/MyFile.aspx?MyQS=165&AnotherQS=2800] 
Parent:[http://localhost:1000/MyFolder/123/MyFile.aspx]

That's just very, very hard to parse in SQL. I'd prefer to have it with a SQL 
table structure that's more like this:

CREATE TABLE dbo.log4NetResult(
    LineId int NOT NULL IDENTITY(1,1),
    Date datetime NOT NULL,
    Thread varchar (255) NOT NULL,
    Level varchar (50) NOT NULL,
    Logger varchar (255) NOT NULL,
    Message varchar (4000) NOT NULL,
    Exception varchar (2000) NULL,
    FullUrl varchar (1024) NOT NULL,
    UriStem varchar (512) NOT NULL,
    UriQuery varchar (512) NULL,
    Port int NOT NULL,
    ParentPage varchar (1024) NULL,
    HttpStatus int NOT NULL,
    TimeTaken int NOT NULL,
    CONSTRAINT PK_log4NetResult PRIMARY KEY NONCLUSTERED (LineId) 
)
GO

This way I could easily query and say, "Show me all 404s in the past day" or 
"Show me all 404s that had a uri querystring that contains the value '123'". It 
would just be far easier.
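
To make the "atomized" idea concrete, here is a rough Python sketch that splits the log message shown above into separate columns and runs the second example query against them. It uses an in-memory SQLite table as a stand-in for SQL Server and keeps only the URL-related columns; the function names atomize, create_table and store are hypothetical, and the regular expression assumes the URLs themselves contain no "]" character.

```python
import re
import sqlite3
from urllib.parse import urlsplit

# Matches the "Status:[...] Url:[...] Parent:[...]" message format quoted above.
LOG_LINE = re.compile(
    r"Status:\[(?P<status>\d{3})\]\s+"
    r"Url:\[(?P<url>[^\]]*)\]\s+"
    r"Parent:\[(?P<parent>[^\]]*)\]"
)


def atomize(message):
    """Splits one log message into column values, or returns None if it does not match."""
    match = LOG_LINE.search(message)
    if match is None:
        return None
    parts = urlsplit(match["url"])
    return {
        "FullUrl": match["url"],
        "UriStem": parts.path,
        "UriQuery": parts.query or None,
        "Port": parts.port,  # None when the URL has no explicit port
        "ParentPage": match["parent"] or None,
        "HttpStatus": int(match["status"]),
    }


def create_table(conn):
    """Creates a reduced version of the proposed table (SQLite types, not T-SQL)."""
    conn.execute(
        """CREATE TABLE log4NetResult (
               LineId INTEGER PRIMARY KEY AUTOINCREMENT,
               FullUrl TEXT NOT NULL,
               UriStem TEXT NOT NULL,
               UriQuery TEXT,
               Port INTEGER,
               ParentPage TEXT,
               HttpStatus INTEGER NOT NULL)"""
    )


def store(conn, row):
    """Inserts one atomized row using named parameters."""
    conn.execute(
        """INSERT INTO log4NetResult
               (FullUrl, UriStem, UriQuery, Port, ParentPage, HttpStatus)
           VALUES (:FullUrl, :UriStem, :UriQuery, :Port, :ParentPage, :HttpStatus)""",
        row,
    )
```

Once the values sit in their own columns, a query such as SELECT FullUrl FROM log4NetResult WHERE HttpStatus = 404 AND UriQuery LIKE '%123%' no longer needs any string parsing.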

Hope this is helpful. Keep up the great work!

Original issue reported on code.google.com by scott.wh...@gmail.com on 29 May 2013 at 4:30

GoogleCodeExporter commented 9 years ago
Thanks for taking the time to give feedback!

*My understanding of your requirement...*
-Store non-200 http responses in the db

*Have you considered...*
-Subscribing to the PageCrawlCompleted event
-In the event handler, checking e.CrawlPage.HttpResponse.Status to see if
it's ok/200
-If it's non-200, inserting any data about the page that you want into your
db

It seems like you're trying to use the current logger (which should mostly be
used for debugging) with a db appender that parses the log data, when
it would be easier to just call your data access class directly from the
PageCrawlCompleted event handler.
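
The shape of that suggestion, sketched in Python rather than against the real Abot API: a handler receives the crawled page, checks the status, and calls a data access class directly, with no log parsing involved. Everything here is a hypothetical stand-in (CrawledPageInfo for the event data, FailedPageStore for "your data access class", the FailedPage table, and SQLite as the database), so treat it as an outline of the control flow only.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawledPageInfo:
    """Hypothetical stand-in for the data carried by the crawl-completed event."""
    url: str
    parent_url: Optional[str]
    http_status: int


class FailedPageStore:
    """Hypothetical data access class that writes straight to the database."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            """CREATE TABLE IF NOT EXISTS FailedPage (
                   Url TEXT NOT NULL,
                   ParentPage TEXT,
                   HttpStatus INTEGER NOT NULL)"""
        )

    def save(self, page):
        self.conn.execute(
            "INSERT INTO FailedPage (Url, ParentPage, HttpStatus) VALUES (?, ?, ?)",
            (page.url, page.parent_url, page.http_status),
        )


def make_page_crawl_completed_handler(store):
    """Builds the callback a crawler would invoke once per crawled page."""

    def on_page_crawl_completed(page):
        # Only non-200 responses are stored; successful pages are ignored.
        if page.http_status != 200:
            store.save(page)

    return on_page_crawl_completed
```

Because the handler already has the URL, parent and status as separate values, each one can go into its own column without the logger being involved at all.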

Original comment by sjdir...@gmail.com on 30 May 2013 at 3:18

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 30 May 2013 at 3:36

GoogleCodeExporter commented 9 years ago
Good info - thanks. Got it.

Original comment by scott.wh...@gmail.com on 30 May 2013 at 7:33

GoogleCodeExporter commented 9 years ago
As my suggestion seemed to address the user's use case, marking this as 
won't fix.

Original comment by sjdir...@gmail.com on 29 Jun 2013 at 7:35