eddiejaoude / http-archive-crawler

Powered by 'HTTP Archive' & 'Web Page Test'

Connect to HTTP Archive #5

Open eddiejaoude opened 11 years ago

eddiejaoude commented 11 years ago

Send crawled URLs to HTTP Archive

TheOpsMgr commented 11 years ago

(1) Save the crawled URLs into a uniquely named .txt file (or a file with some other unique name)

(2) Submit the file to HTTP Archive by invoking batch_start.php

php batch_start <importflag 0 or 1>

A correct example would be:

php batch_start /var/www/httparchive/httpdocs/bulktest/run.txt IE8 0 BBC-Run1

(3) for the label use the same unique ID as the file name

(4) Correct response should be "DONE submitting batch run"

(5) caveat #1 - this is "single threaded" in the sense that currently only one batch can run at once - if you call batch_start when there is already a batch running you clobber the previous batch it seems... there is supposed to be a check for this but it doesn't appear to be working...

(6) caveat #2 - nothing actually happens until you call batch_process.php, and you have to keep calling it (every x minutes) until the batch finishes. We need to set up cron jobs for this.

(7) caveat #3 - even after the batch has finished processing you then need to call updatestats.php for the data to be calculated and added to the database.
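Steps (1) to (4) above can be sketched as a small shell helper. This is only a sketch under assumptions: the argument order (URL file, WPT location ID, import flag, label) is inferred from the example command in this comment, and the function names batch_label and submit_batch are hypothetical, not part of HTTP Archive.

```shell
#!/usr/bin/env bash
# Sketch only. The argument order is inferred from the example command
# in this thread; it is not taken from HTTP Archive documentation.

# Hypothetical helper: derive the batch label from the URL file name,
# e.g. /path/to/BBC-Run1.txt -> BBC-Run1 (step 3).
batch_label() {
    local name
    name=$(basename -- "$1")
    printf '%s\n' "${name%.txt}"
}

# Hypothetical helper: submit a URL file and check the response (step 4).
submit_batch() {
    local url_file=$1
    local label output
    label=$(batch_label "$url_file")
    # "IE8" and import flag 0 follow the advice given in this thread.
    output=$(php batch_start "$url_file" IE8 0 "$label" 2>&1) || return 1
    case $output in
        *"DONE submitting batch run"*) return 0 ;;
        *) printf 'unexpected response: %s\n' "$output" >&2; return 1 ;;
    esac
}
```

Calling submit_batch /path/to/BBC-Run1.txt would then submit the file with the label BBC-Run1 and return non-zero if the expected response is missing. Caveats (5) to (7) still apply afterwards.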

TheOpsMgr commented 11 years ago

just to clarify the syntax -

php batch_start <import url flag 0|1>

The WPT Location ID basically needs to be IE8 upper-case, even though it will be invoking IE9 (this is a bug I think but we'd need to check all the HTTPArchive code to ensure that any location ID is correctly parameterised)

import save flag - use 0 for now

sub-label - needs to be unique or nothing happens!!!
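Because a reused sub-label silently does nothing, the crawler side could keep its own record of labels already submitted. The following is a minimal sketch, assuming a plain-text registry file; the function name claim_label and the default file name used_labels.txt are hypothetical and not part of HTTP Archive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: record each label in a plain-text registry and
# refuse a label that has been used before.
claim_label() {
    local label=$1
    local registry=${2:-used_labels.txt}
    touch "$registry"
    # -x: match the whole line, -F: fixed string, -q: no output
    if grep -qxF -- "$label" "$registry"; then
        printf 'label already used: %s\n' "$label" >&2
        return 1
    fi
    printf '%s\n' "$label" >> "$registry"
}
```

A caller would run claim_label BBC-Run1 before batch_start and pick a different label if it returns non-zero.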

TheOpsMgr commented 11 years ago

https://code.google.com/p/httparchive/source/browse/trunk/bulktest/README.txt

The description of included files:

bootstrap.inc: Configure the environment of execution
batch_lib: The collection of all the functions needed by batch testing
batch_start: Start a new batch testing
batch_process: Perform all the tasks of a batch testing

How to make the batch running?

a) run "php batch_start" to kick off a new batch testing. It will detect whether there is a batch testing running in the system. If there is, it will kill it. It will read the input URL file, create the MySQL tables if necessary and the corresponding records. It will also print a summary of the previous batch testing before starting a new batch.

b) run "php batch_process" repeatedly to perform a single batch testing. In each run, the script forks some subprocesses, each of which is in charge of the tests in a specified status and tries to move all the tests in this status to the next step. Upon completion of a run, a summary of the batch will be printed. This script also guarantees that there is no other instance running when it starts. If there is, it exits.

To automate the whole periodic batch testing, you could schedule batch_process.php to run hourly in cron - if there's nothing to do it just exits. batch_start.php could be triggered manually or scheduled in cron to run every 2 weeks or whatever the interval for testing would be.
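The hourly scheduling described above could look roughly like the sketch below. The wrapper function run_batch_process, the variables BULKTEST_DIR and LOG_FILE, the log path, and the crontab line in the comments are all assumptions for illustration; only the bulktest directory and the batch_process.php name come from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical cron wrapper, e.g. saved as /usr/local/lib/httparchive_cron.sh.
# Example crontab entry (hourly; the path is an assumption):
#   0 * * * * . /usr/local/lib/httparchive_cron.sh && run_batch_process

BULKTEST_DIR=${BULKTEST_DIR:-/var/www/httparchive/httpdocs/bulktest}
LOG_FILE=${LOG_FILE:-/tmp/batch_process.log}

# Run one batch_process pass from the bulktest directory and append its
# output to the log. The subshell keeps the caller's working directory.
run_batch_process() {
    (
        cd "$BULKTEST_DIR" || exit 1
        {
            printf '== %s ==\n' "$(date)"
            php batch_process.php
        } >> "$LOG_FILE" 2>&1
    )
}
```

Since batch_process exits on its own when another instance is running or there is nothing to do, the wrapper does no locking of its own.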

eddiejaoude commented 11 years ago

caveat #1 - this is "single threaded" in the sense that currently only one batch can run at once - if you call batch_start when there is already a batch running you clobber the previous batch it seems... there is supposed to be a check for this but it doesn't appear to be working...

Is this still the case? Only one batch at a time? This is worse than 'single threaded' if running another independent thread causes issues with the first.

TheOpsMgr commented 11 years ago

Yes, as it stands right now only one batch can run at once unless we write a patch for HTTPArchive.

It’s on Souders’ “to do” list…

Basically there is a table which contains all the information about the running batch and I don’t think that has a key unique to that batch so there is no way to distinguish batch 1 from batch 2 (alternatively you could create a unique temp table per batch).

Ditto when you calculate the stats – I am not sure if it just processes “all the results that are there” as opposed to “everything for a run”.
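Until HTTP Archive is patched, the crawler side could at least avoid clobbering a running batch by refusing to start a second one. This is a sketch of one possible workaround, not something HTTP Archive provides: the lock directory path and the function names acquire_batch_slot and release_batch_slot are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical guard: allow only one batch at a time on the crawler side.
LOCK_DIR=${LOCK_DIR:-/tmp/httparchive-batch.lock}

# Claim the single batch slot for the given label, or fail if it is taken.
acquire_batch_slot() {
    # mkdir is atomic: it fails if the directory already exists.
    if mkdir "$LOCK_DIR" 2>/dev/null; then
        printf '%s\n' "$1" > "$LOCK_DIR/label"
        return 0
    fi
    printf 'batch already running: %s\n' "$(cat "$LOCK_DIR/label" 2>/dev/null)" >&2
    return 1
}

# Free the slot; call this only after the stats step has finished.
release_batch_slot() {
    rm -rf -- "$LOCK_DIR"
}
```

The idea is to call acquire_batch_slot before batch_start and release_batch_slot only after updatestats.php has run, so a second crawl waits instead of overwriting the first.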