book / BackPAN-Index

Provide an index of BACKPAN
https://metacpan.org/release/BackPAN-Index

Update from "recent" indexes #25

Open schwern opened 10 years ago

schwern commented 10 years ago

CC @neilbowers

The slowest part of BackPAN::Index is building the database. The whole index has to be downloaded, read and rebuilt on every change.

If we had "recent" indexes like on CPAN, this could be done much faster.

BackPAN::Index::Create would be changed to...

  1. Build indexes which only go back to a certain date. An hour. A day. A week. A month.
  2. Put them on BackPAN mirrors like backpan-index-recent-1h.txt.gz
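The two steps above could be sketched roughly as follows. This is a minimal illustration in Python, not BackPAN::Index::Create's actual Perl code. It assumes index entries are `(path, mtime, size)` tuples written one per line as `path mtime size`; the window names other than `1h` and the helper names are hypothetical.

```python
import gzip

# Hypothetical window names and lengths in seconds; only "1h" appears in this thread.
WINDOWS = {"1h": 3600, "1d": 86400, "1w": 7 * 86400, "1m": 30 * 86400}

def recent_entries(entries, now, max_age):
    """Keep the (path, mtime, size) entries that are at most max_age seconds old."""
    cutoff = now - max_age
    return [entry for entry in entries if entry[1] >= cutoff]

def render_index(entries):
    """Serialize entries as gzip-compressed 'path mtime size' lines (assumed format)."""
    text = "".join(f"{path} {mtime} {size}\n" for path, mtime, size in entries)
    return gzip.compress(text.encode("utf-8"))

def build_recent_indexes(entries, now):
    """Return a {filename: gzipped bytes} mapping with one index per window."""
    return {
        f"backpan-index-recent-{name}.txt.gz": render_index(
            recent_entries(entries, now, max_age)
        )
        for name, max_age in WINDOWS.items()
    }
```

A mirror would then publish whichever of these files it chooses to generate next to the full index.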

BackPAN::Index would be changed to...

  1. When building the database, record the newest file time in a new table in the database as the age of the index. Do not use the index file's own timestamp, since that will not accurately reflect the age of the BackPAN mirror it was built from.
  2. Try to retrieve the appropriate "recent" index (i.e. if your database is 6 days old, get backpan-index-recent-1w.txt.gz).
  3. If not available, get the normal index file.
  4. Update from the file, ignoring any file in the index which is older than the database.
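Steps 2 to 4 might look something like this on the client side. It is a sketch in Python rather than the module's Perl, and it assumes the same `path mtime size` line format; the window list, the full index name `backpan-index.txt.gz` and both function names are placeholders, not existing BackPAN::Index API.

```python
# Hypothetical windows, smallest first; they must match whatever the mirror publishes.
RECENT_WINDOWS = [("1h", 3600), ("1d", 86400), ("1w", 7 * 86400), ("1m", 30 * 86400)]
FULL_INDEX = "backpan-index.txt.gz"  # placeholder name for the normal index file

def candidate_indexes(db_newest_mtime, now):
    """Return index names to try in order: smallest sufficient recent index, then the full one."""
    age = now - db_newest_mtime
    sufficient = [
        f"backpan-index-recent-{name}.txt.gz"
        for name, span in RECENT_WINDOWS
        if span >= age
    ]
    return sufficient[:1] + [FULL_INDEX]

def new_entries(lines, db_newest_mtime):
    """Parse 'path mtime size' lines, skipping entries older than the database."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        path, mtime, size = parts[0], int(parts[1]), int(parts[2])
        if mtime >= db_newest_mtime:
            yield path, mtime, size
```

With a database whose newest file is 6 days old, `candidate_indexes` returns the `1w` index first and the full index as the fallback. Entries with a timestamp equal to the database's newest file are kept on purpose, on the assumption that the database insert already tolerates duplicates.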

What do you think?

neilb commented 10 years ago

This seems like a good idea. Presumably we wouldn't expect a BackPAN to generate all combinations of index type and recentness.

You could go some of this way by using the "ordered by timestamp BackPAN index", which is on backpan.cpantesters.org. When getting an update, you could seek to the last dist you added and then process from that point forward?

The downside is that you're pulling the whole index down every time, even if you only need the last day or even the last few hours. But is this a temporary concern? I.e. is the average bandwidth available to users increasing much faster than the size of the index? Dunno, offhand.

schwern commented 10 years ago

In my experience, building the database has always been slow while bandwidth has been ever increasing. I was even able to work fine on a bus over a cell phone.

One additional performance improvement would be to process the index as a stream. This will benefit full index rebuilds on slow connections by downloading and building the database in parallel. When updating, the "ordered by date, descending" index can be used, and BackPAN::Index can stop downloading and processing once it has reached a known file.
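A rough sketch of the stop-at-known-file idea, in Python rather than the module's Perl. It assumes a gzip-compressed index of `path mtime size` lines sorted newest first; `update_from_stream` and `known_paths` are hypothetical names. Because the stream is decompressed lazily, breaking out of the loop means the rest of the index is never read.

```python
import gzip
import io

def update_from_stream(fileobj, known_paths):
    """Read a newest-first gzip index stream and stop at the first already-known path."""
    added = []
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            path, mtime, size = parts[0], int(parts[1]), int(parts[2])
            if path in known_paths:
                break  # everything after this point is assumed to be in the database
            added.append((path, mtime, size))
    return added

# Example with an in-memory index standing in for a network stream.
sample = io.BytesIO(gzip.compress(b"Foo-1.2.tar.gz 300 3\nFoo-1.1.tar.gz 200 2\nFoo-1.0.tar.gz 100 1\n"))
print(update_from_stream(sample, known_paths={"Foo-1.1.tar.gz", "Foo-1.0.tar.gz"}))  # only Foo-1.2 is new
```

Stopping at the first known path relies on the index really being in strict newest-first order; files sharing a timestamp with the known one could otherwise be missed.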

If I'm going to implement that, BackPAN::Index doesn't need the "recent" indices... though they wouldn't hurt.

book commented 10 years ago

I think the first step indeed would be to work on a full index file, simply weeding out unwanted files.

Rather than the database age, what about starting from the date of the youngest release in the database? The duplicates are already handled, and there are only a bunch of files actually older than the latest release in BackPAN.
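To make that concrete, here is a small sketch using Python's `sqlite3` instead of the module's own Perl. The `releases` table with `path`, `date` and `size` columns is an assumed, simplified schema for illustration, and the function names are hypothetical.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a simplified, assumed releases table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE releases (path TEXT PRIMARY KEY, date INTEGER, size INTEGER)")
    return conn

def youngest_release_date(conn):
    """Return the newest release timestamp, or None when the table is empty."""
    return conn.execute("SELECT MAX(date) FROM releases").fetchone()[0]

def add_releases(conn, entries):
    """Insert (path, date, size) entries no older than the youngest known release."""
    start = youngest_release_date(conn)
    fresh = [e for e in entries if start is None or e[1] >= start]
    # INSERT OR IGNORE lets the primary key absorb duplicates at the boundary.
    conn.executemany(
        "INSERT OR IGNORE INTO releases (path, date, size) VALUES (?, ?, ?)", fresh
    )
    conn.commit()
```

The starting point comes straight from the data already stored, so no separate "index age" table is needed in this variant.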

So the work could be divided into two independent parts:

schwern commented 10 years ago

Yes, working from the youngest release is the intent.