Open schwern opened 10 years ago
This seems like a good idea. Presumably we wouldn't expect a BackPAN to generate all combinations of index type and recentness.
You could go some of this way by using the "ordered by timestamp BackPAN index", which is on backpan.cpantesters.org. When getting an update, you could seek to the last dist you added and then process from that point forward?
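For illustration, the "seek to the last dist and process forward" idea might look like the sketch below. It is in Python rather than Perl and is not BackPAN::Index code. It assumes each index line is `path mtime size` separated by whitespace and sorted by ascending timestamp; the function name `new_entries` is hypothetical.

```python
def new_entries(index_lines, last_known_path):
    """Return the (path, mtime, size) entries that follow last_known_path.

    Assumes index_lines is ordered by ascending timestamp and that each
    line has the (assumed) form "path mtime size". Returns None when the
    marker is never found, which a caller could treat as "do a full rebuild".
    """
    seen_marker = False
    fresh = []
    for line in index_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        path, mtime, size = parts[0], int(parts[1]), int(parts[2])
        if seen_marker:
            fresh.append((path, mtime, size))
        elif path == last_known_path:
            seen_marker = True
    return fresh if seen_marker else None
```

Note that this still reads the whole index from the start, which is exactly the downside raised in the next paragraph.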
The downside is that you're pulling the whole index down every time, even if you only need the last day or even the last few hours. But is this a temporary concern? I.e., is the average bandwidth available to users increasing much faster than the size of the index? Dunno, offhand.
In my experience, building the database has always been slow while bandwidth has been ever increasing. I was even able to work fine on a bus over a cell phone.
One additional performance improvement would be to work on it as a stream. This will benefit full index rebuilds on slow connections by downloading and building the database in parallel. When updating, the "order by date descending" index can be used and BackPAN::Index can stop downloading and processing once it has reached a known file.
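A rough sketch of the early-stop behaviour, again in Python and only as an illustration. It assumes a newest-first index with `path mtime size` lines; `update_from_descending` and `known_paths` are hypothetical names, and a real implementation would read from the HTTP response as it arrives.

```python
def update_from_descending(lines, known_paths):
    """Collect paths from a newest-first index until a known file appears.

    `lines` can be any iterator, for example one that yields lines while the
    download is still in progress. Once we break out of the loop nothing
    further is pulled from it, so the rest of the index is never read.
    """
    added = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        path = parts[0]
        if path in known_paths:
            break  # everything after this point is already in the database
        added.append(path)
    return added
```

Because the loop consumes the iterator lazily, the cost of an update scales with the number of new files rather than the size of the index.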
If I'm going to implement that, BackPAN::Index doesn't need the "recent" indices... though they wouldn't hurt.
I think the first step indeed would be to work on a full index file, simply weeding out unwanted files.
Rather than the database age, what about starting from the date of the youngest release in the database? The duplicates are already handled, and there are only a bunch of files actually older than the latest release in BackPAN.
So the work could be divided in two independent parts:
Yes, working from the youngest release is the intent.
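Working from the youngest release could be sketched like this, using Python's `sqlite3` purely for illustration. The `releases` table with `path` and `date` columns is an assumed schema, not necessarily the one BackPAN::Index uses, and `apply_since_youngest` is a hypothetical name.

```python
import sqlite3

def apply_since_youngest(conn, entries):
    """Insert index entries dated at or after the youngest stored release.

    `entries` is an iterable of (path, date) pairs. Entries sharing the
    cutoff date are re-offered on purpose; INSERT OR IGNORE drops the ones
    already present (this relies on `path` being a primary key).
    Returns how many entries passed the date filter.
    """
    cutoff = conn.execute(
        "SELECT COALESCE(MAX(date), 0) FROM releases"
    ).fetchone()[0]
    fresh = [(path, date) for path, date in entries if date >= cutoff]
    conn.executemany(
        "INSERT OR IGNORE INTO releases (path, date) VALUES (?, ?)", fresh
    )
    conn.commit()
    return len(fresh)
```

Using `>=` rather than `>` means a release with the same timestamp as the youngest stored one is not missed, at the price of a few ignored duplicate inserts.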
CC @neilbowers
The slowest part of BackPAN::Index is building the database. The whole thing has to be downloaded, read and rebuilt on every change.
If we had "recent" indexes like on CPAN, this could be done much faster.
BackPAN::Index::Create would be changed to...
BackPAN::Index would be changed to...
What do you think?