Closed neilb closed 10 years ago
Have we talked about it? Did I not say this is something for the metacpan api?
Anyway, I have found code I wrote some months ago that (hopefully) did what you need:
    # Presumably $ua is an LWP::UserAgent and $jsonxs a JSON::XS object;
    # they, $dist and @hits are set up elsewhere.
    for my $status (qw(latest cpan)) {
        # Yes, we have registered this api usage at https://github.com/CPAN-API/cpan-api/wiki/API-Consumers
        my $resp = $ua->get("http://api.metacpan.org/v0/release/_search?q=distribution:$dist%20AND%20status:$status&fields=author,name,status,version,version_numified&size=25&sort=date:desc");
        # note: we need sort even for one hit so we can merge on that
        if ($resp->is_success) {
            my $content = $resp->decoded_content;
            my $mcpananswer = $jsonxs->decode($content);
            if (my $h = $mcpananswer->{hits}{hits}) {
                push @hits, @$h;
            } else {
                warn sprintf "Warning: result from metacpan api has no hits: %s\n", $content;
            }
        }
    }
Let me know if this is insufficient and what is still needed, given that the metacpan api is such a friendly companion :)
> Have we talked about it? Did I not say this is something for the metacpan api?
That must have been someone else.
> Anyway, I have found code I wrote some months ago that (hopefully) did what you need [deletia]
Thanks for that -- I'll have a play with that later and let you know how I get on.
Thanks for the nudge towards MetaCPAN (slaps self). Using the new client I've written some initial code to generate the index I want. Next:
So, looks like there are 75 dists where the latest release isn't on MetaCPAN. For example, Tim Bunce's Test::WriteVariants, which he worked on at the QAH. You can find it on search.cpan.org (released 22nd March), but it can't be found on MetaCPAN.
Some of the missing ones are very old releases, but a number (like the one above) are very recent. I'll raise the list with MetaCPAN and see if they can be fixed, but it would be an ongoing issue.
Olaf wants a script they can run automatically to spot these kinds of gaps. I've written a first pass, but it can't find missing developer releases, because there isn't a PAUSE index I can check against for developer releases :-)
So an index might be useful, but for slightly different reasons.
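A gap check of this kind boils down to a set difference between what PAUSE says should exist and what a service actually has. The following is only a minimal sketch: the function name `missing_releases` and the sample paths are made up for illustration, and fetching the two lists (from a PAUSE index and from the MetaCPAN API) is left out.

```perl
use strict;
use warnings;

# Return the release paths that are in the expected list (hypothetically
# taken from a PAUSE index) but not in the list a service reports.
sub missing_releases {
    my ($expected, $reported) = @_;
    my %have = map { $_ => 1 } @$reported;
    return [ sort grep { !$have{$_} } @$expected ];
}

# Made-up sample data, not real CPAN paths.
my @on_pause    = qw(T/TE/TESTER/Foo-1.00.tar.gz T/TE/TESTER/Bar-0.02.tar.gz);
my @on_metacpan = qw(T/TE/TESTER/Foo-1.00.tar.gz);

print "missing: $_\n" for @{ missing_releases(\@on_pause, \@on_metacpan) };
```

The sketch only works if the expected list includes developer releases, which is the piece that is missing today.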
I have lots of complicated feelings about new indices. I hope what I am about to write makes some sense:
$primary_key $hunk_of_JSON
has good performance with binary search and an XS JSON parser. It's also more robust against weird filenames or whatever that might have whitespace in them.
Specifically, to the proposal for an index of distributions, I think there is value to having -- at the very least -- a definitive list of the upload date of every distribution on PAUSE. Relying on BackPAN file timestamps and/or the "find . -ls" style dumps is tricky and brittle. I have mixed feelings for whether such an index should be of "distributions" or "all files on CPAN". I care mostly about the former.
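To make the `$primary_key $hunk_of_JSON` idea concrete, here is a rough sketch of a lookup over such lines. It assumes the lines are already sorted by key and held in memory; the `lookup` function, the keys and the JSON fields are invented for illustration, and the core JSON::PP module stands in for an XS parser.

```perl
use strict;
use warnings;
use JSON::PP;   # core module; an XS parser could be swapped in for speed

# Binary search over an array of "key<space>JSON" lines sorted by key.
sub lookup {
    my ($lines, $key) = @_;
    my ($lo, $hi) = (0, $#$lines);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        # Split on the first space only, so whitespace inside the JSON is harmless.
        my ($k, $json) = split / /, $lines->[$mid], 2;
        my $cmp = $key cmp $k;
        return decode_json($json) if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return;   # not found
}

# Made-up index lines, listed in sorted key order.
my @index = (
    'Bar-0.02 {"path":"T/TE/TESTER/Bar-0.02.tar.gz","size":2048}',
    'Foo-1.00 {"path":"T/TE/TESTER/Foo-1.00.tar.gz","size":1024}',
);

my $entry = lookup(\@index, 'Foo-1.00');
print "$entry->{path} ($entry->{size} bytes)\n" if $entry;
```

For an index that lives on disk, the core Search::Dict module can do the same kind of search by seeking in a sorted file rather than loading it.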
Things I'd want to know about a distribution:
• path
• upload timestamp
• file hashes (MD5, SHA-whatever)
• was it indexed or not?
I don't terribly care about size but don't object to that being included. Other than 'indexed' status, the other fields apply to any file on CPAN.
In particular, it would be really nice for some purposes (e.g. an index-data-server) to be able to binary search for hashes in a single file instead of loading individual CHECKSUMS files for each author. But on the other hand, one enormous file is very slow to mirror (but possibly not too bad to rsync).
Aside: it would be nice to have the equivalent of CHECKSUMS that doesn't require perl parsing.
I would want to know which packages a distribution contains -- what would have been indexed if it were indexed. Because this is a complex, many-many mapping problem and would lead to a rather large file, it may not need to be in the same index as the path/timestamp/indexed/size index.
If we're really thinking about the many-many problem and sticking with flat-files, we really want two of them. One keyed on distribution and one keyed on "package-version" to answer both "what was in distribution X?" and "what distributions provide package Y and version constraint Z?".
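As a rough illustration of the two-index idea, the second index can be derived by inverting the first. The data, the `invert` function and the "package-version" key format below are assumptions made for this sketch. It only handles exact versions; a real version constraint would need proper version comparison on top.

```perl
use strict;
use warnings;

# Index 1, keyed on distribution: "what was in distribution X?"
# Made-up sample data.
my %contents = (
    'Foo-1.00' => { 'Foo' => '1.00', 'Foo::Util' => '0.50' },
    'Foo-1.01' => { 'Foo' => '1.01', 'Foo::Util' => '0.50' },
);

# Index 2, keyed on "package-version":
# "what distributions provide package Y at version Z?"
sub invert {
    my ($by_dist) = @_;
    my %by_package;
    for my $dist (sort keys %$by_dist) {
        my $pkgs = $by_dist->{$dist};
        for my $pkg (keys %$pkgs) {
            my $key = $pkg . '-' . $pkgs->{$pkg};
            push @{ $by_package{$key} }, $dist;
        }
    }
    return \%by_package;
}

my $by_package = invert(\%contents);
print join(', ', @{ $by_package->{'Foo::Util-0.50'} }), "\n";
```

Written out as flat files, each hash would become one sorted file with one key per line.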
A lot of good thoughts to process there. I've got a wee bit sidetracked while generating the index and writing a script to find gaps. And wrote a script to find releases where the metadata dist name doesn't match what CPAN::DistnameInfo expects. And now I'm working through those 170 dists, sending bug reports / emails / pull requests as appropriate. Andreas: you saw the talk where I confessed to my yak-shaving tendencies :-)
Anyway, an initial thought occurred. The PAUSE / CPAN / CPAN services model is something along the lines of the following. I'm organising my thoughts as much as anything here.
Given that things occasionally seem to get dropped between mirrors, it would be useful for this type of CPAN service to have a file where PAUSE essentially says "ok, here's everything I think you should have". This would let services like MetaCPAN "fill in the gaps", when they miss things.
There are a few extra pieces of information that only PAUSE really knows at the moment, which it would be good to make available to other services, such as "failed permissions check (on Foo::Bar)".
And as Andreas suggested, higher-level things {are then,can then be} provided by CPAN services, and not by PAUSE.
So often what I really want is a nicely structured RDBMS with all CPAN release history in it, and a clean RESTful API to that, so I can slice and dice to my heart's content. Given MetaCPAN is built on Elasticsearch, there are certain things you just can't do with it.
David Golden notifications@github.com writes:
[...]
> • I like the idea of making more hard-to-get information available
+1
> • I wish we could step back and aggregate all the questions that are being asked and consider how well those map to various indices we have or are proposing
+1
I will not add a new index without giving you opportunity to veto and suggest alternatives.
> Specifically, to the proposal for an index of distributions, I think there is value to having -- at the very least -- a definitive list of the upload date of every distribution on PAUSE.
I'm reluctant on this, disk is cheap, run an rrr job, take the timestamp from your disk. Alternatively, ask metacpan API.
> Relying on BackPAN file timestamps and/or the "find . -ls" style dumps is tricky and brittle.
Backpan has no rrr support, so there is currently only metacpan API.
> I have mixed feelings for whether such an index should be of "distributions" or "all files on CPAN". I care mostly about the former.
> Things I'd want to know about a distribution:
> • path
> • upload timestamp
> • file hashes (MD5, SHA-whatever)
All three are covered with rrr. File hashes are not on the metacpan API afaik, but when you use rrr, you have the guarantees of the rsync program, which is damn good software.
> • was it indexed or not?
That is not a boolean question as you suggest, it would be a set of time intervals. Kind of "This package was indexed from date_0 to date_1, from date_2 to date_3, etc." You (and I) get this only since the 02packages file has been checked into git (Paris hackathon). Aside: if you try to work with this repo you will start wishing for a faster computer.
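The "set of time intervals" view can be sketched as follows, assuming dated snapshots of the index (for example, one per commit of the 02packages file) that each list the packages indexed at that point. The `indexed_intervals` function and the sample data are hypothetical, and nothing here reads the actual git repo.

```perl
use strict;
use warnings;

# Takes snapshots, oldest first, each as [ $date, \@indexed_packages ].
# Returns { package => [ [ $from, $to ], ... ] }; $to is undef while the
# package is still indexed in the last snapshot.
sub indexed_intervals {
    my (@snapshots) = @_;
    my (%intervals, %open);
    for my $snap (@snapshots) {
        my ($date, $packages) = @$snap;
        my %now = map { $_ => 1 } @$packages;
        # Close intervals for packages that dropped out of the index.
        for my $pkg (grep { !$now{$_} } keys %open) {
            push @{ $intervals{$pkg} }, [ delete($open{$pkg}), $date ];
        }
        # Open intervals for packages that newly appeared.
        $open{$_} //= $date for keys %now;
    }
    push @{ $intervals{$_} }, [ $open{$_}, undef ] for keys %open;
    return \%intervals;
}

# Made-up history: Foo is indexed, dropped, then indexed again.
my $history = indexed_intervals(
    [ '2014-01-01', ['Foo'] ],
    [ '2014-02-01', [] ],
    [ '2014-03-01', ['Foo'] ],
);
for my $iv (@{ $history->{Foo} }) {
    printf "Foo indexed from %s to %s\n", $iv->[0], $iv->[1] // 'now';
}
```

The interval boundaries are only as precise as the snapshot dates, so changes between two snapshots go unseen.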
> I don't terribly care about size but don't object to that being included. Other than 'indexed' status, the other fields apply to any file on CPAN.
> In particular, it would be really nice for some purposes (e.g. an index-data-server) to be able to binary search for hashes in a single file instead of loading individual CHECKSUMS files for each author. But on the other hand, one enormous file is very slow to mirror (but possibly not too bad to rsync).
> Aside: it would be nice to have the equivalent of CHECKSUMS that doesn't require perl parsing.
I suppose that a bit of hacking around https://github.com/demerphq/Data-Undump will do.
> I would want to know which packages a distribution contains -- what would have been indexed if it were indexed. Because this is a complex, many-many mapping problem and would lead to a rather large file, it may not need to be in the same index as the path/timestamp/indexed/size index.
> If we're really thinking about the many-many problem and sticking with flat-files, we really want two of them. One keyed on distribution and one keyed on "package-version" to answer both "what was in distribution X?" and "what distributions provide package Y and version constraint Z?".
PAUSE would not have the answer stored anywhere. I believe the metacpan api exposes more that would answer these questions than PAUSE does.
Neil Bowers notifications@github.com writes:
[...]
> Given that things occasionally seem to get dropped between mirrors, it would be useful for this type of CPAN service to have a file where PAUSE essentially says "ok, here's everything I think you should have". This would let services like MetaCPAN "fill in the gaps", when they miss things.
rrr is this thing. AKA File::Rsync::Mirror::Recent. Requests about what needs to be documented better are welcome:)
> There are a few extra pieces of information that only PAUSE really knows at the moment, which it would be good to make available to other services, such as "failed permissions check (on Foo::Bar)".
Yes, pause generates it, sends the information to the uploader and me and throws it away. If we want to get hold of it, we must start from scratch. It's never too late:)
> So often what I really want is a nicely structured RDBMS with all CPAN release history in it, and a clean RESTful API to that, so I can slice and dice to my heart's content. Given MetaCPAN is built on Elasticsearch, there are certain things you just can't do with it.
As for the RDBMS, you can always download from pause a recent mysql dump and load it into mysql:
-rw-r--r-- 1 root root  17894439 2014-04-07 19:47:05 moddump.current.bz2
-rw-r--r-- 1 root root 128936230 2014-04-07 19:47:05 moddump.current
https://pause.perl.org/pub/PAUSE/PAUSE-data/
This used to be an rsyncable file but is not any more. We could probably resurrect the rsyncability if there's the need.
andreas
We could certainly add file hashes to the MetaCPAN API. I could see the value in that.
On Mon, Apr 7, 2014 at 9:11 PM, andk notifications@github.com wrote:
> > Specifically, to the proposal for an index of distributions, I think there is value to having -- at the very least -- a definitive list of the upload date of every distribution on PAUSE.
> I'm reluctant on this, disk is cheap, run an rrr job, take the timestamp from your disk. Alternatively, ask metacpan API.
> > Relying on BackPAN file timestamps and/or the "find . -ls" style dumps is tricky and brittle.
> Backpan has no rrr support, so there is currently only metacpan API.
Well, Robert and Ask were running a find . -ls cron job on backpan for me. While disk is cheap, for any repeat analysis, walking the filesystem each time for timestamps is expensive. It seems like the right place to aggregate it once is on PAUSE -- and then people who want timestamps on stuff on backpan don't have to rsync all of backpan for the privilege. Yes, it could be done elsewhere, MetaCPAN or otherwise, but I'd like to have one authoritative source and have others serve that up in various ways.
> > • path
> > • upload timestamp
> > • file hashes (MD5, SHA-whatever)
> All three are covered with rrr. File hashes are not on the metacpan API afaik, but when you use rrr, you have the guarantees of the rsync program, which is damn good software.
rrr is not the easiest thing to interrogate, though.
> > • was it indexed or not?
> That is not a boolean question as you suggest, it would be a set of time intervals. Kind of "This package was indexed from date_0 to date_1, from date_2 to date_3, etc." You (and I) get this only since the 02packages file has been checked into git (Paris hackathon). Aside: if you try to work with this repo you will start wishing for a faster computer.
Right. Distributions are not indexed, packages are -- and you could have some packages indexed and some not. The summary boolean that I'd want is "was it considered a dev release" by any of the mechanisms that would prevent any package from being indexed.
The partial indexing problem can be added to the edge cases of the many-many mapping problem.
> > If we're really thinking about the many-many problem and sticking with flat-files, we really want two of them. One keyed on distribution and one keyed on "package-version" to answer both "what was in distribution X?" and "what distributions provide package Y and version constraint Z?".
> Pause would not have the answer stored anywhere. I believe, metacpan api exposes more that would answer these questions than pause.
Historically, no. Eventually, I'd like it to do so, but that's for a future QAH, I'm sure. :-)
David
David Golden xdg@xdg.me Twitter/IRC: @xdg
The first release of CPAN::Releases::Latest is on CPAN.
I'm going to close this ticket. I think there are some related issues that it would be good to discuss, but not here.
For a number of my projects (the particular one prompting this request is the CPAN Dashboard) I'd really like a PAUSE-generated index that contains the latest release of all dists on CPAN, and where the latest is a developer release, then also include the latest stable release.
This might be something like the following (path, epoch-based upload time, size)
Or perhaps should include a JSON fragment with package info:
More mulling can be seen in this gist.