commercialhaskell / stackage

Stable Haskell package sets: vetted consistent packages from Hackage
https://www.stackage.org/
MIT License

snapshot pages take a long time to fully populate when initially published #5352

Open AMDmi3 opened 4 years ago

AMDmi3 commented 4 years ago

I regularly fetch https://www.stackage.org/nightly for the Repology project to track Haskell package versions in Stackage, and it turns out that the page is sometimes generated incomplete, as can be seen on this graph:

https://repology.org/graph/repo/stackage_nighly/projects_total.svg

juhp commented 4 years ago

Thank you for reporting this. I know that for a while there were some stability problems behind Cloudflare; I'm not sure whether that has improved on the Stackage server. Maybe @snoyberg or others can comment. (Otherwise retrying might be the best workaround for now.)

Repology is a very cool project, thank you

AMDmi3 commented 4 years ago

It seems to me that the issue is in file generation rather than distribution, because the XML parser doesn't fail on my side, which means the file is not truncated: it is valid, complete HTML with an incomplete set of entries. That is why retrying doesn't help, as parsing always succeeds.

juhp commented 4 years ago

Btw would you be better off downloading the json from https://github.com/commercialhaskell/stackage-snapshots/tree/master/nightly/ ? That repo is actually the source of truth for Stackage snapshot package versions.

Anyway looking at the data more carefully I finally understood what you are seeing:

So we don't guarantee there is a snapshot every single day (night); it is best effort. :-) I.e. on some days no new snapshot is generated, due to broken dependencies etc.

So for example with https://www.stackage.org/nightly-2020-05-09 : there was no snapshot on 2020-05-09! That page could really be a 404 or something. So on those days it would be better just to fall back to the previous snapshot.

Hope that helps.

AMDmi3 commented 4 years ago

> Btw would you be better off downloading the json

~Hmm, this makes sense, will rewrite the parser for it.~

Nah, it's not suitable, as it takes too much time to download, both as a repository and as a snapshot. It would be nice if there were symlinks pointing to the latest nightly and lts snapshots, as long as GitHub follows them and allows downloading the latest snapshot as a blob this way.

Or I could stay with parsing HTML, but it would be nice if HTML was never generated for an incomplete snapshot.

juhp commented 4 years ago

I believe https://www.stackage.org/nightly and https://www.stackage.org/lts both redirect to the latest nightly and lts snapshots.

AMDmi3 commented 4 years ago

That's what we began with: if snapshot creation fails, as you say, these are generated with an incomplete set of packages (instead of staying at the previous snapshot data).

juhp commented 4 years ago

AFAIK that is not true, https://www.stackage.org/nightly should always redirect to the latest available nightly snapshot. If you can show HTTP output to the contrary we will surely look into it more.

AMDmi3 commented 4 years ago

I've shown the link to graph in the first comment which shows fluctuation of number of packages. To demonstrate the problem we'll have to wait for it to repeat itself.

juhp commented 4 years ago

Sure we can leave this open until you are able to provide more detailed information, thanks

AMDmi3 commented 4 years ago

Here it is: right now, https://www.stackage.org/nightly contains 1298 packages. Before, there were 2483 packages: https://www.stackage.org/nightly-2020-05-16

juhp commented 4 years ago

Hmm I currently see https://www.stackage.org/nightly-2020-05-19 with 2494 packages. www.stackage.org is a Yesod webapp - I believe the pages are generated by the server direct from stackage snapshot data in a db.

AMDmi3 commented 4 years ago

This is what I see:

% curl -si https://www.stackage.org/nightly-2020-05-19 | head -42
HTTP/2 200 
date: Tue, 19 May 2020 16:43:43 GMT
content-type: text/html; charset=utf-8
set-cookie: __cfduid=d17963728f958904b0df209be13ec0e1c1589906623; expires=Thu, 18-Jun-20 16:43:43 GMT; path=/; domain=.stackage.org; HttpOnly; SameSite=Lax; Secure
vary: Accept, Accept-Language
x-xss-protection: 1; mode=block
cache-control: public, max-age=43200
strict-transport-security: max-age=15724800; includeSubDomains
cf-cache-status: HIT
age: 35216
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 595f474ecac64989-DME
cf-request-id: 02cf6ae53f000049895a820200000001

<!doctype html><!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en"> <![endif]--><!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en"> <![endif]--><!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en"> <!--<![endif]--><head><meta charset="UTF-8"><link href="//fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"><link rel="search" type="application/opensearchdescription+xml" title="Stackage.org hoogle" href="/static/opensearchhoogle.xml"> </link><link rel="search" type="application/opensearchdescription+xml" title="Stackage.org package page" href="/static/opensearchpackage.xml"> </link><title>Stackage Nightly 2020-05-19 (ghc-8.8.3) :: Stackage Server</title><meta name="og:site_name" content="Stackage"><meta name="twitter:card" content="summary"><meta name="og:title" content="Stackage Nightly 2020-05-19 (ghc-8.8.3)"><meta name="viewport" content="width=device-width,initial-scale=1"><link href="https://www.stackage.org/feed" type="application/atom+xml" rel="alternate" title="Recent Stackage snapshots">
<link rel="stylesheet" href="https://www.stackage.org/static/combined/d9jEluDQ.css"><style>form.hoogle{margin-bottom:20px}form.hoogle .search{width:25em}form.hoogle input{margin-bottom:0}.exact-lookup{display:inline-block;margin-left:1em}h1{font-size:30px !important;margin-bottom:0}h1 + p{margin-top:0}h2{color:#555 !important}.date{font-size:15px;line-height:15px}hr{border:1px solid #ddd
}.separator{width:1px;height:0.5em;background:#aaa;display:inline-block;margin:0 0.5em}.accordion-group{border:0}.accordion-group a.accordion-toggle:hover{text-decoration:none;background:#f5f5f5;border-radius:0.5em}.accordion-group .accordion-toggle code{margin:0}.accordion-group .accordion-toggle{cursor:default;margin-left:-0.5em;padding-left:0.5em;color:#555}.accordion-group .accordion-toggle code{font-size:inherit;border:inherit}.accordion-group .accordion-toggle .number{border-radius:1em;line-height:1.5em;background:#0981c3;color:#fff;display:inline-block;padding:0 0.5em;margin-right:0.5em;text-shadow:none}h3{color:#666 !important;font-weight:normal
}h3 > small{color:#666
}p + ul{margin-top:1em}.bottom-links{margin-top:0.5em;border-top:1px solid #ddd;padding-top:0.5em}.packages > .table td,.packages > .table th{padding-left:0}.keyword{color:#366354 }.url{color:#06537d }.stack-resolver-yaml{font-size:1.3em;font-weight:600}.cabal{font-size:0.9em}html{position:relative;min-height:100%}body{background:#f0f0f0;font-family:'Lato', sans-serif;text-shadow:1px 1px 1px #ffffff;margin-bottom:4em;padding-bottom:2em}code,pre{color:#555;font-family:"ubuntu mono", monospace}.brand > img{height:20px}.navbar-inverse{margin-bottom:1em}.navbar-inverse .navbar-inner{background:#0981c3}.navbar-inverse .navbar-inner .btn-navbar{background:#0981c3}.navbar-inverse .navbar-inner *{color:#fff !important
}.navbar-inverse .nav .active>a,.navbar-inverse .nav .active>a:hover,.navbar-inverse .nav .active>a:focus{background:#0981c3 !important}.navbar-inner{border-color:#06537d !important}.footer{text-shadow:none;background:#0981c3;border-top:1px solid #ddd;color:#fff;position:absolute;bottom:0;left:0;width:100%;height:4em;line-height:2em;text-align:center}.footer a{color:#fff;font-weight:bold}.footer .span12{padding:0px 15px 0 0;line-height:4em}.alert{margin-top:1em}h1,h2,h3,h4,h5{color:#06537d
}.content{font-size:16px;line-height:30px}</style><!--[if lt IE 9]><script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]--><script>document.documentElement.className = document.documentElement.className.replace(/\bno-js\b/,'js');</script></head><body><div id="main" role="main"><div class="navbar navbar-inverse navbar-static-top"><div class="navbar-inner"><div class="container"><button class="btn btn-navbar" type="button" data-toggle="collapse" data-target=".nav-collapse"><span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="brand" href="/"><img src="/static/img/stackage.png" title="FP Complete">
</a>
<div class="nav-collapse collapse"><ul class="nav"><li><a href="https://www.stackage.org/snapshots">Snapshots</a>
</li>
<li><a href="https://www.stackage.org/blog">Blog</a>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="container"><div class="container content"><h1>Stackage Nightly 2020-05-19 (ghc-8.8.3)</h1><p>Published on 2020-05-19<span class="separator"></span><span><a href="https://www.stackage.org/diff/nightly-2020-05-16/nightly-2020-05-19">View changes</a></span><span class="separator"></span><span>stack <code>resolver: nightly-2020-05-19</code></span></p><h3>Setup guide</h3><p>Edit your stack.yaml and set the following:</p><p class="stack-resolver-yaml">resolver: nightly-2020-05-19</p><p>You can also use <code>stack --resolver nightly-2020-05-19</code> on the command line</p><p><b>New to stack?</b> Check out <a href="http://docs.haskellstack.org">the stack homepage</a></p><h3>Hoogle</h3><form class="hoogle" action="https://www.stackage.org/nightly-2020-05-19/hoogle"><input class="search" type="search" autofocus name="q" value="" placeholder="Hoogle Search Phrase">
<input class="btn" type="submit" value="Search">
<label class="checkbox exact-lookup" for="exact" title="Only find identifiers matching your search term precisely"><input type="checkbox" name="exact" id="exact">
Exact lookup</label>
</form>
<h3>Packages (1298)</h3><p><a href="https://www.stackage.org/nightly-2020-05-19/docs">View documentation by modules</a></p></div><div class="container content"><div class="packages"><table class="table"><thead><th>Package</th><th>Synopsis</th></thead><tbody><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/abstract-deque-0.3">abstract-deque-0.3</a></td><td>Abstract, parameterized interface to mutable Deques</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/abstract-par-0.3.3">abstract-par-0.3.3</a></td><td>Type classes generalizing the functionality of the &#39;monad-par&#39; library</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/AC-Angle-1.0">AC-Angle-1.0</a></td><td>Angles in degrees and radians</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/accuerr-0.2.0.2">accuerr-0.2.0.2</a></td><td>Data type like Either but with accumulating error type</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/ace-0.6">ace-0.6</a></td><td>Attempto Controlled English parser and printer</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/action-permutations-0.0.0.1">action-permutations-0.0.0.1</a></td><td>Execute a set of actions (e.g. 
parsers) in each possible order</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/active-0.2.0.14">active-0.2.0.14</a></td><td>Abstractions for animation</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/ad-4.4">ad-4.4</a></td><td>Automatic Differentiation</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/adjunctions-4.4">adjunctions-4.4</a></td><td>Adjunctions and representable functors</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/adler32-0.1.2.0">adler32-0.1.2.0</a></td><td>An implementation of Adler-32, supporting rolling checksum operation</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/advent-of-code-api-0.2.7.0">advent-of-code-api-0.2.7.0</a></td><td>Advent of Code REST API bindings and servant API</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/aeson-1.4.7.1">aeson-1.4.7.1</a></td><td>Fast JSON parsing and encoding</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/aeson-attoparsec-0.0.0">aeson-attoparsec-0.0.0</a></td><td>Embed an Attoparsec text parser into an Aeson parser</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/aeson-better-errors-0.9.1.0">aeson-better-errors-0.9.1.0</a></td><td>Better error messages when decoding JSON values</td></tr><tr><td><a class="package-name" href="https://www.stackage.org/nightly-2020-05-19/package/aeson-casing-0.2.0.0">aeson-casing-0.2.0.0</a></td><td>Tools to change the formatting of field names in Aeson

Similar from any browser:

(screenshot: stackage)

I've tried to capture a screenshot with https://www.screenshotmachine.com, but it shows 2494. I believe the cause is CDN caching, which differs between regions.

Regardless, the original issue is that the page with incomplete data is generated.
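
Until the generation side changes, a consumer could guard against such pages by checking the advertised count. The following is only a minimal sketch: it assumes the `<h3>Packages (N)</h3>` heading seen in the curl output above stays stable, and `max_drop` is a hypothetical threshold, not anything Stackage defines.

```python
import re

# Matches the heading seen in the curl output, e.g. "<h3>Packages (1298)</h3>".
_PACKAGE_COUNT = re.compile(r"<h3>Packages \((\d+)\)</h3>")


def package_count(html: str) -> int:
    """Return the package count advertised in a snapshot page's heading."""
    match = _PACKAGE_COUNT.search(html)
    if match is None:
        raise ValueError("no 'Packages (N)' heading found")
    return int(match.group(1))


def looks_incomplete(html: str, previous_count: int, max_drop: float = 0.2) -> bool:
    """Flag a page whose count fell by more than max_drop since the last good fetch.

    max_drop is a hypothetical threshold chosen for illustration.
    """
    return package_count(html) < previous_count * (1 - max_drop)
```

With the numbers from this thread, a page advertising 1298 packages after a previous fetch of 2483 would be flagged, and the fetcher could keep the earlier data instead.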

juhp commented 4 years ago

@snoyberg maybe you can have a look at this when you find time, thanks

snoyberg commented 4 years ago

I think your message earlier is completely correct @juhp: using the web interface for this is the wrong thing to do. Taking the YAML files is the correct approach.

The CRON job that uploads the information is probably inserting packages one at a time into the database. Could that be improved? Sure, I guess, and I'd be happy to receive a PR. But it's not something I'd want to spend time on myself.
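
For anyone considering such a PR, the general idea would be to insert a whole snapshot inside one transaction so that readers never observe a partial set. The sketch below is a generic illustration only: it uses Python's `sqlite3` rather than the Haskell code and database that stackage-server actually uses, and the `package` table and `publish_snapshot` function are hypothetical names.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = "CREATE TABLE package (snapshot TEXT, name TEXT, UNIQUE (snapshot, name))"


def publish_snapshot(conn: sqlite3.Connection, snapshot: str, packages: list) -> None:
    """Insert every package of a snapshot atomically.

    Using the connection as a context manager commits only if all inserts
    succeed and rolls back otherwise, so a half-populated snapshot is never
    committed.
    """
    with conn:
        conn.executemany(
            "INSERT INTO package (snapshot, name) VALUES (?, ?)",
            [(snapshot, name) for name in packages],
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    publish_snapshot(conn, "nightly-2020-05-19", ["aeson-1.4.7.1", "ad-4.4"])
    print(conn.execute("SELECT COUNT(*) FROM package").fetchone()[0])
```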

juhp commented 4 years ago

(Sure, I was about to suggest moving this to https://github.com/fpco/stackage-server/issues)

snoyberg commented 4 years ago

Honestly, I'd probably close it as WONTFIX.

AMDmi3 commented 4 years ago

As I've said, there's no easy way to get the current YAML file: downloading the repository, either with git or as a snapshot, requires too much traffic and time for the purpose, and there's no way to download the latest YAML as a blob, as the path cannot be predicted and there's no symlink to the latest snapshot.

Summarizing, either fixing HTML generation or adding symlinks to the latest lts and nightly snapshots in the repository would do. Otherwise I'll have to drop support for Stackage in Repology.

snoyberg commented 4 years ago

The path can definitely be predicted. Nightly for 2020-05-19 would, for example, be https://github.com/commercialhaskell/stackage-snapshots/blob/master/nightly/2020/5/19.yaml. The file https://s3.amazonaws.com/haddock.stackage.org/snapshots.json provides information on the latest nightly and minor release for each LTS major release.
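
The mapping described here could be sketched roughly as follows. Assumptions, none confirmed in this thread: that snapshots.json is a flat object whose `"nightly"` and `"lts"` keys hold names such as `nightly-2020-05-19` and `lts-15.13`, that LTS files follow an `lts/<major>/<minor>.yaml` layout, and that the files are reachable through GitHub's raw.githubusercontent.com host. The function names are hypothetical.

```python
import json
from typing import Dict

# Assumed raw-file base for the stackage-snapshots repository.
RAW_BASE = "https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master"


def snapshot_yaml_url(name: str) -> str:
    """Map a snapshot name to the URL of its YAML file.

    Path components are not zero-padded, matching the nightly/2020/5/19.yaml
    example given above.
    """
    if name.startswith("nightly-"):
        year, month, day = name[len("nightly-"):].split("-")
        return f"{RAW_BASE}/nightly/{int(year)}/{int(month)}/{int(day)}.yaml"
    if name.startswith("lts-"):
        major, minor = name[len("lts-"):].split(".")
        return f"{RAW_BASE}/lts/{int(major)}/{int(minor)}.yaml"
    raise ValueError(f"unrecognized snapshot name: {name!r}")


def latest_snapshot_urls(snapshots_json: str) -> Dict[str, str]:
    """Return YAML URLs for the latest nightly and LTS named in snapshots.json text."""
    latest = json.loads(snapshots_json)
    return {key: snapshot_yaml_url(latest[key]) for key in ("nightly", "lts")}
```

A fetcher built this way still makes two requests (the JSON, then the YAML), but the second URL is derived from the first response rather than guessed from the date.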

AMDmi3 commented 4 years ago

> The path can definitely be predicted

It cannot, as there can be no snapshot for the current date yet, or at all, or the latest snapshot could be for the next day depending on the timezone. lts is not tied to a date at all.

> The file https://s3.amazonaws.com/haddock.stackage.org/snapshots.json provides information on the latest nightly and minor release for each LTS major release.

This is better, but it would still require writing a custom fetcher which looks into two, likely not synchronized, locations for such a simple thing as downloading a single file. Would adding symlinks in the repo be too hard? E.g. lts/latest → 15/13.yaml. I've already checked: it would allow the latest snapshot to be downloaded as https://github.com/commercialhaskell/stackage-snapshots/blob/master/lts/latest. It may benefit other consumers as well.

snoyberg commented 4 years ago

I don't want to add the symlinks, since that will change the repo from a write-only repo to one with modifications.

I think what I've provided here is pretty straightforward: a JSON file with a clear mapping to a URL with YAML files. I'm not even sure how the symlink approach would be easier to work with than that.

AMDmi3 commented 4 years ago

> I don't want to add the symlinks, since that will change the repo from a write-only repo to one with modifications.

What is the actual problem with that?

> I think what I've provided here is pretty straightforward: a JSON file with a clear mapping to a URL with YAML files. I'm not even sure how the symlink approach would be easier to work with than that.

Because with a symlink, I'll be able to download the latest snapshot as a regular file. Otherwise I'll have to write a custom script which first fetches the JSON, parses it to determine the snapshot name, and only then fetches the snapshot.

juhp commented 4 years ago

A lot of time passed...

If you prefer not to parse the json, you could check the redirect from stackage.org/nightly (e.g. on the command line, something like curl -I -L https://www.stackage.org/nightly | grep location:) and then download that location.

A git symlink would also require you to download 2 files I believe (the symlink file and the actual yaml).
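
In a script, the same redirect check might look like the sketch below. It assumes, as the dated URLs in this thread suggest but nothing here guarantees, that the alias redirects to a URL whose first path segment is the snapshot name; `resolve_latest` and `snapshot_name_from_url` are hypothetical helper names.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def snapshot_name_from_url(url: str) -> str:
    """Extract a snapshot name such as 'nightly-2020-05-19' from a snapshot page URL."""
    name = urlparse(url).path.strip("/").split("/")[0]
    if not name.startswith(("nightly-", "lts-")):
        raise ValueError(f"not a dated snapshot URL: {url!r}")
    return name


def resolve_latest(alias_url: str = "https://www.stackage.org/nightly") -> str:
    """Follow the alias redirect and return the snapshot name it lands on."""
    # urlopen follows HTTP redirects by default; geturl() reports the final URL.
    with urlopen(alias_url) as response:
        return snapshot_name_from_url(response.geturl())
```

The value of the redirect is the name itself: once the client knows which dated snapshot the alias currently points at, it can fetch that snapshot's data from wherever it prefers.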

AMDmi3 commented 4 years ago

> If you prefer not to parse the json

I'd prefer to parse JSON, but I'd also prefer to fetch it with a single request, without having to write custom code to determine which JSON to fetch.

> you could check the redirect from stackage.org/nightly (eg on command-line something like curl -I -L https://www.stackage.org/nightly | grep location:) and then download that location.

What is this supposed to achieve? I just use an HTTP client which follows redirects.