furrer-lab / abn

Bayesian network analysis in R
https://r-bayesian-networks.org/
GNU General Public License v3.0

add urlchecks to the tests #80

Closed · j-i-l closed this 1 month ago

j-i-l commented 1 month ago

Closes #13

j-i-l commented 1 month ago

Aaand we again get a 403 on a URL that resolves just fine in a browser: https://github.com/furrer-lab/abn/actions/runs/8701967813/job/23865029772#step:5:60

The URL in question:

https://onlinelibrary.wiley.com/doi/10.1111/tbed.13672

With `curl -I` I also get `HTTP/2 403`, but in a browser the URL is valid. The URL used in the citation is https://doi.org/10.1111/tbed.13672, which gives a 302 when requested with `curl -I`; that redirect would also be flagged by urlchecker.
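For reference, a minimal sketch of how these HEAD requests could be reproduced from the R side with the `curl` package (`head_status()` is just an illustrative helper; the status codes in the comments are the ones observed above):

```r
library(curl)

# Issue a HEAD-like request and return only the HTTP status code.
# nobody = TRUE suppresses the response body (like `curl -I`);
# followlocation controls whether redirects are followed.
head_status <- function(url, follow = FALSE) {
  h <- new_handle(nobody = TRUE, followlocation = follow)
  curl_fetch_memory(url, handle = h)$status_code
}

head_status("https://onlinelibrary.wiley.com/doi/10.1111/tbed.13672")  # 403 observed
head_status("https://doi.org/10.1111/tbed.13672")                      # 302 observed
```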

My conclusion is that we cannot really use urlchecker as is. We should either ignore redirects from the doi.org domain or ignore 403s. The former seems the cleaner solution; however, CRAN will not like our doi.org/... URLs then.

@matteodelucchi what do you think about this?

matteodelucchi commented 1 month ago

> Aaand we again get a 403 on a URL that resolves just fine in a browser: https://github.com/furrer-lab/abn/actions/runs/8701967813/job/23865029772#step:5:60
>
> The URL in question:
>
> https://onlinelibrary.wiley.com/doi/10.1111/tbed.13672
>
> With `curl -I` I also get `HTTP/2 403`, but in a browser the URL is valid. The URL used in the citation is https://doi.org/10.1111/tbed.13672, which gives a 302 when requested with `curl -I`; that redirect would also be flagged by urlchecker.

This seems to be a known and open issue with urlchecker (see here).

> My conclusion is that we cannot really use urlchecker as is. We should either ignore redirects from the doi.org domain or ignore 403s. The former seems the cleaner solution; however, CRAN will not like our doi.org/... URLs then.
>
> @matteodelucchi what do you think about this?

I agree that ignoring redirects from doi.org is the cleaner option, but it has drawbacks for CRAN. Given that this issue is clearly on the urlchecker side, I'd go for ignoring 403s until this has been fixed in urlchecker. We still get the printout of the urlchecker results and should review it manually. I'd add a note about this to the contribution file.
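A minimal sketch of this "ignore 403s but keep the report" idea (the `Status` column name is an assumption about what `urlchecker::url_check()` returns, not its documented interface):

```r
# Run the URL check, always print the report so it can be reviewed manually,
# and only escalate on problems that are not 403s.
res <- urlchecker::url_check(".")
print(res)

# NOTE: assumes the result is a data frame with a 'Status' column.
non_403 <- res[res$Status != "403", , drop = FALSE]
if (nrow(non_403) > 0) {
  stop("URL check reported problems other than 403s")
}
```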

matteodelucchi commented 1 month ago


Decision:

j-i-l commented 1 month ago


> Decision:

* Don't ignore 403s. Filter the URLs that report an error; if they start with `https://doi.org`, accept them as non-errors -> no escalation (see the sketch below).
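A hedged sketch of what that decision could look like (again assuming `url_check()` returns a data frame with `URL` and `Status`-like columns; these names are assumptions, not the documented urlchecker interface):

```r
# Run the check and keep the full report in the job log.
res <- urlchecker::url_check(".")
print(res)

# Accept doi.org entries as non-errors; escalate only on the rest.
is_doi   <- startsWith(res$URL, "https://doi.org")
escalate <- res[!is_doi, , drop = FALSE]
if (nrow(escalate) > 0) {
  stop("URL check found problems outside https://doi.org")
}
```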

If we replace the 403-causing URL with the doi.org/... URL of the paper in question, we still get a 403 back. curl tells us that we have a 302 (temporary redirect), which is fine for urlchecker; the 403 then presumably comes from the redirected URL again.

Basically, we get a 403 for a page that does exist. The cause could lie in the request headers, e.g. the user agent. In any case this does not seem practical, as there currently appears to be no way to avoid the error for the URL in question.

To address this problem we could either add a specific exception (which seems a tedious approach) or allow the test to fail in some form.

If we allow the test to fail, we can either use `continue-on-error` (or similar) for the job, which leads to a :x: in the workflow even though the job still completes, or we simply do not raise an exception when running the URL checks.

I'll go for the latter for now and not raise an exception. The URL check then becomes just a source of information, but at the moment this seems to be the most practical approach.
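A minimal sketch of this "informational only" variant (a plain script, assuming only that `urlchecker::url_check()` can be called on the package root and its result printed):

```r
# Run the URL check purely for information: print whatever it reports,
# never signal failure to the CI job.
res <- tryCatch(
  urlchecker::url_check("."),
  error = function(e) {
    message("urlchecker itself failed: ", conditionMessage(e))
    NULL
  }
)
if (!is.null(res)) {
  print(res)
}
# No stop() or quit(status = 1) here, so the workflow step always succeeds.
```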

matteodelucchi commented 1 month ago

I just learned that DOIs are not URLs (see here)... I will change the DOIs in the documentation according to this template asap.

j-i-l commented 1 month ago

> I just learned that DOIs are not URLs (see here)... I will change the DOIs in the documentation according to this template asap.

Makes sense to use the proper \doi{} format. However, in the end, if one requests content via HTTPS, the requested address is just a URL; the DOI can be part of that URI, but the request will eventually lead to https://doi.org/..., and for our 403 problem we will end up, after the 302 from doi.org/..., on the same URL that gives a 403 when using curl (or similar).

Using \doi{} might fix our issue if urlchecker does not evaluate \doi{} entries (or evaluates them differently).
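For illustration, a before/after sketch of such a change in the roxygen documentation (author, year, and title are placeholders; only the DOI comes from this thread, and this is not necessarily the exact template referred to above):

```r
# Before: the resolver URL is embedded directly in the documentation text.
#' @references
#' Author, A. (Year). Placeholder title. \url{https://doi.org/10.1111/tbed.13672}

# After: the \doi{} Rd markup carries only the identifier and lets R render the link.
#' @references
#' Author, A. (Year). Placeholder title. \doi{10.1111/tbed.13672}
```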

matteodelucchi commented 1 month ago


> Makes sense to use the proper \doi{} format. However, in the end, if one requests content via HTTPS, the requested address is just a URL; the DOI can be part of that URI, but the request will eventually lead to https://doi.org/..., and for our 403 problem we will end up, after the 302 from doi.org/..., on the same URL that gives a 403 when using curl (or similar).

> Using \doi{} might fix our issue if urlchecker does not evaluate \doi{} entries (or evaluates them differently).

Hmm... I see...

Additionally, I just realised that the URLs causing the problem are in the README.md, where \url{} doesn't work (it is roxygen syntax used only in the function documentation). Let's keep it as is...