ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License
197 stars 37 forks source link

Test hasPDF from EPMC #72

Closed tarrow closed 8 years ago

tarrow commented 8 years ago

I think this solves the PDF issues PMR had in #58 . Not actually a timeout though.

blahah commented 8 years ago

Thanks for the PR!

You've identified the problem correctly, and your solution is along the right lines. The problem is that fullTextUrlList sometimes doesn't exist, and if we try to index into it, we crash.

Some comments:

Testing for hasPDF is a good idea in general, as it saves us looping through the URLs only to find there's no PDF. But to avoid ever trying to access a key that doesn't exist we should also directly test for the existence of that key. e.g.:

if (result.fullTextUrlList) {
  ...
}

or more likely...

if (!(result.fullTextUrlList) {
   ... warn and return
}

And now we have several points in this function where we might log a message about there being no PDF and then return - using the DRY principle we should write the code for this once, e.g.:

var noPDF = function(id) {
  log.warn('no pdf for id ', id);
  return null;
}

...

if (condition1) { return noPDF(id); }

...

if (condition2) { return noPDF(id); }

If you could incorporate those two changes we can get this merged :). Thanks again!

blahah commented 8 years ago

Ah, and we should check whether the same problem occurs anywhere else in the code - i.e. do we require the result to contain a certain key? If so we should always check it exists before accessing it, and cleanly handle a situation where it doesn't exist.

tarrow commented 8 years ago

Hopefully I have now handled all the cases where there is no fulltext URL.

Again; I need a style (and quality) check though :). Thanks for all the detail in the last one!

blahah commented 8 years ago

This looks spot on - well done and thanks :)