CottageLabs / LanternPM

Lantern meta repository for product management
1 stars 0 forks source link

Article licence not picked up from XML (or HTML in a large number of cases that appear to have the same root cause) #114

Closed richard-jones closed 7 years ago

richard-jones commented 8 years ago

In this article PMID 23225165 the epmc licence information was found in the html, but the same information appears to be available in the XML which I think we check first so I was wondering why it didn’t find it in the XML?

@emanuil-tolev to review and pass to @markmacgillivray if work required

emanuil-tolev commented 8 years ago

Oh gosh, this is usually a pain to answer.

Full-text XML URL: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3955487/fullTextXML HTML URL: http://europepmc.org/articles/PMC3955487 (note to self, json api url: http://www.ebi.ac.uk/europepmc/webservices/rest/search?query=PMC3955487&resulttype=core&format=json )

The string "This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited." is indeed present in both. However, in the XML it's not in the permissions section. It should still be picked up by the "epmc_xml_outside_permissions" part of the detection procedure and should not go to HTML detection. Investigating that now.

emanuil-tolev commented 8 years ago

There might be a problem with the fulltext XML licence checking, though I haven't established that a/ for sure; b/ consistently (it has worked in the past, I've seen examples).

See this log output:

I20160823-20:55:10.920(1)? qcGPDuXNuxuyXfR2x
I20160823-20:55:10.924(1)? http://www.ebi.ac.uk/europepmc/webservices/rest/search?query=PMCID:PMC3955487&resulttype=core&format=json
I20160823-20:55:11.296(1)? Fulltext length: 29064
I20160823-20:55:11.296(1)? Fulltext copy recorded in /win/d/cottagelabs/oacwellcome/23-aug-2016/PMC3955487.fulltext.xml
I20160823-20:55:11.297(1)? Academic licence doing find licences on content of length 29064
I20160823-20:55:11.298(1)? Getting licences data from google spreadsheet for academic licence calculator
I20160823-20:55:12.456(1)? 235 strings available for licence checking
I20160823-20:55:12.457(1)? Academic licence reduced content length to 71
I20160823-20:55:12.474(1)? PMC3955487 licinperms XML check: 
I20160823-20:55:12.475(1)? { licence: 'unknown', retrievable: true }
I20160823-20:55:12.476(1)? PMC3955487 licinperms XML check failed
I20160823-20:55:12.477(1)? Academic licence doing find licences on content of length 29064
I20160823-20:55:12.477(1)? licences data retrieved from local copy .licences.json
I20160823-20:55:12.481(1)? 235 strings available for licence checking
I20160823-20:55:12.482(1)? Academic licence reduced content length to 29064
I20160823-20:55:12.513(1)? PMC3955487 licanywhere XML check:
I20160823-20:55:12.514(1)? { licence: 'unknown', retrievable: true }
I20160823-20:55:12.514(1)? PMC3955487 licanywhere XML check failed
I20160823-20:55:12.514(1)? PMC3955487 licinperms XML check discovered non-standard-licence
I20160823-20:55:12.514(1)? PMC3955487 no fulltext XML, trying EPMC HTML
I20160823-20:55:12.515(1)? Getting http://europepmc.org/articles/PMC3955487 which resolved to http://europepmc.org/articles/PMC3955487 for licence check
I20160823-20:55:13.259(1)? Academic licence doing find licences on content of length 91712
I20160823-20:55:13.260(1)? licences data retrieved from local copy .licences.json
I20160823-20:55:13.264(1)? 235 strings available for licence checking
I20160823-20:55:13.264(1)? Academic licence reduced content length to 91712
I20160823-20:55:13.343(1)? academic licence matched on text underthetermsofthecreativecommonsattributionlicense
I20160823-20:55:13.344(1)? PMC3955487 licsplash HTML check on http://europepmc.org/articles/PMC3955487
I20160823-20:55:13.344(1)? { url: 'http://europepmc.org/articles/PMC3955487',
I20160823-20:55:13.344(1)?   resolved: 'http://europepmc.org/articles/PMC3955487',
I20160823-20:55:13.344(1)?   licence: 'cc-by',
I20160823-20:55:13.344(1)?   retrievable: true,
I20160823-20:55:13.345(1)?   matched: 'underthetermsofthecreativecommonsattributionlicense',
I20160823-20:55:13.347(1)?   matcher: 'under the terms of the Creative Commons Attribution License' }
I20160823-20:55:13.347(1)? PMC3955487 licsplash HTML check success cc-by

As you can see, the licence is found via the HTML, and the "licanywhere" XML check has failed.

However, I saved the fulltext XML available to the licence checker by inserting this code:

    debugxml = '/win/d/cottagelabs/oacwellcome/23-aug-2016/' + pmcid + '.fulltext.xml';
    console.log('Fulltext copy recorded in ' + debugxml);
    var fs = Meteor.npmRequire('fs');
    fs.writeFileSync(debugxml, fulltext);

and I can find the "matcher" string detailed in the last lines of the log: "under the terms of the Creative Commons Attribution License". I find that just using vim. Here is the XML file I obtain:

PMC3955487.fulltext.xml.zip

Intuitively, the licanywhere (anywhere in the FT XML) licence check should be succeeding here and we should not be going to HTML, so there might well be a problem. Any ideas welcome - I'll try to debug exactly why it's happening, but might not get very far today or early tomorrow.

emanuil-tolev commented 8 years ago

The problem must be somewhere around the text produced from the fulltext XML or the subsequent matching.

emanuil-tolev commented 8 years ago

Yeah alright, so the XML fulltext from EPMC has this:

This article is distributed under the terms of the Creative Commons Attribution License

But the resulting normalised text string used for matching has none of that. Obviously something's going wrong there.

markmacgillivray commented 8 years ago

It is odd that there does not appear to be a section on that XML, and yet the first XML check DOES reduce the length of the content to 71, and also as you can see from your output, it is saying it DOES find a

section in the XML, saying discovered non-standard licence. So what content is it operating on at that point, and where is it finding ? However if I look directly at the XML source on EPMC, I can see that file DOES have a permisions section: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3955487/fullTextXML, and it is a non-standard licence. Are you somehow running on old data? On Tue, Aug 23, 2016 at 9:46 PM, Emanuil Tolev notifications@github.com wrote: > Furthermore, it appears that the xml2txt conversion, lowercasing and > whitespace stripping is happening successfully. I can still find " > underthetermsofthecreativecommonsattributionlicense" in the resulting > text string, so the licence detection still appears to be in the wrong to > me - it should be finding this statement in the txt produced from the > fulltext XML. > > Roll that back - not true. I mistook the text produced from the HTML page > for the text produced from the XML. The problem must be somewhere around > the text produced from the fulltext XML or the subsequent matching. > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > https://github.com/CottageLabs/LanternPM/issues/114#issuecomment-241872938, > or mute the thread > https://github.com/notifications/unsubscribe-auth/AAuXCB4ZJ-os3vBYf5SUCP1LT2_Aq34qks5qi1whgaJpZM4JpoV8 > .
markmacgillivray commented 8 years ago

I don't recall writing that maybe_licence part, perhaps you added it. But it seems the code could find a section but the licence was non-standard so may have been set to unknown, then set to maybe_licence, but that does not stop an html check from taking place. Then it does find the licence in the html.

So your problem could be you ARE finding a licence in the section, but it is non-standard, then not looking for one elsewhere. I do not know why the file you sent does not have a section in it though...

On Tue, Aug 23, 2016 at 9:49 PM, Mark MacGillivray mark@cottagelabs.com wrote:

It is odd that there does not appear to be a section on that XML, and yet the first XML check DOES reduce the length of the content to 71, and also as you can see from your output, it is saying it DOES find a

section in the XML, saying discovered non-standard licence. So what content is it operating on at that point, and where is it finding ? However if I look directly at the XML source on EPMC, I can see that file DOES have a permisions section: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3955487/fullTextXML, and it is a non-standard licence. Are you somehow running on old data? On Tue, Aug 23, 2016 at 9:46 PM, Emanuil Tolev notifications@github.com wrote: > Furthermore, it appears that the xml2txt conversion, lowercasing and > whitespace stripping is happening successfully. I can still find > "underthetermsofthecreativecommonsattributionlicense" in the resulting > text string, so the licence detection still appears to be in the wrong to > me - it should be finding this statement in the txt produced from the > fulltext XML. > > Roll that back - not true. I mistook the text produced from the HTML page > for the text produced from the XML. The problem must be somewhere around > the text produced from the fulltext XML or the subsequent matching. > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > https://github.com/CottageLabs/LanternPM/issues/114#issuecomment-241872938, > or mute the thread > https://github.com/notifications/unsubscribe-auth/AAuXCB4ZJ-os3vBYf5SUCP1LT2_Aq34qks5qi1whgaJpZM4JpoV8 > .
emanuil-tolev commented 8 years ago

The algorithm in the licence checker runs as follows:

  1. <permissions> check. This fails for this article.
  2. Anywhere in XML check. This fails for this article.
  3. Check whether a permissions section exists at all. If so, and step 1 did not match a licence, this means that the licence could be a non-standard licence, provided that there is no further information in the HTML. That's how the old tool operated. Essentially this just checks for the existence of a permissions section. EDIT: I think what you've seen and are referring to above is this step. This is not the positive identification of a licence and could be overwritten by the next step.
  4. EPMC HTML check.

The problem is that step 2 is failing, where it very clearly should not be. I'm trying to pin down the problem but might have to continue tomorrow. At the moment I can clearly see the licence in the fulltext XML (which step 2 should then detect), but once the Meteor "html-to-text" package is done with it, the licence terms are gone.

emanuil-tolev commented 8 years ago

Here's the problem: line 91 of licence.js.

Current code:

text = CLapi.internals.convert.xml2txt(undefined,content).toLowerCase().replace(/[^a-z0-9]/g,'');

If I comment this out and instead put

text = content.toLowerCase().replace(/[^a-z0-9]/g,'');

the licence is found in step 2, the "licanywhere" check. The conversion from fulltext XML to a normalised string ready for licence matching discards a part of the content, including the licence terms in this case. The first step in this normalisation is to call CLapi.internals.convert.xml2txt which also discards the content we need, and itself calls the html-to-text NPM package. Looks like this package might work well with HTML, but maybe not so well with XML?

emanuil-tolev commented 8 years ago

Here's a minimal example to demonstrate.

We start with some fulltext XML, has the licence terms in it (look for "under the terms"):

fulltext.xml.in-findlics5384.zip

For step 2, the check for a licence anywhere in the XML fulltext, this XML is run through this line of code:

text = CLapi.internals.convert.xml2txt(undefined,content).toLowerCase().replace(/[^a-z0-9]/g,'');

The result is this file:

fulltext.txt-inside-findlics.9388.zip

which does not contain "undertheterms". But if I remove the call to CLapi.internals.convert.xml2txt with a simple lowercase and regex replace (which is already on that line), then the resulting text does contain "undertheterms" and the licence is identified correctly at the right stage.

The files/strings above are output directly from the findlicences function, just before matching is performed. To ensure I'm not confusing the output of different runs of findlicences (like for permissions check, or HTML check), a random 4-digit number is appended to filenames, and the filenames corresponding to different parts of the run are output in the log so I know what to check/attach here.

emanuil-tolev commented 8 years ago

I do not know why the file you sent does not have a section in it though...

This does happen if the publisher has put the licence in the wrong place in the XML. That's why we also check the entire XML (step 2) after checking <permissions> (step 1) - for cases like this.

markmacgillivray commented 8 years ago

But I am saying the file you attached here, having saved it, does not seem to have a section, whereas when looking at the file on EBI it does.

On Tue, Aug 23, 2016 at 10:25 PM, Emanuil Tolev notifications@github.com wrote:

I do not know why the file you sent does not have a section in it though...

This does happen if the publisher has put the licence in the wrong place in the XML. That's why we also check the entire XML (step 2) after checking

(step 1) - for cases like this. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CottageLabs/LanternPM/issues/114#issuecomment-241884150, or mute the thread https://github.com/notifications/unsubscribe-auth/AAuXCHOVkZ-2QF_NHkKRdXc7Szwkfl2aks5qi2U4gaJpZM4JpoV8 .
markmacgillivray commented 8 years ago

convert.xml2txt could be changed to convert.html2txt, and xml2txt could instead wrap the file2txt, which uses textract. That may work better for xml

emanuil-tolev commented 8 years ago

convert.xml2txt could be changed to convert.html2txt, and xml2txt could instead wrap the file2txt, which uses textract. That may work better for xml

Yeah, looks like it. Is it worth you trying that if you've any time today? The criteria is simple enough - get "epmc_xml_outside_permissions" source for a "cc-by" licence for PMID 23225165 .

Atm I'm looking at the titles bug that got raised yesterday - that sort of looks like it might have had a fix deployed to develop already.

markmacgillivray commented 8 years ago

The licence is now picked up from the xml first, but as I mentioned before, there IS a permissions section in the xml, and it contains a non-standard licence. So the result is epmx_xml_permissions identifying a non_standard_licence.

Here is an example:

https://dev.api.cottagelabs.com/service/lantern/EWxWx8vnPcC72w5d5/results

It is possible to ALSO check outside the permissions field when it is present, but that would be a different process to what we currently do. Do you know which it is that Wellcome expect? e.g. take any licence IN permissions over any licence NOT in permissions? And is it clear whether changing this now would be a change from old functionality, or a replication of old functionality?

@richard-jones @emanuil-tolev pinging you both to make sure you see this message - this fix is not on live yet, just dev. I need clarification on the above please.

markmacgillivray commented 8 years ago

This is now on live to the extent that it does properly parse the xml, and finds the non-standard licence in the permissions area.

emanuil-tolev commented 8 years ago

So as I said above, we should try to detect licences from our licence list in <permissions> first (1). Then (2) we should attempt to run the licence list over the rest of the XML. Then (3) we should run the licence list over the EPMC HTML.

Only then (4) do we look at whether there is a <permissions> section with a licence that we could not detect. This is a much less specific assertion than a positive ID of "the licence is this because this is the string we found". That's why it should come last in the order of preference. This was implemented with the maybe_licence var.

The article's XML is: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3955487/fullTextXML

It states two licences:

The latter is indeed a non standard licence, but the former is not. Under the sequence of steps I've numbered above, the CLapi.internals.use.europepmc.licence function should detect a "cc-by" EPMC Licence with source "epmc_outside_xml_permissions". This is because step 2 (one of 3 attempts at positive identification through a text string) runs before step 4 (identification through "there is something in <permissions> that we don't recognise, so it must be non standard").

As far as I can see the current code in develop reflects those four numbered steps.

When I investigated this, step 2 was failing and step 3 was succeeding instead. Step 2 was failing because the XML was being truncated by its conversion to text, for some reason.

This is now on live to the extent that it does properly parse the xml, and finds the non-standard licence in the permissions area.

You've probably guessed as much by the time you read here, but this should only be the last resort - it should detect the cc-by outside the permissions tag first. Or at least detect the cc-by in the HTML (as it did just before your latest changes).

emanuil-tolev commented 8 years ago

Removing from Wellcome milestone as this is now a discussion of how to improve the licence detection in Lantern (and probably our common understanding of it), given a specific example article.

emanuil-tolev commented 8 years ago

This appears to be affecting a large number of articles run by Wellcome (and likely by all Lantern users) over the past 7-10 days. We initially attributed the large amount of non-standard-licence and unknowns to EPMC downtime, but it turns out the xml2txt function we discussed above is responsible. 1-line fix is to stop using it and use just plain JS lowercase and regex char stripping, now in a PR.

richard-jones commented 7 years ago

@markmacgillivray @emanuil-tolev - what's the status of this issue? Is it fixed, or is there something still to resolve here?

markmacgillivray commented 7 years ago

Done