Open emanuil-tolev opened 8 years ago
I should also say: obviously we can't just put 3-5 character strings like "cc-by" and "cc0" into the scraper, so this would need to be implemented via specific code that looks at the EPMC fulltext XML, similar to how the Python app does it.
Action for me: review the licence spreadsheet and see how this relates.
The compliance.cottagelabs.com Python version first looks at the licence tag (`<license>`) in EPMC's fulltext XML, reads its "license-type" attribute, and translates the value to our set of licences.
It uses this list of licence "types": https://github.com/CottageLabs/oacwellcome/blob/master/service/licences.py#L33
(Full list of translations after make_variation_map has run: https://gist.github.com/emanuil-tolev/4caeba3e5b599a0faa762ba55b4dc66d .)
It then uses that list to do a simple translation, e.g. "cc-by nc" becomes "cc-by-nc", "cc by" becomes "cc-by", etc.
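For anyone picking this up, here is a rough sketch of what that lookup could look like. It is not the code from the Python app: the licence list is a made-up, abbreviated stand-in for the one in licences.py, the function names are hypothetical, and it normalises separators directly instead of building a variation map. It also assumes the attribute is spelled `license-type` on a `<license>` element, as in JATS.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated stand-in for the real licence list in licences.py.
CANONICAL_LICENCES = {"cc-by", "cc-by-nc", "cc-by-nc-nd", "cc-by-sa", "cc0"}


def normalise_licence_type(raw):
    """Map a free-form licence-type string to a canonical id, or None.

    Lowercases the value and collapses runs of spaces, underscores and
    hyphens into single hyphens, so "cc-by nc" becomes "cc-by-nc".
    """
    key = re.sub(r"[\s_-]+", "-", raw.strip().lower())
    return key if key in CANONICAL_LICENCES else None


def licence_from_fulltext(xml_text):
    """Return the first recognised licence found in the fulltext XML, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Drop any "{namespace}" prefix so we match on the local tag name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name != "license":
            continue
        raw = element.get("license-type")  # assumed JATS attribute name
        if raw:
            licence = normalise_licence_type(raw)
            if licence is not None:
                return licence
    return None


SAMPLE_XML = """
<article>
  <front><article-meta><permissions>
    <license license-type="cc-by nc">
      <license-p>Illustrative licence text.</license-p>
    </license>
  </permissions></article-meta></front>
</article>
"""

if __name__ == "__main__":
    print(licence_from_fulltext(SAMPLE_XML))
```

Values that don't normalise to a known licence (for example a generic "open-access") return None, so the URL-based scraper could still be used as a fallback in those cases.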
We're currently running our scraper over this section of the fulltext XML so we will catch whatever URLs we have recorded in our licence dataset. I'm not sure adding this ability will improve detection by much, but it's worth mentioning as probably the only "feature" of the old system that is not yet present in the meteor one.
Wellcome have been given the go-ahead to test out wellcome.test.cottagelabs.com, so if the change is made it would only improve accuracy for new results.