CottageLabs / LanternPM

Lantern meta repository for product management

Possible small addition to EPMC XML licence detection ability #56

Open emanuil-tolev opened 8 years ago

emanuil-tolev commented 8 years ago

The Python version at compliance.cottagelabs.com first looks at the license tag in EPMC's fulltext XML, reads its "license-type" attribute, and translates that value to our set of licences.

It uses this list of licence "types": https://github.com/CottageLabs/oacwellcome/blob/master/service/licences.py#L33

# the possible types we'll see in EPMC, and the canonical type they map to
types = {}
types.update(make_variation_map(["cc"], "cc"))  # I'm not sure we should use this one, but the rest, yes.
types.update(make_variation_map(["cc", "by"], "cc-by"))
types.update(make_variation_map(["cc", "by", "sa"], "cc-by-sa"))
types.update(make_variation_map(["cc", "by", "nd"], "cc-by-nd"))
types.update(make_variation_map(["cc", "by", "nc"], "cc-by-nc"))
types.update(make_variation_map(["cc", "by", "nc", "nd"], "cc-by-nc-nd"))
types.update(make_variation_map(["cc", "by", "nc", "sa"], "cc-by-nc-sa"))
types.update(make_variation_map(["cc0"], "cc0"))

# some types which are regularly mis-represented
types.update(make_variation_map(["cc", "nc"], "cc-by-nc"))
types.update(make_variation_map(["cc", "nc", "nd"], "cc-by-nc-nd"))

(Full list of translations after make_variation_map has run: https://gist.github.com/emanuil-tolev/4caeba3e5b599a0faa762ba55b4dc66d .)
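To make the excerpt above easier to follow, here is a minimal sketch of what a helper like make_variation_map might do. It is a stand-in written for illustration, not the code in licences.py: the separator set ("-" and " ") is an assumption inferred from the examples in this issue, and the real helper may generate more or different variations.

```python
from itertools import product

# Assumed separators between licence parts; the real helper may use others.
SEPARATORS = ("-", " ")


def make_variation_map(parts, canonical):
    """Map every separator-joined spelling of `parts` to `canonical`."""
    variations = {}
    # Pick one separator for each gap between adjacent parts.
    # With a single part there are no gaps, so exactly one key is produced.
    for seps in product(SEPARATORS, repeat=len(parts) - 1):
        key = parts[0]
        for sep, part in zip(seps, parts[1:]):
            key += sep + part
        variations[key] = canonical
    return variations


types = {}
types.update(make_variation_map(["cc", "by"], "cc-by"))
types.update(make_variation_map(["cc", "by", "nc"], "cc-by-nc"))
types.update(make_variation_map(["cc0"], "cc0"))
```

Under these assumptions, a three-part licence such as cc-by-nc gets four spellings ("cc-by-nc", "cc-by nc", "cc by-nc", "cc by nc"), all mapping to the same canonical value.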

It then uses that map to do a simple translation, e.g. "cc-by nc" becomes "cc-by-nc", "cc by" becomes "cc-by", etc.
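The lookup described here could be sketched roughly as follows. This is not the code from the Python app: the function name, the small TYPES map, and the sample document are made up for illustration, and the element and attribute names assume the JATS spelling (license, license-type) used in EPMC fulltext XML. Lower-casing and collapsing whitespace before the lookup is also an assumption.

```python
import xml.etree.ElementTree as ET

# Small hypothetical subset of the translation map described above.
TYPES = {
    "cc by": "cc-by",
    "cc-by": "cc-by",
    "cc-by nc": "cc-by-nc",
    "cc-by-nc": "cc-by-nc",
    "cc0": "cc0",
}


def licence_from_fulltext(xml_text, types=TYPES):
    """Return the canonical licence for the first recognised license element, or None."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("license"):
        raw = elem.get("license-type")
        if raw is None:
            continue
        # Normalise case and whitespace before the lookup (an assumption).
        key = " ".join(raw.lower().split())
        if key in types:
            return types[key]
    return None


# Made-up sample document, trimmed to the relevant section.
SAMPLE = """<article><front><article-meta><permissions>
<license license-type="CC-BY NC"><license-p>Some licence text.</license-p></license>
</permissions></article-meta></front></article>"""

print(licence_from_fulltext(SAMPLE))  # prints cc-by-nc
```

Because the lookup only reads one attribute on one element, short values like "cc0" are safe to match exactly here, which would not be true for a free-text search.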

We're currently running our scraper over this section of the fulltext XML, so we will catch whatever URLs we have recorded in our licence dataset. I'm not sure adding this attribute lookup will improve detection by much, but it's worth mentioning as probably the only "feature" of the old system that is not yet present in the Meteor one.
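As a rough illustration of the difference, the sketch below matches known licence URLs in the licence section, in the spirit of what the scraper does. It is not the Meteor implementation: the function name and the KNOWN_URLS excerpt are hypothetical, and the real licence dataset will be larger and structured differently.

```python
# Hypothetical excerpt of a licence dataset: URL fragment -> canonical licence.
KNOWN_URLS = {
    "creativecommons.org/licenses/by/4.0": "cc-by",
    "creativecommons.org/licenses/by-nc/4.0": "cc-by-nc",
    "creativecommons.org/publicdomain/zero/1.0": "cc0",
}


def licence_from_urls(section_text, known=KNOWN_URLS):
    """Return the licence for the first known URL fragment found, or None."""
    text = section_text.lower()
    # Check longer fragments first so a more specific URL wins.
    for fragment in sorted(known, key=len, reverse=True):
        if fragment in text:
            return known[fragment]
    return None


with_url = '<license xlink:href="http://creativecommons.org/licenses/by-nc/4.0/"/>'
attribute_only = '<license license-type="cc-by"/>'

print(licence_from_urls(with_url))        # prints cc-by-nc
print(licence_from_urls(attribute_only))  # prints None
```

The second case is the one this issue is about: a record that carries only the attribute value and no recognised URL would be missed by URL matching alone.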

Wellcome have been given the go-ahead to test out wellcome.test.cottagelabs.com, so if the change is made it would only improve accuracy for new results.

emanuil-tolev commented 8 years ago

I should also say: obviously we can't just add short 3-5 character strings like "cc-by" and "cc0" to the scraper, so this would need to be implemented as specific code that looks at the EPMC fulltext XML, similar to how the Python app does it.

richard-jones commented 7 years ago

Me to review the licence spreadsheet and see how this relates.