aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, and dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB, and other generous sponsors!
https://aboutcode.org/scancode/

Cross check that all licenses on https://www.openhub.net/licenses are properly detected #852

Open pombredanne opened 6 years ago

pombredanne commented 6 years ago

See https://www.openhub.net/licenses. Several are likely to be variants of BSD and MIT or of existing licenses and will be detected as such. Some oddities may require a new license addition.

yudhik11 commented 6 years ago

can I take up this issue?

pombredanne commented 6 years ago

@yudhik11 That would be very nice of you. Thanks!

yudhik11 commented 6 years ago

@pombredanne do I need to scrape those licenses online and check whether they are detected, OR are there licenses in the repo that I should test from there? Also, the license data is present in src/licensedcode/data/licenses/ and tests/licensedcode/data/licenses/, and what confused me is that there are more licenses in tests than in the src data set. So which license data should I use?

pombredanne commented 6 years ago

@yudhik11 so the steps would be:

  1. collect the texts online
  2. make license tests from each of these in tests/licensedcode/data/licenses/ ....
  3. also test the detection by hand
  4. if we do not have 100% detection, then either:
     1. add new rules, or
     2. add a new license if this is a new license

> And also the license data is present in src/licensedcode/data/licenses/ and tests/licensedcode/data/licenses/ and to me the thing that confused me was there are more licenses in tests than in the src data set.
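For step 2, ScanCode license detection tests are data-driven: each test pairs a text file holding the license text with a small YAML data file naming the expected license. A hypothetical pair is sketched below (the file names and the exact YAML field name are assumptions; copy the convention from the existing files under tests/licensedcode/data/licenses/):

```yaml
# tests/licensedcode/data/licenses/beerware_openhub.yml
# hypothetical companion to beerware_openhub.txt, which would hold the
# license text scraped from openhub.net
license_expressions:
    - beerware
```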

yudhik11 commented 6 years ago

@pombredanne what did you mean by "making license tests from each of these in tests/licensedcode/data/licenses/"? And also, what is testing the detection by hand?

pombredanne commented 6 years ago
yudhik11 commented 6 years ago

@pombredanne it would be nice if you could assign me a small task or any small issue, because I have been trying this for the past two weeks and I keep moving in circles and cannot proceed.

Anything will be much appreciated.

pombredanne commented 6 years ago

@yudhik11 ok :) Let me break things down into smaller issues that would be bite-sized!

yudhik11 commented 6 years ago

@pombredanne I am still waiting for your bite-sized issues :)

pombredanne commented 6 years ago

@yudhik11 Can you start by creating a JSON file that lists the license page URLs and names from the 18 pages at https://www.openhub.net/licenses ? This should look like:

```json
[
    {"openhub_url": "https://www.openhub.net/licenses/Artistic_License_2_0", "name": "Artistic License 2.0"},
    {"openhub_url": "https://www.openhub.net/licenses/Beerware", "name": "Beerware"},
    ...
]
```
pombredanne commented 6 years ago

Best is to write a small script for this. Beautiful soup can help there.
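A minimal sketch of such a script, using the standard library's `html.parser` so it runs without extra dependencies (Beautiful Soup would work just as well); the page structure here is an assumption, demonstrated on an inline HTML sample, so the tag and attribute checks would need adapting to the real openhub.net markup:

```python
import json
from html.parser import HTMLParser

BASE = "https://www.openhub.net"


class LicenseLinkParser(HTMLParser):
    """Collect {"openhub_url", "name"} dicts from anchors pointing at /licenses/<key>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("/licenses/") and href != "/licenses/":
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.links.append({"openhub_url": BASE + self._href, "name": name})
            self._href = None


def extract_licenses(html_text):
    parser = LicenseLinkParser()
    parser.feed(html_text)
    return parser.links


# Demo on an inline sample; the real script would fetch
# https://www.openhub.net/licenses?page=1 .. 18 and feed each page.
sample = '<table><tr><td><a href="/licenses/Beerware">Beerware</a></td></tr></table>'
print(json.dumps(extract_licenses(sample), indent=2))
```

The parsing logic is kept in a function so it can be tested on saved HTML without hitting the network.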

yudhik11 commented 6 years ago

@pombredanne here is the script which I wrote, and the attached "req.txt" shows how the output looks. What should I do next?

yudhik11 commented 6 years ago

@pombredanne can you help me more so that I can keep going?

pombredanne commented 6 years ago

Can you paste rather than attach the script?

pombredanne commented 6 years ago

What results do you have out of this?

pombredanne commented 6 years ago

Ok, I see. Accumulate the results in some list or iterable and use the json module; do not create JSON yourself. And best is to create a PR with your script in the etc/scripts directory.
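Accumulating into a list of dicts and serializing with the json module could look like this (the two entries are illustrative placeholders for whatever the scraper collects; the output file name is an assumption):

```python
import json

# Illustrative entries only; the real script would append one dict per
# license found on https://www.openhub.net/licenses?page=1 .. 18.
results = []
for url, name in [
    ("https://www.openhub.net/licenses/Artistic_License_2_0", "Artistic License 2.0"),
    ("https://www.openhub.net/licenses/Beerware", "Beerware"),
]:
    results.append({"openhub_url": url, "name": name})

# json.dump handles all quoting and escaping; never concatenate JSON strings by hand.
with open("openhub_licenses.json", "w") as out:
    json.dump(results, out, indent=2)
```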

yudhik11 commented 6 years ago

```python
import urllib2
import requests
from bs4 import BeautifulSoup

for i in range(18, 19):
    wiki = "https://www.openhub.net/licenses?page=" + str(i)
    page = urllib2.urlopen(wiki)
    soup = BeautifulSoup(page, 'html.parser')
    all_lic = soup.find(id="license")
    all_as = all_lic.select("table a")
    all_a = [pt.get_text() for pt in all_as]
    for x in all_a:
        print(x.encode('utf-8'))
```

pombredanne commented 6 years ago

Alright, so now we can review the code.

yudhik11 commented 6 years ago

ok will do it right now