Open pombredanne opened 6 years ago
Can I take up this issue?
@yudhik11 That would be very nice of you. Thanks!
@pombredanne do I need to scrape those licenses online and check whether they are detected, or should I test the many licenses that are already in the repo? Also, the license data is present in src/licensedcode/data/licenses/ and tests/licensedcode/data/licenses/, and what confused me is that there are more licenses in the tests than in the src data set. So which license data should I use?
@yudhik11 so the steps would be:
tests/licensedcode/data/licenses/
.... And also the license data is present in src/licensedcode/data/licenses/ and tests/licensedcode/data/licenses/ and to me the thing that confused me was there are more licenses in tests then in src data set.
src/licensedcode/data/licenses/ contains the actual license files used as master data and for detection.
tests/licensedcode/data/licenses/ contains test files that we use to verify that the detection works correctly. They are used only when running the py.test test suite.
@pombredanne what do you mean by [[making license test from each of these in (Addr)]], and what does testing the detection by hand involve?
@pombredanne it would be nice if you could assign me a small task or a small issue, because I have been trying this for the past two weeks and I keep going in circles and cannot proceed.
Anything will be much appreciated.
@yudhik11 ok :) Let me break things down into smaller, bite-sized issues!
@pombredanne I am still waiting for your bite-sized issues :)
@yudhik11 Can you start by creating a JSON file? It should list the URL and name of each license page linked from the 18 pages at https://www.openhub.net/licenses and look like:
[
{"openhub_url": "https://www.openhub.net/licenses/Artistic_License_2_0", "name": "Artistic License 2.0"},
{"openhub_url": "https://www.openhub.net/licenses/Beerware", "name": "Beerware"},
.....
]
Best is to write a small script for this. Beautiful soup can help there.
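A minimal sketch of the link-extraction step, for illustration only. To keep it runnable without extra installs it uses the standard-library html.parser in place of Beautiful Soup, and it targets Python 3. It assumes that license links on the listing pages are anchors whose href starts with /licenses/; that markup assumption, and the names LicenseLinkParser and extract_licenses, are hypothetical and not taken from the OpenHub site or from this repo.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.openhub.net"


class LicenseLinkParser(HTMLParser):
    """Collect anchors whose href starts with /licenses/ (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Pagination links such as /licenses?page=2 do not match this prefix.
            if href.startswith("/licenses/"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.entries.append(
                    {"openhub_url": urljoin(BASE_URL, self._href), "name": name}
                )
            self._href = None


def extract_licenses(html):
    """Return a list of {"openhub_url", "name"} dicts found in one page of HTML."""
    parser = LicenseLinkParser()
    parser.feed(html)
    return parser.entries
```

Calling extract_licenses on the HTML of each of the 18 pages and extending one list with the results would give the full set of entries.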
@pombredanne here is the script I wrote; the attached req.txt shows what the output looks like. What should I do next?
@pombredanne can you help me more so that I can keep going?
Can you paste rather than attach the script?
What results do you have out of this?
Ok, I see, accumulate results in some list or iterable and use the json module. Do not create json yourself.
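As a rough illustration of that advice, the sketch below accumulates entries in a list and lets the json module do the serialization. It is written for Python 3; the helper name to_json and the (url, name) pair input are hypothetical choices made for the example.

```python
import json


def to_json(pairs):
    """Serialize (url, name) pairs as a JSON array of objects."""
    entries = []
    for url, name in pairs:
        # Accumulate plain dicts; quoting and escaping are left to json.
        entries.append({"openhub_url": url, "name": name})
    # ensure_ascii=False keeps non-ASCII license names readable in the output.
    return json.dumps(entries, indent=2, ensure_ascii=False)
```

Building the structure first and serializing once means names containing quotes or non-ASCII characters are escaped correctly, which hand-built JSON strings tend to get wrong.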
And best is to create a PR with your script in the etc/scripts directory
import urllib2
import requests
from bs4 import BeautifulSoup

for i in range(18, 19):
    wiki = "https://www.openhub.net/licenses?page=" + str(i)
    page = urllib2.urlopen(wiki)
    soup = BeautifulSoup(page, 'html.parser')
    all_lic = soup.find(id="license")
    all_as = all_lic.select("table a")
    all_a = [pt.get_text() for pt in all_as]
    for x in all_a:
        print(x.encode('utf-8'))
so we can review the code alright
ok will do it right now
See https://www.openhub.net/licenses . Several are likely to be variants of BSD and MIT or of existing licenses and will be detected as such. Some oddities may require a new license addition.