there has to be way to use a newer version of the license list - Githubissues

eellak / clio

Clio, a web-based system for maintaining (meta-)information on software components

https://clio.ellak.gr/

Other

there has to be way to use a newer version of the license list #39

Open zvr opened 5 years ago

zvr commented 5 years ago

The license list should be updated to the latest list published by SPDX, v3.4: https://spdx.org/licenses/

Chinmay-Gurjar commented 5 years ago

I have created a Python script that extracts data from the spdx.org website and stores it in a .csv file. Should I include the script in my pull request too, so that whenever the license list gets updated we can just run the script and update our license list?

gopuvenkat commented 5 years ago

I have maintained my script as a public gist.

zvr commented 5 years ago

The upstream data to be used are in https://github.com/spdx/license-list-data

Chinmay-Gurjar commented 5 years ago

Can we add https://github.com/spdx/license-list-data to our project and extract the data from the JSON file instead of a .csv file? @zvr
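For reference, a minimal sketch of reading the JSON form of the list. It assumes the `licenses.json` layout published in spdx/license-list-data: a top-level `licenseListVersion` plus a `licenses` array whose entries carry `licenseId`, `name`, `isOsiApproved` and `isDeprecatedLicenseId`. The sample document and the `parse_license_list` helper are illustrative only and are not part of clio.

```python
import json

# Trimmed, hand-written sample in the assumed licenses.json layout.
SAMPLE = """
{
  "licenseListVersion": "3.4",
  "licenses": [
    {"licenseId": "MIT", "name": "MIT License",
     "isOsiApproved": true, "isDeprecatedLicenseId": false},
    {"licenseId": "GPL-2.0", "name": "GNU General Public License v2.0 only",
     "isOsiApproved": true, "isDeprecatedLicenseId": true}
  ]
}
"""


def parse_license_list(text):
    """Return (version, {licenseId: entry}) from a licenses.json document."""
    doc = json.loads(text)
    by_id = {entry["licenseId"]: entry for entry in doc["licenses"]}
    return doc["licenseListVersion"], by_id


version, licenses = parse_license_list(SAMPLE)
print(version, sorted(licenses))
```

Indexing by `licenseId` gives every later step (comparison, update, deprecation) a single lookup key.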

zvr commented 5 years ago

That would only solve the issue of initial import of licenses. What would be a solution for updating the list when a new version is published?

Chinmay-Gurjar commented 5 years ago

Then we should extract the data directly from https://spdx.org/licenses/, not from the repository https://github.com/spdx/license-list-data.

zvr commented 5 years ago

?!? The website is generated from the data; the information is the same. My question is: you get the data and you use them to populate the database. What do you do when a new version of the license list is published (in both the license-list-data repo and the spdx.org website)?

Chinmay-Gurjar commented 5 years ago

The script by @gopuvenkat at https://gist.github.com/gopuvenkat/1c8b9f75d366c191f1ec4afffb84696f would do the thing with just amending the license-text attribute. @zvr but when I tried it, there were some formatting issues, which I will have to solve.

zvr commented 5 years ago

No, you don't understand. Ignore Gopu's script (which incorrectly uses the website instead of the data). You have some license data (from the repo or the website), and you populate the database with this info. Then you start using clio, add your data about components, etc. Then a new SPDX license list is published. What do you do?

You cannot re-run populate_license() again, since most of the licenses are already in the database... and you definitely do not want to delete everything and start from scratch again.

Chinmay-Gurjar commented 5 years ago

This solution may sound lame, but we can write a script that checks for new licenses in https://github.com/spdx/license-list-data. @zvr, please share your thoughts if you have other ideas.

shivanshuraj1333 commented 5 years ago

@zvr and @gopuvenkat A possible solution I can think of uses hashing. Please refer to the following steps:

1. On initial clio startup, populate the database from a csv file created from the JSON files at https://github.com/spdx/license-list-data/tree/master/json (currently the csv file is generated from https://spdx.org/licenses/).
2. Keep our csv file in sync with the JSON files whenever a new commit is made to the GitHub repo that maintains them (https://github.com/spdx/license-list-data/tree/master/json). The GitHub API can be used to track commits.
3. Use hashing to update the csv file (append new entries and modify existing ones).
4. Load the updated csv into the database via an update button or a time-based job scheduler (cron job).

This way the update runs in O(n) time and unnecessary changes to our database are avoided. @zvr, @gopuvenkat please share your views.
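The hashing idea can be sketched roughly as follows. This is an illustration under assumptions, not the script discussed in the thread: the `fingerprint` and `changed_ids` helpers and the sample entries are hypothetical, and the stored digests would have to be persisted somewhere between runs.

```python
import hashlib
import json


def fingerprint(entry):
    """Stable SHA-256 digest of a license entry (key order does not matter)."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_ids(stored_hashes, upstream):
    """IDs that are new upstream or whose digest differs from the stored one."""
    return sorted(
        lic_id
        for lic_id, entry in upstream.items()
        if stored_hashes.get(lic_id) != fingerprint(entry)
    )


# Hypothetical data: digests saved from the previous import...
old = {"MIT": {"name": "MIT License"}, "Zlib": {"name": "zlib License"}}
stored = {lic_id: fingerprint(entry) for lic_id, entry in old.items()}

# ...and the entries of a newer release.
new = {
    "MIT": {"name": "MIT License"},
    "Zlib": {"name": "zlib/libpng License"},
    "0BSD": {"name": "BSD Zero Clause License"},
}
print(changed_ids(stored, new))  # ['0BSD', 'Zlib']
```

Each upstream entry is hashed once and looked up once, which is where the single-pass O(n) estimate comes from. Note that this sketch only detects additions and modifications; entries that disappeared upstream need a separate check.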

Chinmay-Gurjar commented 5 years ago

There is one more easy and efficient way out. We could write a script that clones the repository from https://github.com/spdx/license-list-data and uses the "git diff" command to find added and updated files, then appends those to the csv file. This would be more efficient than the method proposed above, because there we would be comparing each entry for hashing, which will eventually be O(n*n), and only the writing part will be O(n). Please share your thoughts @zvr @gopuvenkat @shivanshu1333
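A rough sketch of the "git diff" idea, assuming the script runs `git diff --name-status <old> <new> -- json/details` inside a clone of license-list-data and parses the tab-separated output. The `parse_name_status` helper, the sample output and the tag names in the comment are assumptions made for illustration.

```python
def parse_name_status(output):
    """Group paths from `git diff --name-status` output by change type."""
    changes = {"added": [], "modified": [], "deleted": []}
    kinds = {"A": "added", "M": "modified", "D": "deleted"}
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        kind = kinds.get(status[:1])  # other statuses (e.g. renames) are skipped
        if kind and paths:
            changes[kind].append(paths[-1])
    return changes


# Hand-written sample of what a command such as
#   git diff --name-status v3.3 v3.4 -- json/details
# might print (the tag names and file names are assumed).
SAMPLE = (
    "A\tjson/details/0BSD.json\n"
    "M\tjson/details/MIT.json\n"
    "D\tjson/details/Foo.json\n"
)

changes = parse_name_status(SAMPLE)
print(changes)
```

The output reports deleted files as well as added and modified ones, so a script built on it would not have to be limited to appending.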

shivanshuraj1333 commented 5 years ago

@Chinmay-Gurjar Thanks for your efforts!

1. There is no need to clone the complete repository; the GitHub API can easily be used to track any new commit to the JSON file.
2. There may be cases where a few entries are removed and a few are modified, so just appending changes will not keep the csv file in sync with the repository's JSON file.
3. Reading will be O(n) if we use hashing and writing will be O(k), where n is the total number of entries and k is the number of modified entries.

@Chinmay-Gurjar For further doubts you can reach me at https://clio.zulipchat.com/#narrow/stream/121073-general/topic/GSoC2019

@zvr @gopuvenkat please review this possible solution, so that I can work on this issue and make a pull request before the GSoC 2019 application period. Thanks!

zvr commented 5 years ago

A couple of points:

There is no need to track commits; SPDX license list releases happen every quarter or so.
Yes, the usual case is that licenses get added. However, we also have modifications and deletions (deprecations).
No need to prematurely optimize something that will be run once every 3 months.

I think this ticket has evolved into how do we keep adding new license list versions; I'll edit the title to reflect that.
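The three cases mentioned above (additions, modifications, deletions) can be told apart with plain set operations on the license identifiers. The sketch below is only an illustration: the `classify` helper and the sample mappings are hypothetical, and it makes no attempt at optimization, in line with the once-a-quarter cadence.

```python
def classify(current, upstream):
    """Compare two {license_id: fields} mappings.

    Returns (added, modified, removed) as sorted lists of IDs.
    """
    added = sorted(upstream.keys() - current.keys())
    removed = sorted(current.keys() - upstream.keys())
    modified = sorted(
        lic_id
        for lic_id in current.keys() & upstream.keys()
        if current[lic_id] != upstream[lic_id]
    )
    return added, modified, removed


# Hypothetical snapshots of what is in the database and what is upstream.
current = {"MIT": {"name": "MIT"}, "Foo": {"name": "Foo License"}}
upstream = {"MIT": {"name": "MIT License"}, "0BSD": {"name": "BSD Zero Clause License"}}

print(classify(current, upstream))  # (['0BSD'], ['MIT'], ['Foo'])
```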

shivanshuraj1333 commented 5 years ago

@zvr and @gopuvenkat I am working on this issue. As a temporary solution I will add an Update button to the existing clio page which, when clicked, fetches and updates the data in the background. Later on I will add a job scheduler to run this job automatically.

zvr commented 5 years ago

Forget the job scheduler; no one will ever want this to run automatically.

But it remains to be decided what to do with the modified license data... what do you propose?

shivanshuraj1333 commented 5 years ago

@zvr Each license has a unique identity (say, the license name). I will use it as a key and check all the other parameters. If any parameter is modified, I will update it in clio's database. If there is no such existing key (i.e. a new entry), I will simply add it to clio's database.
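A minimal sketch of this add-or-update step. It uses an in-memory SQLite database so that it is self-contained (clio itself uses MySQL), and it keys on the SPDX short identifier rather than the name, on the assumption that the identifier is the more stable of the two. The table layout and the `upsert_license` helper are hypothetical.

```python
import sqlite3


def upsert_license(conn, spdx_id, name):
    """Insert the license, or update its name if the ID already exists."""
    row = conn.execute(
        "SELECT name FROM license WHERE spdx_id = ?", (spdx_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO license (spdx_id, name) VALUES (?, ?)", (spdx_id, name)
        )
        return "added"
    if row[0] != name:
        conn.execute(
            "UPDATE license SET name = ? WHERE spdx_id = ?", (name, spdx_id)
        )
        return "updated"
    return "unchanged"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE license (spdx_id TEXT PRIMARY KEY, name TEXT)")

results = [
    upsert_license(conn, "MIT", "MIT"),
    upsert_license(conn, "MIT", "MIT License"),
    upsert_license(conn, "MIT", "MIT License"),
]
print(results)  # ['added', 'updated', 'unchanged']
```

Because existing rows are updated in place, running the step again on the same data changes nothing, which avoids the problem of re-running the initial populate step against a database that is already in use.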

shivanshuraj1333 commented 5 years ago

Is an update button on the license page of clio fine for accomplishing this?

zvr commented 5 years ago

@shivanshu1333 and what about deleted license identifiers?

shivanshuraj1333 commented 5 years ago

@zvr I will maintain a dict in my script which hashes the current data from the database along with two additional keys, "Boolean" and "PK" (the primary key of the database table entry). While reading the JSON file, for every hit (i.e. the license is found in the dict) it will update other fields if necessary and set the Boolean field to "True". So whenever there is a deletion, the Boolean field will remain "False" and we will delete that entry from our dict. The PK field will be used to track back and update entries in our database.
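Read literally, the proposal amounts to something like the sketch below. It is a reconstruction under assumptions, not the author's script: the `sync_with_seen_flags` helper, the row layout `(pk, license_id, name)` and the field name `seen` (standing in for "Boolean") are all hypothetical.

```python
def sync_with_seen_flags(db_rows, upstream):
    """db_rows: iterable of (pk, license_id, name); upstream: {license_id: name}.

    Returns (to_update, to_insert, unseen_pks).
    """
    # Temporary index of the current database contents, one flag per row.
    index = {
        lic_id: {"pk": pk, "name": name, "seen": False}
        for pk, lic_id, name in db_rows
    }
    to_update, to_insert = [], []
    for lic_id, name in upstream.items():
        record = index.get(lic_id)
        if record is None:
            to_insert.append((lic_id, name))  # new upstream license
            continue
        record["seen"] = True  # a "hit"
        if record["name"] != name:
            to_update.append((record["pk"], name))
    # Rows never hit are no longer present upstream.
    unseen_pks = sorted(r["pk"] for r in index.values() if not r["seen"])
    return to_update, to_insert, unseen_pks


rows = [(1, "MIT", "MIT"), (2, "Foo", "Foo License")]
upstream = {"MIT": "MIT License", "0BSD": "BSD Zero Clause License"}
print(sync_with_seen_flags(rows, upstream))
```

The index lives only for the duration of one run, so nothing in it survives a restart; what happens to the unseen rows still has to be decided and recorded in the database itself.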

zvr commented 5 years ago

So basically you will be using your own "database" (in the form of a dict) to store this information. This is not correct; what will happen when the program stops and starts again? (this info will be lost)

shivanshuraj1333 commented 5 years ago

@zvr No,

The dict will be used to avoid direct interaction with the database.
It will be used temporarily to track changes.
All the data from the dict will be written to the MySQL database according to the PK (primary key of the database) and Boolean fields. NOTE: T indicates modified/new entries and F indicates deleted entries.

Please refer to this rough block diagram for better understanding. The dict will be created and destroyed each time the script that updates the database runs.

shivanshuraj1333 commented 5 years ago

@zvr I have almost completed the base script to accomplish this.
The only thing left is optimisation. What do you think about the method described above?
With the proposed method the website will never be down.

zvr commented 5 years ago

I still don't see why you need the dict... Since you are going to be processing the licenses one by one, why don't you update the database for each one? It seems you want to process the data, keep the "results" in a dict and then apply them to the database. Why?

The most important issue, though, is what to do with deleted licenses...
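One possible shape for the per-license approach, offered as a sketch rather than as the resolution of this ticket. Each license is written straight to the database and stamped with the release being imported; rows left with an older stamp were absent from that release and are flagged instead of deleted, so that components already referring to them stay valid. The `list_version` and `deprecated` columns, the `apply_release` helper and the use of SQLite are all assumptions.

```python
import sqlite3


def apply_release(conn, version, upstream):
    """Write each license directly to the database, stamped with `version`."""
    with conn:  # one transaction: commit on success, roll back on error
        for spdx_id, name in upstream.items():
            exists = conn.execute(
                "SELECT 1 FROM license WHERE spdx_id = ?", (spdx_id,)
            ).fetchone()
            if exists:
                conn.execute(
                    "UPDATE license SET name = ?, list_version = ?, deprecated = 0 "
                    "WHERE spdx_id = ?",
                    (name, version, spdx_id),
                )
            else:
                conn.execute(
                    "INSERT INTO license (spdx_id, name, list_version, deprecated) "
                    "VALUES (?, ?, ?, 0)",
                    (spdx_id, name, version),
                )
        # Anything not stamped with this release was missing from it.
        conn.execute(
            "UPDATE license SET deprecated = 1 WHERE list_version <> ?", (version,)
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE license (spdx_id TEXT PRIMARY KEY, name TEXT, "
    "list_version TEXT, deprecated INTEGER DEFAULT 0)"
)
apply_release(conn, "3.3", {"MIT": "MIT License", "Foo": "Foo License"})
apply_release(conn, "3.4", {"MIT": "MIT License", "0BSD": "BSD Zero Clause License"})

rows = conn.execute(
    "SELECT spdx_id, deprecated FROM license ORDER BY spdx_id"
).fetchall()
print(rows)  # [('0BSD', 0), ('Foo', 1), ('MIT', 0)]
```

No intermediate structure is needed, and because the whole import is a single transaction, an interrupted run leaves the previous state untouched.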

shivanshuraj1333 commented 5 years ago

It seems you want to process the data, keep the "results" in a dict and then apply them to the database. Why?

Because it will help to track deleted/modified licenses. The dict will contain an additional Boolean field. While processing licenses from https://github.com/spdx/license-list-data, if a license has been deleted its Boolean field will be FALSE, else TRUE. Only licenses with a TRUE Boolean field will be populated in the database. Also, it will avoid direct interaction with the database, which is a good practice.

shivanshuraj1333 commented 5 years ago

Okay, let's skip using the dict. I found a more efficient way after discussion (fewer resources used and higher throughput).