IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

New transformer for license and copyright header removal #332

Closed by ykalathiya 2 months ago

ykalathiya commented 3 months ago

Why are these changes needed?

It's a new transform module that removes license and copyright headers from the input code data. This transform module depends on [scancode-toolkit](https://pypi.org/project/scancode-toolkit).
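To illustrate the idea only, here is a minimal sketch that uses the Python standard library instead of scancode-toolkit. The function name `strip_legal_header`, the keyword pattern, and the comment-prefix list are all assumptions made for this example and are not taken from the PR's code, which relies on scancode-toolkit for detection.

```python
import re

# Hypothetical keyword pattern for this sketch; the PR itself uses scancode-toolkit.
LEGAL_PATTERN = re.compile(r"copyright|licen[sc]e", re.IGNORECASE)
# Hypothetical list of line prefixes treated as comment markers.
COMMENT_PREFIXES = ("#", "//", "/*", "*", "*/", "--", ";")


def strip_legal_header(source: str) -> str:
    """Drop the leading comment block if it mentions a license or copyright."""
    lines = source.splitlines(keepends=True)
    # Keep a shebang line in place, since it is not part of the legal header.
    start = 1 if lines and lines[0].startswith("#!") else 0
    end = start
    # Walk the leading run of comment lines and blank lines.
    while end < len(lines):
        stripped = lines[end].strip()
        if stripped and not stripped.startswith(COMMENT_PREFIXES):
            break
        end += 1
    header = "".join(lines[start:end])
    if LEGAL_PATTERN.search(header):
        return "".join(lines[:start] + lines[end:])
    # No legal keywords in the leading comments: return the input unchanged.
    return source


sample = "# Copyright 2024 Example Corp\n# Licensed under the Apache License 2.0\n\nprint('hi')\n"
print(strip_legal_header(sample))
```

The prefix check is deliberately crude: a block comment whose inner lines carry no leading `*` ends the header scan early, and a code line that happens to start with `*` is treated as a comment. A real detector such as scancode-toolkit avoids these shortcuts.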

Related issue number (if any).

Closes: #63

daw3rd commented 3 months ago

You need a license_copyright_remove/README.md

Can we find a shorter name? Maybe legal_removal, common_removal?

Excellent! Thanks.

Bytes-Explorer commented 3 months ago

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?

@ykalathiya Pls suggest a name that calls out the functionality that this module will do.

ykalathiya commented 3 months ago

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.

legal_clean or code_clean?

ykalathiya commented 3 months ago

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.

Or is legal_remove good?

Bytes-Explorer commented 3 months ago

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or is legal_remove good?

The functionality is to remove the copyright headers, right? Or are we missing anything?

Bytes-Explorer commented 3 months ago

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or is legal_remove good?
>
> The functionality is to remove the copyright headers, right? Or are we missing anything?

legal_removal is too broad and could be misleading. Same with code_clean. Please suggest something so that one knows what the functionality of this module is.

ykalathiya commented 3 months ago

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or is legal_remove good?
>
> The functionality is to remove the copyright headers, right? Or are we missing anything?

The functionality is to remove license and copyright headers.

ykalathiya commented 3 months ago

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or is legal_remove good?
>
> The functionality is to remove the copyright headers, right? Or are we missing anything?
>
> legal_removal is too broad and could be misleading. Same with code_clean. Please suggest something so that one knows what the functionality of this module is.

Then we have to use license or copyright in the name, which makes the name longer, or we can go with only one functionality, like license_cleaner.

Param-S commented 3 months ago

Can we name this module "header_cleanser", as it specifically looks at the copyright info at the beginning of the file?

ykalathiya commented 3 months ago

> Can we name this module "header_cleanser", as it specifically looks at the copyright info at the beginning of the file?

OK, that's a good name. I'll change the module name.

daw3rd commented 3 months ago

Also, can you please run make conventions in each of the ray and python directories to be sure there are no MUSTs?

Bytes-Explorer commented 3 months ago

> Can we name this module "header_cleanser", as it specifically looks at the copyright info at the beginning of the file?

I am fine with that

Bytes-Explorer commented 3 months ago

@ykalathiya How many languages will this work for? Have we done any testing?

ykalathiya commented 3 months ago

> @ykalathiya How many languages will this work for? Have we done any testing?

I used the repo https://github.com/arjuncvinod/Hello-World-in-Different-Languages for testing. I added a license manually to the files, and it was detected in all of them. This is the list of languages (by file extension): [ 'js', 'intercal', 'ejs', 'php', 'cl', 'vhd', 'fs', 'applescript', 'ahk', 'java', 's', 'xml', 'xml', 'sol', 'cbl', 'ts', 'a68', 'ml', 'swift', 'coffee', 'chpl', 'pas', 'jsp', 'asm', 'sc', 'mat', 'txt', 'be', 'go', 'cpp', 'dart', 'bhai', 'erl', 'ps1', 'mojo', 'nut', 'chef', 'BAS', 'pyx', 'css', 'tcl', 'vb', 'py', 'm', 'lua', 'for', 'jl', 'ps', 'f95', 'ts', 'rb', 'rkt', 'sql', 'factor', 'nix', 'e', 'sh', 'pas', 'c', 'factor', 'sh', 'abap', 'js', 'jaksel', 'zig', 'bas', 't', 'bf', 'ex', 'txt', 'asm', 'r', 'hack', 'bas', 'lsp', 'pl', 'kt', 'st', 'pike', 'hs', 'm', 'lol', 'ads', 'v', 'fish', 'sml', 'sh', 'fth', 'jl', 'java', 'sas', 'rs' ]
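A list like the one above contains repeated entries, so a de-duplicated view can be easier to review. The following is a rough sketch of how the distinct extensions in a local checkout of a test repository could be gathered; the helper name `collect_extensions` is hypothetical and this is not necessarily how the list in this comment was produced.

```python
from pathlib import Path


def collect_extensions(root):
    """Return the sorted, de-duplicated file extensions found under root."""
    extensions = {
        path.suffix.lstrip(".")  # ".py" becomes "py"; case is preserved
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix  # skip directories and extension-less files
    }
    return sorted(extensions)
```

Calling `collect_extensions("path/to/checkout")` on a local clone returns each extension once, with differently cased variants such as `BAS` and `bas` kept as separate entries.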

Bytes-Explorer commented 3 months ago

> https://github.com/arjuncvinod/Hello-World-in-Different-Languages

Fantastic! Thank you!

ykalathiya commented 3 months ago

> Also, can you please run make conventions in each of the ray and python directories to be sure there are no MUSTs?

make conventions ran successfully in both the python and ray directories.

daw3rd commented 3 months ago

Could you please merge dev into this branch, given some recent issues with versioning? See https://github.com/IBM/data-prep-kit/issues/355

ykalathiya commented 3 months ago

I have added the kfp_ray directory and updated the README file accordingly.

Param-S commented 2 months ago

@daw3rd I see that all the review comments have been addressed now. Can you please check and merge this PR?