aboutcode-org / scancode-server

This project is no longer maintained. Visit https://github.com/nexB/scancode.io/ instead for a similar, current project.

API to scan special URL like Github URL #52

Open RajuKoushik opened 7 years ago

RajuKoushik commented 7 years ago

Create POST API endpoint to ScanCode given GitHub or BitBucket or Git URL.

singh1114 commented 7 years ago

Are you working on this?

RajuKoushik commented 7 years ago

Yeah @singh1114

pombredanne commented 7 years ago

@RajuKoushik can you be more explicit and describe more what you mean by this ticket?

RajuKoushik commented 7 years ago

@pombredanne This is no longer a separate issue. As discussed, we planned to have a single view that accepts any kind of URL (even the special git repo URLs). I have written an API that takes requests with both URL types and returns the scan results as a response. https://github.com/nexB/scancode-server/pull/61

pombredanne commented 7 years ago

OK, but I need to understand what this is about. What is the problem and the solution... what is the thing you want to achieve here ;)

Your initial description is kinda terse and does not mean much... e.g.

Create POST API endpoint to ScanCode given GitHub or BitBucket or Git URL.

When you create a ticket, try to be explanatory and descriptive. I am not sure what you are after, and you need to write exactly what you have in mind. Assuming I get some of what this is about, maybe something similar to this would be better:

When a user requests a scan with a URL to a GitHub, Bitbucket, GitLab or similar repository, we can handle this in a few different ways:

  1. Treat this as a regular URL, in which case we may end up downloading and scanning a web page about a repository and not the code itself
  2. Or recognize that this is a special type of URL and be smart about what to download and then scan

Option 2 seems a much better approach, as in most cases the user is likely to want the code in the repo to be scanned rather than the HTML page of a repo as rendered by GitHub or Bitbucket.

There are a couple of considerations to get this right:

  • we need to recognize these URLs.
    • In some cases they may point to a certain file, commit or branch. This should be detected properly to determine which files or branch to fetch.
    • In some other cases they may point to a direct zip or tarball download or a "raw" blob. In these cases they should likely be fetched as-is.
  • once the URL has been recognized and deconstructed, there are different ways things could be fetched:
    • we could use a git clone (or hg clone in some cases for Bitbucket)
    • we could use a tarball or zip download
  • then what to fetch needs to be determined when it is not specified:
    • we can scan the head of the default branch
    • we can scan tags/releases/branches, all or some of them (though scanning them all does not make sense and we likely only want to scan one thing only)
  • since the URL or the way things are fetched may not match what the user entered as a URL, there may be a need to store a reference to what was effectively fetched.
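
The recognition and fetch steps listed above could be sketched roughly as follows. This is only an illustration, not code from scancode-server: `RepoURL`, `classify_url` and `clone_command` are hypothetical names, only the GitHub `/owner/repo/(tree|blob|commit)/ref/path` layout is deconstructed, and the choice of a shallow clone is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse

# Hypothetical host and suffix lists; a real implementation would need more.
KNOWN_HOSTS = {"github.com", "bitbucket.org", "gitlab.com"}
ARCHIVE_SUFFIXES = (".zip", ".tar.gz", ".tgz", ".tar.bz2")


@dataclass
class RepoURL:
    """What was recognized in a user-supplied URL (hypothetical structure)."""
    kind: str                   # "repo", "archive", "raw" or "plain"
    host: str
    owner: Optional[str] = None
    repo: Optional[str] = None
    ref: Optional[str] = None   # branch, tag or commit, when present in the URL
    path: Optional[str] = None  # file or directory inside the repo, if any


def classify_url(url: str) -> RepoURL:
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    segments = [s for s in parsed.path.split("/") if s]

    # "Raw" blobs and direct archive downloads are fetched as-is.
    if host == "raw.githubusercontent.com":
        return RepoURL("raw", host)
    if parsed.path.lower().endswith(ARCHIVE_SUFFIXES):
        return RepoURL("archive", host)

    # Anything else that is not /owner/repo on a known host is a regular URL.
    if host not in KNOWN_HOSTS or len(segments) < 2:
        return RepoURL("plain", host)

    owner, repo = segments[0], segments[1]
    if repo.endswith(".git"):
        repo = repo[:-4]

    ref = path = None
    # Only the GitHub layout is handled here. Branch names containing "/"
    # are ambiguous in this layout and are not resolved by this sketch.
    if host == "github.com" and len(segments) >= 4:
        if segments[2] == "raw":
            return RepoURL("raw", host, owner, repo, segments[3])
        if segments[2] in ("tree", "blob", "commit"):
            ref = segments[3]
            path = "/".join(segments[4:]) or None
    return RepoURL("repo", host, owner, repo, ref, path)


def clone_command(info: RepoURL, dest: str) -> List[str]:
    """Build (but do not run) a shallow git clone command for a "repo" URL."""
    clone_url = f"https://{info.host}/{info.owner}/{info.repo}.git"
    cmd = ["git", "clone", "--depth", "1"]
    if info.ref:
        # --branch accepts branch and tag names, not arbitrary commit hashes.
        cmd += ["--branch", info.ref]
    return cmd + [clone_url, dest]
```

Storing the returned `RepoURL` (or the final clone URL and ref) next to the scan would be one way to keep a reference to what was effectively fetched, since it may differ from the URL the user entered.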

When you create a ticket, these are the kind of details that you should specify... otherwise there is not much that can be discussed.

RajuKoushik commented 7 years ago

@pombredanne Present status of the URLScan view as in pull request #61:

  • Fetches and clones the git repository into the home directory. (This has to occur in the background, which still has to be changed.)
    apply_scan_async.delay(path, scan_id, scan_type, URL, folder_name)
    The fetched repo is then scanned from its path in the background, and the response is redirected to the scan results.
  • The files are stored in a directory, which is then scanned in the background.
    scan_code_async.delay(URL, scan_id, path)
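
One way to move the fetch itself into the background, together with the scan, is sketched below. It is not the code from the pull request: the `.delay(...)` calls above suggest a task queue, while this sketch uses a standard-library thread pool as a stand-in, and `fetch_and_scan`, `submit_scan` and the in-memory `results` dict are hypothetical names. The `fetch` and `scan` callables are passed in so the sketch does not assume how cloning or scanning is actually done.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict

# In-memory stand-in for wherever scan status and results are really stored.
results: Dict[int, Dict[str, Any]] = {}

# Stand-in for a task queue worker pool.
executor = ThreadPoolExecutor(max_workers=2)


def fetch_and_scan(
    scan_id: int,
    url: str,
    dest: str,
    fetch: Callable[[str, str], None],
    scan: Callable[[str], Any],
) -> None:
    """Run the fetch and then the scan, recording progress under scan_id."""
    record = results.setdefault(scan_id, {"url": url})
    try:
        record["status"] = "fetching"
        fetch(url, dest)              # e.g. a git clone or an archive download
        record["status"] = "scanning"
        record["scan"] = scan(dest)   # e.g. a scan of the fetched directory
        record["status"] = "done"
    except Exception as exc:          # keep the failure visible to the caller
        record["status"] = "failed"
        record["error"] = str(exc)


def submit_scan(
    scan_id: int,
    url: str,
    dest: str,
    fetch: Callable[[str, str], None],
    scan: Callable[[str], Any],
) -> Future:
    """Queue the whole job and return immediately, so the view does not block."""
    return executor.submit(fetch_and_scan, scan_id, url, dest, fetch, scan)
```

With this shape the view only has to call `submit_scan(...)` and redirect to the results page, which can then read the status stored under `scan_id`.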

Things to be worked on -