LLNL / scraper

Python library for getting metadata from source code hosting tools
MIT License

Reading metadata from additional file #45

Open StephenQuirolgico opened 4 years ago

StephenQuirolgico commented 4 years ago

@IanLee1521 - Can't recall if this was already requested elsewhere, but is it possible to enhance the scraper to also read metadata from an additional file in a repo? The rationale would be to allow developers to have more control over the metadata that is provided, and to provide metadata that may not be scraped by the scraper.

leebrian commented 4 years ago

I think it would be helpful to read a code.json file in the root of the repo. During the GSA calls, at least two programs said they did something similar. I would like to bring this up on a GSA call and have them put out some guidance on code.gov to help shape the implementation here.

The local process we use on top of scraper is to read a code.json and use its values to override the project settings in the combined agency code.json. It's a bit of a hack, but it lets me use the exact same schema. We do this on the openCDC repo.
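A minimal sketch of what such an override step could look like. This is not the actual openCDC code. The helper names are hypothetical, and it assumes the combined agency file keeps its projects in a top-level "releases" list and that projects are matched on "name".

```python
import json
from pathlib import Path


def load_repo_metadata(repo_dir):
    """Return the parsed code.json from a repo checkout, or None if absent."""
    path = Path(repo_dir) / "code.json"
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def override_project(agency, repo_metadata, key="name"):
    """Overwrite top-level fields of the matching project, in place.

    Assumes projects live under agency["releases"] (an assumption about
    the combined file) and are identified by `key`.
    Returns True if a matching project was found.
    """
    for project in agency.get("releases", []):
        if key in repo_metadata and project.get(key) == repo_metadata[key]:
            # Values from the repo's own file win over the scraped ones.
            project.update(repo_metadata)
            return True
    return False
```

Here each top-level field from the repo's file replaces the scraped one wholesale; a repo without a code.json is simply left as scraped.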

IanLee1521 commented 4 years ago

Certainly doable. I believe this was last on @jcastle's plate, as there was to be a discussion in the bi-weekly calls (or other spin-off calls) to figure out the best way to implement this (and, e.g., what to name the file).

jcastle-zz commented 4 years ago

Let's add this to the metadata brainstorm. Will send out an invite for that discussion to begin next week.

IanLee1521 commented 4 years ago

I will wait for the official answer from @jcastle / Amin but I propose that we name the file .code_gov.json and that it should have the same format as the “repository” object in the metadata schema (currently called “release”).

If the file is present, any field in it that matches a field coming from the API will replace the scraped value. Example from gsa.gov/code.json, where all the values are explicitly in the file:

{
  "contact": {
    "URL": "https://github.com/18F",
    "email": "18f@gsa.gov"
  },
  "date": {
    "created": "2013-07-17",
    "lastModified": "2019-05-02"
  },
  "description": "A hosted, shared-service that provides an API key, analytics, and proxy solution for government web services.",
  "downloadURL": "https://api.github.com/repos/18F/api.data.gov/downloads",
  "homepageURL": "https://github.com/18F/api.data.gov",
  "laborHours": 1216,
  "languages": [
    "HTML",
    "Ruby",
    "CSS",
    "JavaScript"
  ],
  "name": "api.data.gov",
  "organization": "18F",
  "permissions": {
    "licenses": [
      {
        "name": "NOASSERTION"
      }
    ],
    "usageType": "openSource"
  },
  "repositoryURL": "https://github.com/18F/api.data.gov",
  "status": "Development",
  "tags": [
    "github"
  ],
  "vcs": "git"
}

Example where only a couple fields (tags and contact:email) are overridden:

{
  "contact": {
    "email": "jcastle@gsa.gov"
  },
  "tags": [
    "github",
    "code_gov"
  ]
}
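To make the proposed override behavior concrete, here is a rough sketch of how the merge might work. It is not scraper's implementation. The function name `apply_overrides` is hypothetical, and it assumes nested objects such as "contact" are merged key by key while lists such as "tags" are replaced wholesale; neither rule has been decided in this thread.

```python
import copy


def apply_overrides(scraped, overrides):
    """Return a copy of `scraped` with values from `overrides` layered on top.

    Assumption: dicts merge recursively; everything else (including lists)
    is replaced by the value from the override file.
    """
    merged = copy.deepcopy(scraped)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Values as they might come from the API (trimmed from the example above).
scraped = {
    "contact": {"URL": "https://github.com/18F", "email": "18f@gsa.gov"},
    "name": "api.data.gov",
    "tags": ["github"],
}

# Contents of the repo's override file (the second example above).
overrides = {
    "contact": {"email": "jcastle@gsa.gov"},
    "tags": ["github", "code_gov"],
}

result = apply_overrides(scraped, overrides)
```

Under these assumptions the contact email and the tags come from the file, while the contact URL and the name keep their scraped values.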

What do you all think of that?

jcastle-zz commented 4 years ago

@JosephAmalfitanoSSA, @aminPIC