GSA / code-gov-api

API powering the code.gov source code harvester
http://code.gov
Other

API Code.JSON Harvesting Process Repository Count Discrepancy #310

Closed RicardoAReyes closed 5 years ago

RicardoAReyes commented 5 years ago

The Department of Energy (DOE) generates a code.json inventory file on a daily basis. URL: http://energy.gov/code.json

As of January 30th, the DOE code.json file lists 1466 repositories: 56 are exempt by law and 1410 are openSource. The code.gov website should therefore list 1410 repositories for DOE.

https://code.gov/browse-projects?&agencies=doe&page=1&size=10&sort=data_quality

However, the code.gov website shows only 1390. [Screenshot, 2019-01-30 2:34 PM]

Discrepancy: 20 repositories. DOE and I have not been able to identify the 20 repos that are unaccounted for on the code.gov website.

Furthermore, I have visually inspected the 4,398 date properties that are empty on the code.gov API: https://api.code.gov/status/DoE/issues

All of the date property issues are warnings about optional date fields (created, lastModified, metadataLastUpdated) that are empty. They do not affect the harvesting process.

{
  "keyword": "format",
  "dataPath": ".date.created",
  "schemaPath": "#/properties/date/properties/created/format",
  "params": {
    "format": "date"
  },
  "message": "should match format \"date\""
}

{
  "keyword": "type",
  "dataPath": ".date.lastModified",
  "schemaPath": "#/properties/date/properties/lastModified/type",
  "params": {
    "type": "string"
  },
  "message": "should be string"
}

{
  "keyword": "format",
  "dataPath": ".date.metadataLastUpdated",
  "schemaPath": "#/properties/date/properties/metadataLastUpdated/format",
  "params": {
    "format": "date"
  },
  "message": "should match format \"date\""
}

To Reproduce

Steps to reproduce the behavior:

  1. Go to http://energy.gov/code.json
  2. Copy the code.json raw data into the validator.
  3. Count the repositories there whose usage type is openSource or GovernmentWideReuse.
  4. Go to https://code.gov/browse-projects?&agencies=doe&page=1&size=10&sort=data_quality
  5. Notice the repository count for DOE does not match the expected count from the DOE code.json file.
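
The count in step 3 can also be done with a short script instead of by eye. The following is a minimal Python sketch, not part of the harvester. It assumes the code.json file follows a layout with a top-level `releases` array in which each entry carries `permissions.usageType`; the function names and the sample data are hypothetical.

```python
from collections import Counter

# Assumption: only these usage types are listed on the site (per step 3).
LISTED_TYPES = ("openSource", "governmentWideReuse")


def count_by_usage_type(inventory):
    """Tally releases by permissions.usageType ("unknown" if missing)."""
    counts = Counter()
    for release in inventory.get("releases", []):
        permissions = release.get("permissions") or {}
        counts[permissions.get("usageType") or "unknown"] += 1
    return counts


def expected_listed(counts):
    """Number of releases expected to appear on the site."""
    return sum(counts[usage] for usage in LISTED_TYPES)


# Made-up inventory for illustration only; this is not DOE data.
sample = {
    "releases": [
        {"name": "a", "permissions": {"usageType": "openSource"}},
        {"name": "b", "permissions": {"usageType": "governmentWideReuse"}},
        {"name": "c", "permissions": {"usageType": "exemptByLaw"}},
    ]
}
counts = count_by_usage_type(sample)
print(dict(counts), expected_listed(counts))
```

To run it against a real inventory, load the downloaded file with `json.load` and pass the resulting dictionary to `count_by_usage_type`.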

Furthermore, I ran the Code.gov harvester locally, and it does index the expected repository count, 1466. [Screenshot: code-api-harvester]

Perhaps the issue is not with the API but with the Elasticsearch solution, or a combination of both? The front end makes an API call to request the repo count by agency and the list of repository items.

RicardoAReyes commented 5 years ago

More repo count discrepancies that could be API/Elastic Search related.


The DOJ code.json has 55 repositories (15 openSource, 16 GovernmentWideReuse, 24 exempt), so the expected repository count is 31. https://www.justice.gov/code.json

The code.gov website shows 13 repositories, all of them openSource. The list does not show the 16 GovernmentWideReuse repositories.


RicardoAReyes commented 5 years ago

We would like to check for "duplicate" repositories for DOL, NASA, and GSA, to see whether their code.json software inventories have discrepancies.


DOL (expected 68 repos, 60 openSource - 8 governmentWideReuse) https://www.dol.gov/code.json

Code.gov website shows 67 repositories.

Tasks - validate duplicate repo or identify discrepancy.


NASA (expected 1010 repos, 295 openSource - 715 governmentWideReuse) https://code.nasa.gov/code.json

Code.gov website shows 1003 repositories.


GSA (expected 1663 repos, 1649 openSource - 14 governmentWideReuse) https://code.nasa.gov/code.json

Code.gov website shows 1661 repositories.


vowelllDOE commented 5 years ago

@jcastle @RicardoAReyes any more thoughts on this one?

jcastle-zz commented 5 years ago

Hi @vowelllDOE, still working on it. We lost our back-end dev during the furlough and have a replacement on board getting up to speed. We should be able to get this resolved by the end of next week. Thanks for your patience!

vowelllDOE commented 5 years ago

@jcastle @RicardoAReyes any updates on this yet?

RicardoAReyes commented 5 years ago

Hi @vowelllDOE,

As @jcastle explained, we now have a new script to help us identify the duplicate repositories. We can easily determine the reason for the discrepancies: our API/Harvester does not allow duplicate repo titles and/or duplicate repositoryURLs.

DOE has 140 repos that have been flagged as duplicates and are therefore indexed only once in Elasticsearch.
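
For illustration, the rule described above (a repo counts as a duplicate when its title or its repositoryURL has already been seen) can be sketched in Python as follows. This is not the script used to produce the reports. The function name, the exact-match comparison, and the sample data are assumptions, and the field names `name` and `repositoryURL` are assumed to hold the title and URL.

```python
def find_duplicates(releases):
    """Return (position, first_position) pairs for releases whose name or
    repositoryURL matches an earlier release in the list."""
    seen_names = {}
    seen_urls = {}
    duplicates = []
    for position, release in enumerate(releases):
        name = (release.get("name") or "").strip()
        url = (release.get("repositoryURL") or "").strip()

        first = None
        if name and name in seen_names:
            first = seen_names[name]
        elif url and url in seen_urls:
            first = seen_urls[url]
        if first is not None:
            duplicates.append((position, first))

        # Remember the first position at which each name and URL appeared.
        if name:
            seen_names.setdefault(name, position)
        if url:
            seen_urls.setdefault(url, position)
    return duplicates


# Made-up releases for illustration only.
sample_releases = [
    {"name": "tool-a", "repositoryURL": "https://example.gov/a"},
    {"name": "tool-b", "repositoryURL": "https://example.gov/b"},
    {"name": "tool-a", "repositoryURL": "https://example.gov/a2"},  # same name as 0
    {"name": "tool-c", "repositoryURL": "https://example.gov/b"},   # same URL as 1
]
print(find_duplicates(sample_releases))
```

Under this rule, a release is dropped even when only one of the two fields matches, which is why differently named packages that share a URL would be collapsed into a single entry.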

[Screenshots, 2019-02-22 10:57 AM and 10:58 AM]

Full reports: DOE-DuplicateRepos.zip

@ianlee1521 Do you think that the scraper may be duplicating repos from the same organization?

Happy to discuss the reports over a phone call today.

vowelllDOE commented 5 years ago

That is an unfortunate side effect for DOE and how we collect data from our distributed environment of users. I fear this will mean that our count is never truly accurate in Code.gov.

For context, Ian's "scraper" has nothing to do with how we ultimately collect data from DOE sites and grantees.

IanLee1521 commented 5 years ago

I don't think this would be Scraper, as the source data from DOE comes from the OSTI backend...

As far as duplicates go, that first example (positions 79 and 6) doesn't actually have the same name / usage type, so it doesn't look like a duplicate.

vowelllDOE commented 5 years ago

Agree with @IanLee1521 on both counts. I would like to discuss this with Code.gov and see if there is a way this could be resolved.

RicardoAReyes commented 5 years ago

That's true, we do need to check for "usageType" which is something that we can improve on the manual script.

vowelllDOE commented 5 years ago

I also think that, at minimum, duplicate detection should be based on the combination of name, URL, and usageType. We have many labs that host a landing page for multiple open source software packages, so those packages share the same URL, which provides information on how to obtain the software.
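
A minimal Python sketch of that suggestion, under the assumption that each release exposes `name`, `repositoryURL`, and `permissions.usageType`: a release is treated as a duplicate only when all three values match an earlier one. The function name and the sample data are hypothetical, and this is not the Code.gov implementation.

```python
def dedupe_by_composite_key(releases):
    """Keep the first release for each (name, repositoryURL, usageType) key."""
    seen = set()
    unique = []
    for release in releases:
        permissions = release.get("permissions") or {}
        key = (
            release.get("name"),
            release.get("repositoryURL"),
            permissions.get("usageType"),
        )
        if key in seen:
            continue  # all three fields match an earlier release
        seen.add(key)
        unique.append(release)
    return unique


# Made-up releases: several packages share one landing-page URL.
LANDING = "https://lab.example.gov/software"
sample_releases = [
    {"name": "pkg-1", "repositoryURL": LANDING, "permissions": {"usageType": "openSource"}},
    {"name": "pkg-2", "repositoryURL": LANDING, "permissions": {"usageType": "openSource"}},
    {"name": "pkg-1", "repositoryURL": LANDING, "permissions": {"usageType": "governmentWideReuse"}},
    {"name": "pkg-1", "repositoryURL": LANDING, "permissions": {"usageType": "openSource"}},
]
unique = dedupe_by_composite_key(sample_releases)
print([(r["name"], r["permissions"]["usageType"]) for r in unique])
```

With this key, the two packages that share a landing page are both kept, and only the last entry, which repeats the first in all three fields, is dropped.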

RicardoAReyes commented 5 years ago

@vowelllDOE I agree with your statement, we need to account for "usageType". Going to review the script w/Engineering team to get updated report.

bjbhatt commented 5 years ago

This issue is resolved and should be closed @RicardoAReyes @saracope