RicardoAReyes closed this issue 5 years ago
More repo count discrepancies that could be API/Elasticsearch related.
The DOJ code.json has 55 repositories (15 openSource, 16 governmentWideReuse, 24 exempt), so code.gov is expected to show a count of 31 repositories. https://www.justice.gov/code.json
The code.gov website shows 13 repositories, all openSource; the list does not include the 16 governmentWideReuse repositories.
We would like to validate the "duplicate" repositories for DOL, NASA, and GSA and determine whether their code.json software inventories have discrepancies.
DOL (expected 68 repos, 60 openSource - 8 governmentWideReuse) https://www.dol.gov/code.json
Code.gov website shows 67 repositories.
Tasks - validate duplicate repo or identify discrepancy.
NASA (expected 1010 repos, 295 openSource - 715 governmentWideReuse) https://code.nasa.gov/code.json
Code.gov website shows 1003 repositories.
GSA (expected 1663 repos, 1649 openSource - 14 governmentWideReuse) https://code.nasa.gov/code.json
Code.gov website shows 1661 repositories.
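The expected counts above can be reproduced from a parsed code.json with a few lines of Python. This is only a minimal sketch: it assumes each entry under `releases` carries its usage type at `permissions.usageType`, and that only `openSource` and `governmentWideReuse` entries are expected to be listed on code.gov, as the counts in this comment imply. The helper names and the sample inventory are made up for illustration.

```python
from collections import Counter

# Assumption: these are the usage types expected to appear on code.gov,
# based on the expected counts quoted in this issue.
LISTED_USAGE_TYPES = {"openSource", "governmentWideReuse"}


def usage_type_counts(inventory):
    """Count releases per permissions.usageType in a parsed code.json dict."""
    return Counter(
        release.get("permissions", {}).get("usageType", "unknown")
        for release in inventory.get("releases", [])
    )


def expected_listed_count(inventory):
    """Number of releases whose usage type should be listed on code.gov."""
    counts = usage_type_counts(inventory)
    return sum(counts[usage_type] for usage_type in LISTED_USAGE_TYPES)


# Tiny made-up inventory, not real agency data.
sample = {
    "releases": [
        {"name": "a", "permissions": {"usageType": "openSource"}},
        {"name": "b", "permissions": {"usageType": "governmentWideReuse"}},
        {"name": "c", "permissions": {"usageType": "exemptByLaw"}},
    ]
}

print(usage_type_counts(sample))
print(expected_listed_count(sample))
```

Running the same helpers against a downloaded agency code.json (loaded with `json.load`) would give the per-type breakdown to compare with what the website reports.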
@jcastle @RicardoAReyes any more thoughts on this one?
Hi @vowelllDOE, still working on it. We lost our back-end dev during the furlough and have a replacement on board who is getting up to speed. We should be able to get this resolved by the end of next week. Thanks for your patience!
@jcastle @RicardoAReyes any updates on this yet?
Hi @vowelllDOE,
As @jcastle explained, we now have a new script to help us identify the duplicate repositories. The reason for the discrepancies is easy to determine: our API/Harvester does not allow duplicate repo titles and/or duplicate repositoryURLs.
DOE has 140 repos that have been flagged as duplicates and are therefore indexed only once in Elasticsearch.
Full reports: DOE-DuplicateRepos.zip
@ianlee1521 Do you think that the scraper may be duplicating repos from the same organization?
Happy to discuss the reports over a phone call today.
That is an unfortunate side effect for DOE and how we collect data from our distributed environment of users. I fear this will mean that our count is never truly accurate in Code.gov.
And for context and information, Ian's "scraper" has nothing to do with how we ultimately collect data from DOE sites and grantees.
I don't think this would be the scraper, as the source data from DOE comes from the OSTI backend...
As far as duplicates go, the entries in that first example (positions 79 and 6) don't actually have the same name / usage type, so it doesn't look like a duplicate.
Agree with @IanLee1521 on both counts. I would like to discuss this with Code.gov and see if there is a way this could be resolved.
That's true, we do need to check "usageType", which is something we can improve in the manual script.
I also think that, at minimum, duplicate detection should be based on matching names, URLs, and usageTypes together. We have many labs that host a single landing page for multiple open source software packages, so several packages share the same URL, which provides information on how to obtain the software.
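To illustrate the difference between the two rules discussed here, the sketch below groups releases by a configurable key and reports any key that occurs more than once. It is not the Code.gov script or harvester logic; the field names (`name`, `repositoryURL`, `permissions.usageType`) are assumed from the code.json fields mentioned in this thread, and the sample entries are invented.

```python
from collections import defaultdict


def strict_key(release):
    """Proposed rule: a duplicate must match on name, URL, and usageType."""
    return (
        release.get("name"),
        release.get("repositoryURL"),
        release.get("permissions", {}).get("usageType"),
    )


def url_key(release):
    """Looser rule: any two releases sharing a repositoryURL are duplicates."""
    return release.get("repositoryURL")


def find_duplicates(releases, key=strict_key):
    """Map each repeated key to the positions of the releases that share it."""
    positions = defaultdict(list)
    for index, release in enumerate(releases):
        positions[key(release)].append(index)
    return {k: idx for k, idx in positions.items() if len(idx) > 1}


# Invented example: two packages behind one landing page, plus a true duplicate.
releases = [
    {"name": "pkg-one", "repositoryURL": "https://lab.example/software",
     "permissions": {"usageType": "openSource"}},
    {"name": "pkg-two", "repositoryURL": "https://lab.example/software",
     "permissions": {"usageType": "openSource"}},
    {"name": "pkg-two", "repositoryURL": "https://lab.example/software",
     "permissions": {"usageType": "openSource"}},
]

print(find_duplicates(releases, key=url_key))
print(find_duplicates(releases))
```

With the URL-only key, all three entries collapse into one group, so `pkg-one` would be dropped along with the real duplicate. With the three-field key, only the two identical `pkg-two` entries are flagged.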
@vowelllDOE I agree with your statement, we need to account for "usageType". Going to review the script w/Engineering team to get updated report.
This issue is resolved and should be closed @RicardoAReyes @saracope
The Department of Energy (DOE) generates a code.json inventory file on a daily basis. URL: http://energy.gov/code.json
As of January 30th, the DOE code.json file has 1466 repositories, of which 56 are exempt by law and 1410 are openSource. Naturally, the code.gov website should list 1410 repositories for the DOE.
https://code.gov/browse-projects?&agencies=doe&page=1&size=10&sort=data_quality
However, the code.gov website only shows 1390.
Discrepancy: 20 repositories. DOE and I have not been able to identify the 20 repos that are unaccounted for on the code.gov website.
Furthermore, I have visually inspected the 4,398 date properties that are empty on the code.gov API: https://api.code.gov/status/DoE/issues
All of the date property issues are "warning"-level and relate to optional date fields (created, lastModified, metadataLastUpdate) that are empty. This does not impact the harvesting process.
To Reproduce
Furthermore, I have manually executed the Code.gov Harvester locally. The harvester does index the expected repository count, 1466.
Perhaps the issue is not with the API but with the Elasticsearch solution, or a combination of both? The front-end makes an API call to request the repo count by agency and the list of repository items.
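One way to narrow down the 20 unaccounted-for repositories is to compare the names in code.json against the names the website actually lists. The sketch below assumes both lists have already been collected (how the indexed names are fetched is left out, since it depends on the API); the function name and sample data are hypothetical. It uses a multiset difference so that a name appearing twice in the inventory but only once in the index is still reported.

```python
from collections import Counter


def unaccounted(inventory_names, indexed_names):
    """Names present in the inventory more often than in the index."""
    remaining = Counter(inventory_names) - Counter(indexed_names)
    return sorted(remaining.elements())


# Invented names, not real DOE data.
inventory_names = ["alpha", "beta", "beta", "gamma"]
indexed_names = ["alpha", "beta"]

print(unaccounted(inventory_names, indexed_names))
```

Here `beta` is reported once because it appears twice in the inventory but was indexed a single time, and `gamma` is reported because it was never indexed.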