GSA / code-gov-api

API powering the code.gov source code harvester
http://code.gov

Elastic search Indexing should use usageType for unique key or Globally Unique Identifier (GUID) #311

Closed RicardoAReyes closed 5 years ago

RicardoAReyes commented 5 years ago

DOE and SSA have reported an Elasticsearch indexing issue in which distinct repos are treated as duplicates on our platform (false positives).

DOE (@vowelllDOE, @IanLee1521) has multiple repos with the same title from the same organization version. The Code.gov harvesting process generates the unique key for the index from agency + organization + title, so a repo overwrites any existing repo with the same combination.

Current Code.gov Elasticsearch harvesting/indexing process: [screenshot: 2019-02-26 at 12:56:49 pm]

DOE Example: [image: doe-duplicate]

DOE has recommended that Code.gov address this use case. My assumption is that an organization may have one repo that is open source for public consumption and another for internal production use. The agency has reported both repos because they could differ in source code.

Recommendation 1: Introduce the repo "usageType" as part of the Elasticsearch unique key. [screenshot: 2019-02-26 at 1:07:51 pm]

This solution would add the usageType to the code.gov repo URL. It could use either the full usageType name or a short code such as o = Open Source, g = Government Wide Reuse, e = exempt by x.
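
A minimal sketch of what Recommendation 1 could look like, assuming the key is a lowercase, underscore-joined string and that usageType values follow the openSource / governmentWideReuse / exemptBy... naming. The helper names (`slug`, `usage_code`, `repo_key`) and the code mapping are hypothetical, and the second record is invented for illustration; this is not the code-gov-api implementation.

```python
import re

# Assumed short codes; the thread only proposes "o", "g" and "e" as examples.
USAGE_TYPE_CODES = {"openSource": "o", "governmentWideReuse": "g"}


def slug(text: str) -> str:
    """Lowercase the text and turn each run of non-alphanumerics into "_"."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def usage_code(usage_type: str) -> str:
    """Return the short code for a usageType (hypothetical mapping)."""
    if usage_type.startswith("exempt"):
        return "e"  # every "exempt by x" value shares one code in this sketch
    return USAGE_TYPE_CODES[usage_type]  # raises KeyError for unknown values


def repo_key(agency: str, organization: str, name: str, usage_type: str) -> str:
    """Build agency + organization + name + usageType as one key."""
    parts = [slug(agency), slug(organization), slug(name), usage_code(usage_type)]
    return "_".join(parts)


org = "Lawrence Livermore National Laboratory (LLNL)"
public_key = repo_key("DOE", org, "AMPE", "openSource")
internal_key = repo_key("DOE", org, "AMPE", "governmentWideReuse")
print(public_key)    # doe_lawrence_livermore_national_laboratory_llnl_ampe_o
print(internal_key)  # doe_lawrence_livermore_national_laboratory_llnl_ampe_g
```

Two records that share agency, organization, name and usageType would still collide under this scheme, so it only separates repos that differ in usageType.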

Recommendation 2 (long-term solution): @SSAgov (SSA) has recommended using a Globally Unique Identifier (GUID) for each repo. The Code.gov Metadata Schema 2.0.0 should introduce a new property in which agencies enter a GUID for the repository.
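
As a rough illustration of Recommendation 2, the sketch below prefers an agency-supplied GUID and falls back to a composite key when none is present. The property name `guid` and the function `index_key` are hypothetical; the schema does not define such a property in this thread.

```python
import uuid


def index_key(repo: dict, fallback_key: str) -> str:
    """Prefer an agency-supplied GUID; otherwise use the composite key."""
    guid = repo.get("guid")  # hypothetical property name
    if guid:
        # uuid.UUID raises ValueError on malformed input and
        # str() renders the canonical lowercase, hyphenated form.
        return str(uuid.UUID(guid))
    return fallback_key


# An agency would generate the GUID once and keep it stable across harvests.
new_guid = str(uuid.uuid4())

with_guid = index_key({"name": "AMPE", "guid": new_guid}, "doe_llnl_ampe")
without_guid = index_key({"name": "AMPE"}, "doe_llnl_ampe")
```

With a stable GUID, two repos can share a title and organization without overwriting each other, and a repo can be renamed without changing its key.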

Related issue addressing the duplicate repo count with agencies: https://github.com/GSA/code-gov-api/issues/310

@jcastle @saracope @AminPIC thoughts...?

vowelllDOE commented 5 years ago

I think it should also take "name" into consideration for this deduplication. Many sites use a single landing page for access to multiple packages of source code. You can see this in the example above with https://ip.sandia.gov/contact.do. Those two packages are completely different; they just share the same URL.

RicardoAReyes commented 5 years ago

@vowelllDOE the process already takes the repo name (title) into consideration. It is one of the variables used to help identify duplicates from the same organization.

vowelllDOE commented 5 years ago

@RicardoAReyes then I'm confused how the example you have above is a dupe

bjbhatt commented 5 years ago

@RicardoAReyes even after including UsageType for DOE, I'm still finding 18 duplicates (e.g. record at position 371 still matches position 260)
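
A check like the one described above could be sketched as follows, assuming each record in the agency's code.json `releases` list carries `organization`, `name` and `permissions.usageType`. The function names and sample records are invented for illustration; agency is left out of the key because it is constant within one agency's file.

```python
from collections import defaultdict


def composite_key(release: dict, include_usage_type: bool = False) -> tuple:
    """Build a comparison key from organization + name (+ usageType)."""
    parts = [release.get("organization", ""), release.get("name", "")]
    if include_usage_type:
        parts.append(release.get("permissions", {}).get("usageType", ""))
    return tuple(part.strip().lower() for part in parts)


def duplicate_positions(releases: list, include_usage_type: bool = False) -> dict:
    """Map each repeated key to the zero-based positions that share it."""
    groups = defaultdict(list)
    for position, release in enumerate(releases):
        groups[composite_key(release, include_usage_type)].append(position)
    return {key: pos for key, pos in groups.items() if len(pos) > 1}


# Invented sample data, not taken from DOE's file.
releases = [
    {"organization": "Lab A", "name": "Tool", "permissions": {"usageType": "openSource"}},
    {"organization": "Lab A", "name": "Tool", "permissions": {"usageType": "governmentWideReuse"}},
    {"organization": "Lab A", "name": "Tool", "permissions": {"usageType": "openSource"}},
    {"organization": "Lab B", "name": "Tool", "permissions": {"usageType": "openSource"}},
]

print(duplicate_positions(releases))
# {('lab a', 'tool'): [0, 1, 2]}
print(duplicate_positions(releases, include_usage_type=True))
# {('lab a', 'tool', 'opensource'): [0, 2]}
```

In the sample, adding usageType separates record 1 from the others, but records 0 and 2 still share a key.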

bjbhatt commented 5 years ago

@vowelllDOE when generating a unique ID for a repo record, the system as currently programmed uses a combination of Agency + Organization + Name.

Example:
  agency: DOE
  organization: Lawrence Livermore National Laboratory (LLNL)
  name: AMPE
  Unique Id (generated): doe_lawrence_livermore_national_laboratory_llnl_ampe
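
The example above is consistent with a simple slug rule: lowercase each part, replace runs of non-alphanumeric characters with an underscore, and join the parts. The sketch below assumes that rule and uses a plain dict as a stand-in for the index; the function names and the two example URLs are hypothetical, and the real harvester may normalize differently.

```python
import re


def slug(text: str) -> str:
    """Lowercase the text and turn each run of non-alphanumerics into "_"."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def unique_id(agency: str, organization: str, name: str) -> str:
    """Combine agency + organization + name into one key."""
    return "_".join(slug(part) for part in (agency, organization, name))


org = "Lawrence Livermore National Laboratory (LLNL)"
print(unique_id("DOE", org, "AMPE"))
# doe_lawrence_livermore_national_laboratory_llnl_ampe

# A dict keyed by the unique ID stands in for the index here:
# writing a second record with the same key replaces the first.
index = {}
for repo in [
    {"name": "AMPE", "repositoryURL": "https://example.gov/ampe-public"},
    {"name": "AMPE", "repositoryURL": "https://example.gov/ampe-internal"},
]:
    index[unique_id("DOE", org, repo["name"])] = repo

print(len(index))  # 1
```

Only the later record survives, which mirrors the overwrite behavior discussed in this thread.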

vowelllDOE commented 5 years ago

@bjbhatt

Please forgive my ignorance, then how do these duplicate?

[image]

and @RicardoAReyes keeps mentioning the repo url in the dupe check

bjbhatt commented 5 years ago

@vowelllDOE

I think there is some confusion. There are two distinct issues.

  1. Unique key generation: Agency+Organization+Name matches another record (see records 371 and 260 in your code.json file, zero-based). This is a technical issue: the earlier record gets overwritten by the newer one. Looking at the data further, even if we include UsageType we would still have duplicates.
  2. A repositoryURL for an "openSource" project matches another repositoryURL (see records 141 and 129 in your code.json file, zero-based). This is NOT a technical issue but a question of whether this should be the case.

vowelllDOE commented 5 years ago

@bjbhatt

For #2, yes, that should be the case. As described above, many sites use a single landing page for access to multiple packages of source code. You can see this in the example above with https://ip.sandia.gov/contact.do. Those two packages are completely different; they just share the same URL.

bjbhatt commented 5 years ago

@vowelllDOE

1 - We are discussing internally how to generate the unique ID differently and will keep you posted. After adding UsageType to the key, the combination of Agency+Organization+Name+UsageType should be unique. Do you concur?

2 - Understood. This was just an FYI that there are repositoryURLs which match other repos' URLs. If there is a business case for it, that is perfectly fine by us.
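
Since shared repositoryURLs are acceptable, a harvester could report them as informational rather than treat them as errors. A minimal sketch, assuming each release has a `repositoryURL` field; the function name and the package names are invented, and only the landing-page URL comes from this thread.

```python
from collections import defaultdict


def shared_repository_urls(releases: list) -> dict:
    """Map each repositoryURL used more than once to its zero-based positions."""
    by_url = defaultdict(list)
    for position, release in enumerate(releases):
        url = release.get("repositoryURL")
        if url:
            # Ignore a trailing slash so "x/" and "x" count as the same page.
            by_url[url.rstrip("/")].append(position)
    return {url: pos for url, pos in by_url.items() if len(pos) > 1}


releases = [
    {"name": "Package One", "repositoryURL": "https://ip.sandia.gov/contact.do"},
    {"name": "Package Two", "repositoryURL": "https://ip.sandia.gov/contact.do"},
    {"name": "Package Three", "repositoryURL": "https://example.gov/three"},
]

for url, positions in shared_repository_urls(releases).items():
    print(f"info: {url} is shared by records {positions}")
```

Both records stay in the output; the report only flags them for review.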

IanLee1521 commented 5 years ago

@bjbhatt -- Can you provide a link / list / report of the unique IDs for all of (18 it sounds like?) duplicated agency+org+repo_names ?

bjbhatt commented 5 years ago

@IanLee1521 here it is: DOE-duplicateName.json.txt. It is a JSON file, and positions are zero-based.

Also attached is DOE's code.json file that we retrieved: DOE.code.json.txt

bjbhatt commented 5 years ago

This issue is resolved and should be closed @RicardoAReyes @saracope.