Closed RicardoAReyes closed 5 years ago
I think it should also take "name" into consideration for this deduplication. Many sites use a single landing page for access to multiple, unrelated source code packages. You can see this in the example above with https://ip.sandia.gov/contact.do. Those two packages are completely different; they just share the same URL.
@vowelllDOE the process already takes the repo name (title) into consideration. It is one of the variables used to help identify duplicates from the same organization.
@RicardoAReyes then I'm confused how the example you have above is a dupe
@RicardoAReyes even after including UsageType for DOE, I'm still finding 18 duplicates (e.g. record at position 371 still matches position 260)
@vowelllDOE when generating a unique ID for a repo record, the system, as currently programmed, uses a combination of Agency + Organization + Name.
Example:
agency: DOE
organization: Lawrence Livermore National Laboratory (LLNL)
name: AMPE
Unique ID (generated): doe_lawrence_livermore_national_laboratory_llnl_ampe
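To make the key derivation concrete, here is a minimal sketch that reproduces the generated ID in the example. The function name `make_repo_id` and the exact normalization rule (lowercase, then collapse every run of non-alphanumeric characters into one underscore) are assumptions inferred from that single example, not the actual Code.gov harvester code.

```python
import re


def make_repo_id(agency: str, organization: str, name: str) -> str:
    """Build a unique ID from agency + organization + name (hypothetical helper)."""
    # Join the three fields, then lowercase the result.
    raw = "_".join([agency, organization, name]).lower()
    # Assumed rule: any run of characters other than a-z or 0-9 becomes one "_".
    return re.sub(r"[^a-z0-9]+", "_", raw).strip("_")


repo_id = make_repo_id(
    "DOE", "Lawrence Livermore National Laboratory (LLNL)", "AMPE"
)
print(repo_id)  # doe_lawrence_livermore_national_laboratory_llnl_ampe
```

Note that the repository URL plays no part in this key, so two records with the same agency, organization, and name get the same ID regardless of where they point.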
@bjbhatt
Please forgive my ignorance, then how do these duplicate?
and @RicardoAReyes keeps mentioning the repo url in the dupe check
@vowelllDOE
I think there is some confusion. There are two distinct issues.
@bjbhatt
For #2, yes, that should be the case. As described above, many sites use a single landing page for access to multiple, unrelated source code packages. You can see this in the example above with https://ip.sandia.gov/contact.do. Those two packages are completely different; they just share the same URL.
@vowelllDOE
@bjbhatt -- Can you provide a link / list / report of the unique IDs for all of the (18, it sounds like?) duplicated agency+org+repo_names?
@IanLee1521 here it is. DOE-duplicateName.json.txt It is a JSON file and position is zero-based.
Also attached is the DOE's code.json file that we retrieved. DOE.code.json.txt
This issue is resolved and should be closed @RicardoAReyes @saracope.
DOE and SSA have reported an Elasticsearch indexing issue in which distinct repos are treated as duplicates on our platform (false positives).
DOE (@vowelllDOE, @IanLee1521) has multiple repos with the same title from the same organization. The Code.gov harvesting process combines agency + organization + title to generate a unique key for the index, so a repo overwrites any existing repo that has the same key.
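The overwrite behavior can be illustrated with a short sketch. It assumes the key rule is lowercase agency + organization + name joined with underscores; the helper names and the three sample records are made up for illustration and do not come from DOE's code.json. Positions are zero-based, matching the attached report.

```python
import re
from collections import defaultdict


def make_key(agency, organization, name):
    """Assumed key rule: lowercase, non-alphanumeric runs become '_'."""
    raw = "_".join([agency, organization, name]).lower()
    return re.sub(r"[^a-z0-9]+", "_", raw).strip("_")


def find_key_collisions(agency, releases):
    """Map each colliding key to the zero-based positions that produce it."""
    positions = defaultdict(list)
    for position, release in enumerate(releases):
        key = make_key(agency, release.get("organization", ""), release.get("name", ""))
        positions[key].append(position)
    return {key: found for key, found in positions.items() if len(found) > 1}


# Hypothetical records: positions 0 and 2 share organization and name.
releases = [
    {"organization": "Example Lab", "name": "Solver", "repositoryURL": "https://example.gov/a"},
    {"organization": "Example Lab", "name": "Mesher", "repositoryURL": "https://example.gov/b"},
    {"organization": "Example Lab", "name": "Solver", "repositoryURL": "https://example.gov/c"},
]

collisions = find_key_collisions("DOE", releases)
print(collisions)  # {'doe_example_lab_solver': [0, 2]}

# Indexing by key keeps only the last record written for each key.
index = {}
for release in releases:
    index[make_key("DOE", release["organization"], release["name"])] = release
print(len(releases), len(index))  # 3 2
```

The record at position 0 is silently replaced by the one at position 2, which is the false-positive deduplication described above.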
Current Code.gov Elasticsearch harvesting/indexing process:
DOE Example:
DOE has recommended that Code.gov address this use case. My assumption is that a potential business case is an organization keeping one repo as Open Source for public consumption and an alternative repo for internal production use. The agency has reported both repos because their source code could differ.
Recommendation 1: Introduce repo "usageType" as part of the Elasticsearch unique key.
This solution would add the usageType to the code.gov repo URL, using either the full usageType title or a short code such as o = Open Source, g = Government Wide Reuse, e = exempt by x.
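A rough sketch of Recommendation 1 follows. The usageType strings (`openSource`, `governmentWideReuse`, values starting with `exempt`) are assumed to follow the Code.gov metadata schema, and the fallback code `u`, the helper names, and the position of the code at the end of the key are all illustrative assumptions.

```python
import re

# Short codes proposed in the recommendation; the mapping itself is an assumption.
USAGE_CODES = {"openSource": "o", "governmentWideReuse": "g"}


def usage_code(usage_type):
    """Return the short code for a usageType value."""
    if usage_type in USAGE_CODES:
        return USAGE_CODES[usage_type]
    if usage_type.startswith("exempt"):
        return "e"
    return "u"  # hypothetical fallback for unknown or missing values


def make_key_with_usage(agency, organization, name, usage_type):
    """agency + organization + name + usageType code (hypothetical helper)."""
    raw = "_".join([agency, organization, name]).lower()
    slug = re.sub(r"[^a-z0-9]+", "_", raw).strip("_")
    return f"{slug}_{usage_code(usage_type)}"


public_key = make_key_with_usage("DOE", "Example Lab", "Solver", "openSource")
internal_key = make_key_with_usage("DOE", "Example Lab", "Solver", "governmentWideReuse")
print(public_key)    # doe_example_lab_solver_o
print(internal_key)  # doe_example_lab_solver_g
```

This separates records only when their usageType differs. Two same-named repos with the same usageType still collide, which is consistent with the duplicates reported earlier in the thread after UsageType was included.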
Recommendation 2 (long-term solution): @SSAgov (SSA) has recommended using a Globally Unique Identifier (GUID) for each repo. The Code.gov Metadata Schema 2.0.0 should introduce a new property in which agencies enter a GUID for the repository.
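As a sketch of Recommendation 2, the snippet below uses Python's standard `uuid` module. The property name `guid` and both helper functions are hypothetical, since no such property exists in the schema yet.

```python
import uuid


def assign_guid(release):
    """Return a copy of the release with a 'guid', generating one if absent."""
    tagged = dict(release)
    # 'guid' is a hypothetical property name; an existing value is preserved.
    tagged.setdefault("guid", str(uuid.uuid4()))
    return tagged


def index_by_guid(releases):
    """Index releases by GUID, refusing to overwrite an existing entry."""
    index = {}
    for release in releases:
        guid = release["guid"]
        if guid in index:
            raise ValueError(f"duplicate guid: {guid}")
        index[guid] = release
    return index


# Two made-up records with identical organization and name.
same_name = {"organization": "Example Lab", "name": "Solver"}
tagged = [assign_guid(same_name), assign_guid(same_name)]
guid_index = index_by_guid(tagged)
print(len(guid_index))  # 2
```

For this to work across harvests, the agency would need to generate each GUID once and store it in its code.json. An ID generated at harvest time would change on every run and could not identify the same repo twice.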
Related issue addressing repo duplicate counts with agencies: https://github.com/GSA/code-gov-api/issues/310
@jcastle @saracope @AminPIC thoughts...?