NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Metadata Augmentation] Clean up alternateNames #122

Closed: gtsueng closed this issue 2 months ago

gtsueng commented 5 months ago

The alternateNames field for the augmented species, infectiousAgent, and healthCondition properties appears to contain a lot of duplicates, which can potentially increase the size of each record and ultimately affect search performance. See this example in Vivli: https://data.niaid.nih.gov/resources?id=VIVLI_31a0311f-3e6e-41d7-92cd-10ca9d666038

Looking at the metadata displayed, it looks like "Asthma" has been repeated 8 times, and then once with a capitalization variation. Perform a de-duplication check on the alternateNames field for augmented metadata to keep things clean.
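
For illustration, a de-duplication check along these lines could look like the sketch below. It is not the fix that was merged in nde-crawlers. The function name `dedupe_alternate_names`, the `case_sensitive` flag, and the sample record layout are all hypothetical. Treating capitalization variants such as "Asthma" and "asthma" as duplicates is likewise an assumption, so the flag lets you keep them if they are meaningful.

```python
from typing import Iterable, List


def dedupe_alternate_names(names: Iterable[str], case_sensitive: bool = False) -> List[str]:
    """Drop repeated names, keeping the first occurrence of each in its original order.

    Hypothetical helper: by default, names that differ only in capitalization
    are treated as duplicates (an assumption, not confirmed by the issue).
    """
    seen = set()
    unique = []
    for name in names:
        if not isinstance(name, str):
            continue  # skip malformed entries rather than fail
        cleaned = name.strip()
        if not cleaned:
            continue  # skip empty strings
        key = cleaned if case_sensitive else cleaned.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


# Hypothetical augmented entry modeled on the "Asthma" example in this issue.
condition = {"name": "Asthma", "alternateNames": ["Asthma"] * 8 + ["asthma", "Bronchial asthma"]}
condition["alternateNames"] = dedupe_alternate_names(condition["alternateNames"])
print(condition["alternateNames"])
```

Keeping the first occurrence preserves whatever ordering the augmentation step produced, and passing `case_sensitive=True` retains capitalization variants as distinct names.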

NIAID Review status

It is expected that NIAID will not need to review an issue with this level of granularity

Related WBS Task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/13

gtsueng commented 3 months ago

The issue appears to have been addressed. The status of this issue will be changed to 'pending closure'.