NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Parser Fix]: Change SRA parser handling of 'isBasedOn' values #112

Open gtsueng opened 8 months ago

gtsueng commented 8 months ago

Background: On Tuesday, October 17th, the Production API went down. According to @DylanWelzel's investigation, the cause was SRA's excessively large metadata records, many of which list more than 1000 objects in the 'isBasedOn' field. @everaldorodrigo addressed the outage by adjusting the memory size, but the core issue remains: SRA metadata records are excessively large.

An SRA record is project- or study-based, so each record may reference thousands of runs, experiments, samples, etc. This causes memory issues when querying SRA records.

  1. Revisit the metadata that is being parsed into the 'isBasedOn' property
  2. Investigate potential changes to the parser that can address the core issue:
    • Parse multiple records of the same type into a single 'isBasedOn' object. Since the 'identifier' field can be an array, this could cut down the number of repetitive 'isBasedOn' objects that differ only by 'identifier'.
    • If this doesn't work, set an upper limit on the number of 'isBasedOn' objects to parse, then add some sort of indicator that the user should visit SRA if they want to see more.
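
The two options above could look roughly like the following. This is a minimal sketch, not the current SRA parser code: the function names `merge_is_based_on` and `cap_is_based_on`, the `limit` default, and the field names other than 'identifier' and 'isBasedOn' are hypothetical and only serve to illustrate the idea.

```python
import json


def merge_is_based_on(items):
    """Collapse objects that differ only by 'identifier' into one object each.

    Assumes every item is a JSON-serializable dict (hypothetical shape).
    """
    merged = {}  # group key -> merged object (insertion order is preserved)
    seen = {}    # group key -> set of identifiers already collected
    for item in items:
        # Everything except 'identifier' defines the group.
        rest = {k: v for k, v in item.items() if k != "identifier"}
        key = json.dumps(rest, sort_keys=True)
        ids = item.get("identifier", [])
        if isinstance(ids, str):
            ids = [ids]
        group = merged.setdefault(key, {**rest, "identifier": []})
        known = seen.setdefault(key, set())
        for identifier in ids:
            if identifier not in known:
                known.add(identifier)
                group["identifier"].append(identifier)
    return list(merged.values())


def cap_is_based_on(items, limit=100):
    """Keep at most `limit` objects and report whether any were dropped."""
    truncated = len(items) > limit
    return items[:limit], truncated


# Hypothetical input: three objects, two of which differ only by identifier.
records = [
    {"@type": "ResourceBase", "additionalType": "run", "identifier": "SRR1"},
    {"@type": "ResourceBase", "additionalType": "run", "identifier": "SRR2"},
    {"@type": "ResourceBase", "additionalType": "sample", "identifier": "SRS1"},
]
merged = merge_is_based_on(records)
kept, truncated = cap_is_based_on(merged, limit=1)
```

Note that merging reduces the number of objects but not the number of identifiers, so very large studies may still need the cap. The `truncated` flag is one possible way to carry the "see SRA for more" indicator; how it would be surfaced in the record is left open here.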