AllenNeuralDynamics / aind-data-asset-indexer

MIT License

indexer does not check `is_dict_corrupt` for existing metadata.nd.json before writing to docdb #75

Open helen-m-lin opened 4 months ago

helen-m-lin commented 4 months ago

Describe the bug

Some uploads from Jul 16 could not be written to docdb. The indexer errors out with a WriteError in `_process_prefix()` and `_process_codeocean_record()`. The code does not run `is_dict_corrupt` on an existing metadata.nd.json before writing it to docdb. Additionally, the current `is_dict_corrupt` does not check the field names inside nested lists.

To Reproduce

  1. View logs for the indexer for Jul 17
  2. Observe write errors when indexing aind-private-data and the codeocean bucket
    [ERROR] WriteError: Name is not valid for storage, full error: {'index': 0, 'code': 163, 'errmsg': 'Name is not valid for storage'}

Expected behavior

  • Existing metadata.nd.json from S3/Code Ocean should first be checked to see if it is corrupt using `aind_data_access_api.utils.is_dict_corrupt`.
  • If corrupt, skip the upload to S3 and log a warning/error that includes the S3 location of the invalid file.
  • `is_dict_corrupt` should check nested lists recursively.
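
A minimal sketch of the expected behavior, not the actual `aind_data_access_api` implementation. It assumes the names rejected with "Name is not valid for storage" are keys containing `$` or `.`. The helper names `has_invalid_field_names` and `should_index`, and the example S3 location, are hypothetical.

```python
import logging
from typing import Any

# Assumption: keys containing "$" or "." are the ones rejected for storage.
INVALID_KEY_CHARS = ("$", ".")


def has_invalid_field_names(value: Any) -> bool:
    """Return True if any dict key at any depth contains an invalid character.

    Unlike a dict-only check, this also descends into lists, so dicts
    nested inside lists (and lists inside lists) are inspected too.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            if any(char in str(key) for char in INVALID_KEY_CHARS):
                return True
            if has_invalid_field_names(child):
                return True
        return False
    if isinstance(value, list):
        return any(has_invalid_field_names(item) for item in value)
    # Scalars (str, int, None, ...) carry no field names.
    return False


def should_index(metadata: dict, s3_location: str) -> bool:
    """Return False and log the file location when the metadata is corrupt."""
    if has_invalid_field_names(metadata):
        logging.warning(
            "Skipping %s: metadata.nd.json contains invalid field names",
            s3_location,
        )
        return False
    return True
```

Here `should_index({"files": [{"a.b": 1}]}, "s3://example-bucket/prefix/metadata.nd.json")` would log a warning and return `False`, because the bad key sits in a dict inside a list.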

Additional context

The errors are currently causing the job to crash completely. A hotfix will be implemented to add error handling for processing each record.
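
As an illustration only, per-record error handling of the kind described could look like the sketch below. This is not the indexer's code: `process_all`, the `process_record` callback, and the `"location"` key are hypothetical names chosen for the example.

```python
import logging
from typing import Callable, Iterable


def process_all(
    records: Iterable[dict],
    process_record: Callable[[dict], None],
) -> int:
    """Process every record, logging failures instead of aborting the job.

    Returns the number of records that failed.
    """
    failures = 0
    for record in records:
        try:
            process_record(record)
        except Exception:
            # Log the traceback with the record's location, then move on
            # so one bad record does not stop the remaining ones.
            logging.exception(
                "Failed to process record at %s", record.get("location")
            )
            failures += 1
    return failures
```

Catching a broad `Exception` per record keeps the job alive, but it only contains the failure. The invalid field names still have to be detected before the write to resolve the underlying bug.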

mekhlakapoor commented 4 months ago

@helen-m-lin did that linked PR fix this issue? If so we can go ahead and close this out

helen-m-lin commented 4 months ago

@mekhlakapoor no, it was just a hotfix to add error handling so the indexer job doesn't crash completely. We still need this bug ticket to resolve the actual issue.