Closed alexpiet closed 1 week ago
Both assets are still missing. Confirming this isn't an issue with a delayed update of the database
Hi @alexpiet, thanks for flagging this. I looked into the 2 assets you shared. For both, we are getting errors due to corrupt keys in session.json. Specifically, fieldnames that contain "." or "$" cannot be written to AWS DocDB.
To resolve, the session jsons need to be updated. Looping @saskiad in to provide guidance on how these fields should be updated.
FYI there are ~100 data assets that are raising the same error. They all seem to have these ".csv" fields.
2024-09-23_indexer_corrupt_json.csv
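For anyone who wants to check a session.json locally, here is a minimal sketch of the key check described above. It assumes the rule is exactly as stated in this thread (any fieldname containing "." or "$" is rejected); the function name `find_invalid_keys` and the sample document are hypothetical and not part of any aind package.

```python
from typing import Any, Iterator

# Characters that, per this thread, cannot appear in fieldnames written to DocDB.
INVALID_CHARS = (".", "$")


def find_invalid_keys(node: Any, path: str = "") -> Iterator[str]:
    """Yield a slash-separated path for every dict key containing an invalid character."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}/{key}"
            if any(ch in str(key) for ch in INVALID_CHARS):
                yield here
            # Keep descending so nested offenders are reported too.
            yield from find_invalid_keys(value, here)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_invalid_keys(item, f"{path}/{index}")


# Hypothetical sample shaped loosely like a session file; not real data.
sample = {
    "data_streams": [
        {"software": {"bottom_camera.csv": 1, "ok": {"$bad": 2}}}
    ]
}
bad_paths = list(find_invalid_keys(sample))
```

To check a real file, load it with `json.load` and pass the result to `find_invalid_keys`; an empty result means no offending fieldnames were found under this rule.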
Thanks for the update @helen-m-lin.
The session.json files were made with aind-data-schema (probably v0.36.0). So if the DocDB doesn't allow those characters in fieldnames, we should make sure the data-schema doesn't allow them either, or the script that generates the DocDB should handle some conversion.
@alexpiet Agreed.
we should make sure the data-schema doesn't allow them either
There is a ticket to address this in aind-data-schema.
the script that generates the DocDB should handle some conversion
This will be handled in aind-data-asset-indexer after this bug is fixed. The default behavior is to raise a warning about any invalid file like the session.json here, but still add the metadata record to DocDB (without the invalid file). Please let me know if this is sufficient.
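A rough sketch of that default behavior, for illustration only. This is not the aind-data-asset-indexer implementation; `build_record`, `has_invalid_key`, and the record layout (one field per core file, keyed without the ".json" extension) are assumptions made up for this example.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("indexer_sketch")


def has_invalid_key(node: Any) -> bool:
    """Return True if any nested dict key contains "." or "$"."""
    if isinstance(node, dict):
        return any(
            "." in str(key) or "$" in str(key) or has_invalid_key(value)
            for key, value in node.items()
        )
    if isinstance(node, list):
        return any(has_invalid_key(item) for item in node)
    return False


def build_record(name: str, core_files: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble a metadata record, skipping (with a warning) any invalid file."""
    record: Dict[str, Any] = {"name": name}
    for field, contents in core_files.items():
        if has_invalid_key(contents):
            # Warn and leave this file out, but still produce the record.
            logger.warning("Skipping %s for %s: invalid fieldnames", field, name)
            continue
        record[field] = contents
    return record


# Hypothetical inputs: a valid subject file and a session file with a ".csv" key.
record = build_record(
    "behavior_example",
    {"subject": {"subject_id": "000000"}, "session": {"bottom_camera.csv": {}}},
)
```

With these inputs the record is still created, with the subject metadata present and the session field omitted.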
That is sufficient. Thank you
@alexpiet For the data assets that are currently missing from DocDB, I can handle uploading new session files and re-indexing. I would need the fixed session jsons.
Alternatively, we can archive the invalid session files and re-index immediately. Please let me know which you prefer.
Yeah this is a problem with these files - sorry this has been on my plate to address but the SAC got in the way. These session files have some big problems. We should discuss more to fix them. I mentioned something to Bruno a little while ago but haven't had a chance to fully document the problems yet. I'll try to do it in the next few days
@helen-m-lin I'm not sure I understand what you are asking. By "archive" do you mean removing the session.json files from the CO asset? I don't think we want that.
I think the fix we want for these files is to simply remove the ".csv" from these fieldnames. "bottom_camera.csv" to "bottom_camera"
EDIT: I fixed the issue for new sessions, but it won't get merged into production until next Wednesday. I therefore propose that we wait until then, then make a script that updates the session.json files to remove the ".csv".
@saskiad Happy to update the session.json files when you have documented the problems
Sounds like a plan. I'll prepare the script to find existing session files with the issue and remove the ".csv".
@saskiad / @alexpiet, please let me know if other changes are required, though it might be better to fix the ".csv" issue first to resolve the DocDB/indexing portion.
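A minimal sketch of the rename step discussed above, assuming the only change needed is stripping a trailing ".csv" from fieldnames (for example "bottom_camera.csv" to "bottom_camera") while leaving values untouched. The function name `strip_csv_from_keys` is hypothetical, and this is not the script that was actually run against S3.

```python
from typing import Any


def strip_csv_from_keys(node: Any, suffix: str = ".csv") -> Any:
    """Return a copy of node with the suffix removed from every dict key that ends with it."""
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            new_key = key[: -len(suffix)] if key.endswith(suffix) else key
            if new_key in cleaned:
                # Refuse to silently overwrite, e.g. when both "a" and "a.csv" exist.
                raise ValueError(f"Renaming {key!r} would collide with {new_key!r}")
            cleaned[new_key] = strip_csv_from_keys(value, suffix)
        return cleaned
    if isinstance(node, list):
        return [strip_csv_from_keys(item, suffix) for item in node]
    # Scalars (including string values such as file names) are returned unchanged.
    return node


# Hypothetical example input; not taken from a real session file.
before = {
    "bottom_camera.csv": {"side.csv": 1},
    "streams": [{"top.csv": 2}],
    "file": "a.csv",
}
after = strip_csv_from_keys(before)
```

The function builds a new structure rather than editing in place, so the original JSON can be kept for comparison before anything is re-uploaded.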
There are a lot of problems with these files that we should fix.
@helen-m-lin The code update that resolves the ".csv" in the field name has been pushed to the production rigs, so all newly uploaded data will be correct. So now is a good time to fix the field name in older data assets. Let me know if you need anything from me.
106 behavior and 271 ecephys sessions updated in S3. I've checked that they are now being indexed. We still need to address the remaining issues with session files documented by @saskiad. cc: @jtyoung84
@alexpiet Are we good closing this issue?
Describe the bug
Data assets that exist on Code Ocean are not returned by MetadataDbClient queries.
To Reproduce
This asset exists on Code Ocean: https://codeocean.allenneuraldynamics.org/data-assets/0bb25279-188b-4804-b6f5-930154bdaed0/behavior_711042_2024-09-05_09-11-52/?fullScreen=true&hideDetails=true
Yet the following query fails to find it, but finds other assets produced in the same manner:
Expected behavior
All assets on Code Ocean should be available. I do not know how often this happens, but it does appear to have happened more than once.