AllenNeuralDynamics / aind-data-access-api

Library to interface with AIND databases
MIT License
Assets are in code ocean, but do not show up when querying MetadataDbClient #91

Closed alexpiet closed 1 week ago

alexpiet commented 1 month ago

Describe the bug

Data assets that exist on Code Ocean are not returned by MetadataDbClient queries.

To Reproduce

This asset exists on Code Ocean: https://codeocean.allenneuraldynamics.org/data-assets/0bb25279-188b-4804-b6f5-930154bdaed0/behavior_711042_2024-09-05_09-11-52/?fullScreen=true&hideDetails=true

Yet the following query fails to find it, although it finds other assets produced in the same manner:

    import pandas as pd
    from aind_data_access_api.document_db import MetadataDbClient

    client = MetadataDbClient(
        host='api.allenneuraldynamics.org',
        database='metadata_index',
        collection='data_assets'
    )

    datestr = '2024-09-05'

    # Do query for all sessions from that day
    raw_results = pd.DataFrame(client.retrieve_docdb_records(filter_query={
        "name": {"$regex": "^behavior_[0-9]*_{}_[0-9-]*$".format(datestr)}
    }))
    # returns lots of sessions, but not the one mentioned above

    # Query just for this mouse
    raw_results = pd.DataFrame(client.retrieve_docdb_records(filter_query={
        "name": {"$regex": "^behavior_711042_{}_[0-9-]*$".format(datestr)}
    }))
    # returns an empty dataframe

Expected behavior

All assets on Code Ocean should be available through MetadataDbClient. I do not know how often this happens, but it appears to have happened more than once.
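As a local sanity check that the query patterns themselves are not the cause, the sketch below tests both regular expressions against the asset name from the URL above using Python's `re` module. This assumes that, for patterns this simple, Python's regex behavior agrees with the regex engine behind the `$regex` operator; it does not contact the database.

```python
import re

# Asset name taken from the Code Ocean URL in this issue.
asset_name = "behavior_711042_2024-09-05_09-11-52"
datestr = "2024-09-05"

# The same two patterns used in the queries above.
all_sessions = "^behavior_[0-9]*_{}_[0-9-]*$".format(datestr)
one_mouse = "^behavior_711042_{}_[0-9-]*$".format(datestr)

matches_all = re.search(all_sessions, asset_name) is not None
matches_one = re.search(one_mouse, asset_name) is not None
print(matches_all, matches_one)
```

If both patterns match the name locally, an empty query result points to the record being absent from the collection rather than to a mistake in the filter.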

alexpiet commented 1 month ago

Another instance of this: https://codeocean.allenneuraldynamics.org/data-assets/2177efd5-5635-48c3-87e0-a5c9227be9d1/behavior_717531_2024-09-11_09-11-19/?fullScreen=true&hideDetails=true

alexpiet commented 1 month ago

Both assets are still missing, which confirms this isn't caused by a delayed update of the database.

helen-m-lin commented 1 month ago

Hi @alexpiet, thanks for flagging this. I looked into the 2 assets you shared. For both, we are getting errors due to corrupt keys in session.json. Specifically, fieldnames that contain . or $ cannot be written to AWS DocDB.

To resolve, the session jsons need to be updated. Looping @saskiad in to provide guidance on how these fields should be updated.

FYI there are ~100 data assets that are raising the same error. They seem to all have these .csv fields. 2024-09-23_indexer_corrupt_json.csv
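A minimal sketch of how such keys could be located in a parsed session.json, assuming the rule exactly as stated above (any field name containing `.` or `$` is rejected). The helper name `find_invalid_keys` and the sample document are hypothetical and only illustrate the check; this is not the indexer's actual code.

```python
import json


def find_invalid_keys(node, path=""):
    """Return the paths of all dict keys that contain '.' or '$'."""
    bad = []
    if isinstance(node, dict):
        for key, value in node.items():
            key_path = f"{path}/{key}"
            if "." in key or "$" in key:
                bad.append(key_path)
            # Keep descending so nested offenders are reported too.
            bad.extend(find_invalid_keys(value, key_path))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            bad.extend(find_invalid_keys(item, f"{path}/{index}"))
    return bad


# Hypothetical fragment; the real session.json layout may differ.
session = json.loads(
    '{"data_streams": [{"files": {"example_camera.csv": 1, "example_camera": 2}}]}'
)
invalid = find_invalid_keys(session)
print(invalid)
```

Reporting full paths rather than a yes/no answer makes it easier to see which part of the file needs to be renamed.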

alexpiet commented 1 month ago

Thanks for the update @helen-m-lin.

The session.json files were made with aind-data-schema (probably v0.36.0). So if the DocDB doesn't allow those characters in fieldnames, we should make sure the data-schema doesn't allow them either, or the script that generates the DocDB should handle some conversion.

helen-m-lin commented 1 month ago

@alexpiet Agreed.

> we should make sure the data-schema doesn't allow them either

There is a ticket to address this in aind-data-schema.

> the script that generates the DocDB should handle some conversion

This will be handled in aind-data-asset-indexer after this bug is fixed. The default behavior is to raise a warning about any invalid file like the session.json here, but still add the metadata record to DocDB (without the invalid file). Please let me know if this is sufficient.
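The default behavior described here (warn about an invalid file but still index the rest of the record) could look roughly like the sketch below. Every name in it (`has_invalid_key`, `build_record`, the logger name, the sample asset name and file contents) is hypothetical and chosen for illustration; it is not taken from aind-data-asset-indexer.

```python
import logging

logger = logging.getLogger("indexer_sketch")  # hypothetical logger name


def has_invalid_key(node):
    """Return True if any nested dict key contains '.' or '$'."""
    if isinstance(node, dict):
        return any(
            "." in key or "$" in key or has_invalid_key(value)
            for key, value in node.items()
        )
    if isinstance(node, list):
        return any(has_invalid_key(item) for item in node)
    return False


def build_record(name, metadata_files):
    """Assemble a record, leaving out any metadata file with invalid keys."""
    record = {"name": name}
    for file_name, contents in metadata_files.items():
        if has_invalid_key(contents):
            # Warn and skip this file instead of failing the whole record.
            logger.warning("Skipping %s for %s: invalid field name", file_name, name)
            continue
        record[file_name] = contents
    return record


# Hypothetical asset with one valid and one invalid metadata file.
record = build_record(
    "behavior_000000_2024-01-01_00-00-00",
    {"subject": {"subject_id": "000000"}, "session": {"example.csv": {}}},
)
print(record)
```

With this approach the asset still shows up in name-based queries, and the warning identifies which file needs to be repaired.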

alexpiet commented 1 month ago

That is sufficient. Thank you

helen-m-lin commented 1 month ago

@alexpiet For the data assets that are currently missing from DocDB, I can handle uploading new session files and re-indexing. I would need the fixed session jsons.

Alternatively, we can archive the invalid session files and re-index immediately. Please let me know which you prefer.

saskiad commented 1 month ago

> Hi @alexpiet, thanks for flagging this. I looked into the 2 assets you shared. For both, we are getting errors due to corrupt keys in session.json. Specifically, fieldnames that contain . or $ cannot be written to AWS DocDB.
>
> To resolve, the session jsons need to be updated. Looping @saskiad in to provide guidance on how these fields should be updated.
>
> FYI there are ~100 data assets that are raising the same error. They seem to all have these .csv fields. 2024-09-23_indexer_corrupt_json.csv

Yeah this is a problem with these files - sorry this has been on my plate to address but the SAC got in the way. These session files have some big problems. We should discuss more to fix them. I mentioned something to Bruno a little while ago but haven't had a chance to fully document the problems yet. I'll try to do it in the next few days

alexpiet commented 1 month ago

> @alexpiet For the data assets that are currently missing from DocDB, I can handle uploading new session files and re-indexing. I would need the fixed session jsons.
>
> Alternatively, we can archive the invalid session files and re-index immediately. Please let me know which you prefer.

@helen-m-lin I'm not sure I understand what you are asking. By "archive" do you mean removing the session.json files from the CO asset? I don't think we want that.

I think the fix we want for these files is simply to remove the ".csv" from these fieldnames, e.g. "bottom_camera.csv" becomes "bottom_camera".

EDIT: I fixed the issue for new sessions, but it won't get merged into production until next Wednesday. I therefore propose that we wait until then, and then make a script that updates the session.json files to remove the ".csv".
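A minimal sketch of the proposed key rename, assuming the only change needed is dropping a trailing ".csv" from field names at any nesting level. The function name `strip_csv_suffix` and the sample document are hypothetical; reading and writing the actual session.json files is left out.

```python
def strip_csv_suffix(node):
    """Return a copy of node with a trailing '.csv' removed from dict keys."""
    if isinstance(node, dict):
        return {
            (key[: -len(".csv")] if key.endswith(".csv") else key): strip_csv_suffix(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [strip_csv_suffix(item) for item in node]
    # Scalars (str, int, float, bool, None) are returned unchanged.
    return node


# Hypothetical fragment; the real session.json layout may differ.
original = {
    "stream": {"bottom_camera.csv": {"rows": 3}},
    "files": [{"side_camera.csv": 1}],
    "note": "values such as 'data.csv' are left alone",
}
fixed = strip_csv_suffix(original)
print(fixed)
```

One caveat of this sketch: if a dict contains both "bottom_camera.csv" and "bottom_camera", the renamed key collides with the existing one and only one value survives, so a real script would probably want to detect that case first.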

alexpiet commented 1 month ago

> Yeah this is a problem with these files - sorry this has been on my plate to address but the SAC got in the way. These session files have some big problems. We should discuss more to fix them. I mentioned something to Bruno a little while ago but haven't had a chance to fully document the problems yet. I'll try to do it in the next few days

@saskiad Happy to update the session.json files when you have documented the problems

helen-m-lin commented 1 month ago

> next wednesday. I therefore propose that we wait until then, then make a script that updates the session.json files to remove the ".csv"

Sounds like a plan. I'll prepare the script to find existing session files with the issue and remove the ".csv".

@saskiad / @alexpiet, please let me know if other changes are required, though it might be better to fix the ".csv" issue first to resolve the DocDB/indexing portion.

saskiad commented 1 month ago

There are a lot of problems with these files that we should fix.

alexpiet commented 1 month ago

@helen-m-lin The code update that resolves the .csv in the field name has been pushed to the production rigs, so all newly uploaded data will be correct. So now is a good time to fix the field name in older data assets. Let me know if you need anything from me.

helen-m-lin commented 4 weeks ago

106 behavior and 271 ecephys sessions updated in S3. I've checked that they are now being indexed. We still need to address the remaining issues with session files documented by @saskiad. cc: @jtyoung84

jtyoung84 commented 1 week ago

@alexpiet Are we good closing this issue?