Data upload notes (meeting with David and Jon)

alexpiet commented 5 months ago

We can use the Sci. Comp data transfer service to automate data uploading to AWS.

https://github.com/AllenNeuralDynamics/aind-data-transfer-service
Move data to VAST, then trigger the upload

If you want to manually upload some data, you can:

go to http://aind-data-transfer-service/
Upload a .csv file that contains
- platform: a string that defines the platform, like "FIP" or "behavior"
- iso8601 format for date time
- the subfolder for each data modality
- See example csv below
select "preview"
select upload
Starts a SLURM job, and when finished data is available on code-ocean

To automate this, use the REST API (https://github.com/AllenNeuralDynamics/aind-data-transfer-service)

@jtyoung84 will provide an example script
any schema jsons should be in the data folder
the .csv can be on your local machine
can trigger a code-ocean pipeline to run on the uploaded data asset

Behavior specific notes

We should add some checks to make sure the data copies to VAST correctly
Any empty folders will cause the upload the fail, we should not create data modality folders that are empty, for example behavior video, photometry, or ephys
When the user saves the session, we should automatically trigger copying the data to VAST, and the upload service
One issue is that all data modalities should be uploaded together.
- Proposal: If its behavior ONLY, not FIP or ephys, then the behavior code triggers the upload. Otherwise, the neural data code needs to trigger the upload for the behavior and neural data.
- @hagikent Any thoughts? Does this proposal work for you?
We should copy all raw data to VAST, and we can convert to NWB as part of the upload process
- Should we include the Harp messages in the NWB, or just copy to AWS? @bruno-f-cruz any thoughts?
Need to agree upon the control vocabulary. If we have behavior and ephys, is this "ecephys", or "behavior"? Either is fine, but we should be consistent. These labels are just used to search for data.
@hanhou is going to coordinate the implementation for the dynamic-foraging-task

alexpiet commented 5 months ago

Example csv file for upload definitions test_for_data_upload.csv

hanhou commented 5 months ago

We need a way to specify all modalities used in a session so that the uploading script knows what modalities are supposed to be in VAST before triggering the upload. This is also a part of generating session.json metadata.

One possibility may be:

When saving the data or before a session, in the behavior GUI, the experimenter select all modalities that will be used. The default is behavior only on the training rigs. This will be reflected in the field data_stream of session.json
Upon saving, the GUI creates session.json and copys it together with the behavioral data to VAST.
Another background script (?) periodically checks whether all modalities specified in session.json are ready in VAST. If yes, trigger the upload.

Remaining problems:

How can the behavior GUI alone generate a full session.json that includes metadata from other modalities?
Where does the background script run?
- Maybe we don't need it at all? If I understand correctly, now the upload job fails if any folder specified in .csv is empty. Can we make it retry instead of failing in this case?

@jtyoung84 @dyf any ideas?

hanhou commented 5 months ago

Need to agree upon the control vocabulary. If we have behavior and ephys, is this "ecephys", or "behavior"? Either is fine, but we should be consistent. These labels are just used to search for data.

How about this?

I propose giving behavior_ a higher priority because of the logical continuity from "behavior" to "behavior + ephys/ophys" sessions (e.g., tracked by the same behavior curriculum manager), while the connection between "ephys" to "behavior + ephys" sessions is somewhat weaker.

jtyoung84 commented 5 months ago

@alexpiet Here is a small script that can be used to hit the api directly without using the webapp. I think it might be possible to avoid creating the csv file directly and just use the BasicUploadJobConfigs class directly, but I'll look into it later. For now, this script assumes the csv file exists and can be read.

import requests
import json

# the dev domain is useful for testing things
# domain = "http://aind-data-transfer-service-dev"
# The following is the prod endpoint
domain = "http://aind-data-transfer-service" 

# Example File. Can be created and stored locally
# without copying to VAST, as long as it is accessible
# from wherever the automated script is running
# I think an additional validation check is done, so it may
# be possible to skip this step and create the json
# object directly.
csv_file_path = "./tests/resources/sample.csv"
with open(csv_file_path, "rb") as f:
  files = {
    "file": f,
  }
  validate_csv_response = requests.post(url=f"{domain}/api/validate_csv", files=files)

if validate_csv_response.status_code != 200:
  # There was an error validating data (406) or a server error (500)
  print(validate_csv_response.json())
else:
  # The response validates the contents and returns the parsed contents as json
  # It's possible to skip the validate csv step and create the json object directly
  # Using the BasicUploadJobConfigs class. I'll explore it later.
  upload_job_configs = validate_csv_response.json()["data"]["jobs"]

  # Can optionally add user email and fine tune hpc configs
  # The default hpc configs should suffice for most jobs
  hpc_settings = json.dumps({})

  submit_jobs_request = {"jobs": [{"hpc_settings": hpc_settings, "upload_job_settings": j} for j in upload_job_configs]}
  # It looks like this does an additional validation check, so it might be possible to skip the csv file creation
  submit_job_response = requests.post(
    url=f"{domain}/api/submit_hpc_jobs", json=submit_jobs_request
  )

  if submit_job_response.status_code != 200:
    # An error occurred sending the request to the HPC
    print(submit_job_response.json())
  else:
    # The request was sent to the HPC. It's possible the hpc
    # may run into errors. For now, the logs get stored in
    # /allen/aind/scratch/svc_aind_upload/logs/prod/{asset_name}.out and
    # /allen/aind/scratch/svc_aind_upload/logs/prod/{asset_name}_error.out
    # Future work will be to make the logs more accessible, at least the
    # errors. The status of the job can also be viewed at:
    # f"{domain}/jobs"
    print(submit_job_response.json())

jtyoung84 commented 5 months ago

Where does the background script run?

Maybe we don't need it at all? If I understand correctly, now the upload job fails if any folder specified in .csv is empty. Can we make it retry instead of failing in this case?

For 2, the aind-data-transfer-service does some validation checks and creates a job request that can be sent to the Slurm cluster running on the HPC. Currently, each team handles submitting the upload job request. Some have automated the process I believe and others are doing it manually via the aind-data-transfer-service web app. We are looking into setting a file watcher that can trigger automatically whenever files land in VAST, but that feature might not be available for a while.

Also for 2, I think it's the ephys folder in particular that has issues since there is an automated compression done on ephys modalities. @dyf I don't think we should proceed with an upload if there is an error compressing the ephys folder. It's possible to turn the compression step off via the upload job configs though. From your end, you can add a check if the folder is empty then to turn the compress_raw_data off for that modality. However, we want to avoid uploading uncompressed data.

hagikent commented 5 months ago

One issue is that all data modalities should be uploaded together. Proposal: If its behavior ONLY, not FIP or ephys, then the behavior code triggers the upload. Otherwise, the neural data code needs to trigger the upload for the behavior and neural data.

@alexpiet replying to the original notes; I'm not sure if it's the best idea to transfer data at the end of each session. In particular with photometry, and soon with high-speed videography, the data size of a single session won't be very light. This means that DAQ->VAST transfer could happen simultaneously with the execution of the following session. Considering this, currently I transfer data at the end of the day (23:59) so DAQ-PCs are not used for anything else.

Remaining problems: How can the behavior GUI alone generate a full session.json that includes metadata from other modalities?

@hanhou I was thinking of this in the context of Ephys+Ophys (+behavior). We probably need a 2-step approach – each system produces a part of session.json, and a concatenator script ad-hoc aggregates them into a master session.json file.

alexpiet commented 5 months ago

One issue is that all data modalities should be uploaded together. Proposal: If its behavior ONLY, not FIP or ephys, then the behavior code triggers the upload. Otherwise, the neural data code needs to trigger the upload for the behavior and neural data.

@alexpiet replying to the original notes; I'm not sure if it's the best idea to transfer data at the end of each session. In particular with photometry, and soon with high-speed videography, the data size of a single session won't be very light. This means that DAQ->VAST transfer could happen simultaneously with the execution of the following session. Considering this, currently I transfer data at the end of the day (23:59) so DAQ-PCs are not used for anything else.

For the photometry, or ephys data, this makes total sense to do data upload at the end of the day. For pure behavior, the data transfer should be very fast and always finish before the next session, right?

hagikent commented 5 months ago

might not be the case with "soon with high-speed videography" even for pure behavior sessions

alexpiet commented 5 months ago

might not be the case with "soon with high-speed videography" even for pure behavior sessions

Hmm, good point. @hanhou What do you think? Should we do all data uploads at some time in the evening?

hanhou commented 5 months ago

Yes, good point. Looks like uploading data in the evening at once may be easier.

hanhou commented 5 months ago

So an updated proposal:

Before a session, in the behavior GUI (or somewhere else?), the experimenter selects all modalities that will be used.

At the end of the session,

each system creates their own partial metadata such as session_behavior.json, session_FIP.json, and session_ephys.json.

the behavior GUI adds a new entry to a local file data_upload_queue.json that specifies subject_id, session_date, together with paths_to_raw_data for all modalities of this session. paths_to_raw_data is automatically generated for each modality by predefined conventions. Assuming our current data structure, an example entry could be like:

{
"subject_id": "00000",
"session_date": "2024-01-01",
"data_upload_status": "pending",
"data_upload_date": "",
"data_upload_log": "",
"paths_to_raw_data":[
    {
        "modality": "behavior",
        "paths": [
            "//{Behavior_PC}/{behavior_root}/{subject_id}_{session_date}/TrainingFolder/",
            "//{Behavior_PC}/{behavior_root}/{subject_id}_{session_date}/HarpFolder/"
        ]
    },
    {
        "modality": "video",
        "paths": [
            "//{Behavior_PC}/{behavior_root}/{subject_id}_{session_date}/VideoFolder/"
        ]
    },
    {
        "modality": "FIP",
        "paths": [
            "//{Behavior_PC}/{behavior_root}/{subject_id}_{session_date}/PhotometryFolder/"
        ]
    },
    {
        "modality": "ephys",
        "paths": [
            "//{Ephys_PC}/{ephys_root}/{subject_id}_{session_date}/"
        ]
    }
]
}

up to now, no data will be transferred to VAST

At the end of each day, a scheduled job running on the behavior computer:
1. loops over data_upload_queue.json and filters out all new sessions of the day (data_upload_status == "pending").
2. for each session:
  - if any specified data path is empty, set data_upload_status to "data missing", and abort
  - else:
    1. loop over paths_to_raw_data and copy each modality's data to VAST.
    2. merge all partial meta data to session.json and copy to VAST.
    3. generate the csv and trigger uploading to cloud
    4. set data_upload_status to done
(nice to have) The next morning, subscribed users receive an email notification about how many new sessions were successfully uploaded, which sessions failed, and why.

This workflow assumes a central role of the behavior PC (specifying modalities and running the scheduled job). It kind of makes sense if we decide to use behavior_ as platform for all data that contains behavior.

Any thoughts? @dyf @jtyoung84 @hagikent @alexpiet @XX-Yin @jsiegle

hagikent commented 5 months ago

Thanks Han for summarizing the proposal. It looks great. An expected tricky edge condition would be when data from a single session are recorded by multiple PCs, where behavior PC cannot "see" some data, and the session folder has to be composed at VAST level, after each PC individually uploads data.

Otherwise, I foresee the suggested workflow would work.

alexpiet commented 5 months ago

Thanks @hanhou, this approach makes sense to me.

I think an email notification is a very good idea, since we don't want things to fall through the cracks as they scale. (or at least an alert when something didn't work)
At what point, and how often do we delete the data from the computer that collected it?

hanhou commented 5 months ago

behavior PC cannot "see" some data, and the session folder has to be composed at VAST level

In my example data_upload_queue.json file, I allowed raw data to be on different PCs (see path for ephys). Behavior PC should be able to copy data on any remote machines to VAST (through network share or ssh).

hanhou commented 5 months ago

how often do we delete the data from the computer that collected it?

Let the experimenter decide? Or automatically delete when: 1. data is successfully uploaded, and 2. free disk space is less than some threshold?

hagikent commented 5 months ago

//{Ephys_PC}/

Ah, I see 👍

jsiegle commented 5 months ago

I think this plan sounds good. Ideally for any modalities that are selected before the start of the experiment, the behavior GUI will send triggers to stop and start recording at the appropriate times. This is easy to do for ephys, I'm assuming it will be for photometry as well.

alexpiet commented 5 months ago

how often do we delete the data from the computer that collected it?

Let the experimenter decide? Or automatically delete when: 1. data is successfully uploaded, and 2. free disk space is less than some threshold?

I think we should give a buffer after data upload in case some error is discovered during processing in CO. So I would side closer to option 2. Maybe something like delete after one week?

bruno-f-cruz commented 4 months ago

@alexpiet Here is a small script that can be used to hit the api directly without using the webapp. I think it might be possible to avoid creating the csv file directly and just use the BasicUploadJobConfigs class directly, but I'll look into it later. For now, this script assumes the csv file exists and can be read.


import requests

import json

# the dev domain is useful for testing things

# domain = "http://aind-data-transfer-service-dev"

# The following is the prod endpoint

domain = "http://aind-data-transfer-service" 

# Example File. Can be created and stored locally

# without copying to VAST, as long as it is accessible

# from wherever the automated script is running

# I think an additional validation check is done, so it may

# be possible to skip this step and create the json

# object directly.

csv_file_path = "./tests/resources/sample.csv"

with open(csv_file_path, "rb") as f:

  files = {

    "file": f,

  }

  validate_csv_response = requests.post(url=f"{domain}/api/validate_csv", files=files)

if validate_csv_response.status_code != 200:

  # There was an error validating data (406) or a server error (500)

  print(validate_csv_response.json())

else:

  # The response validates the contents and returns the parsed contents as json

  # It's possible to skip the validate csv step and create the json object directly

  # Using the BasicUploadJobConfigs class. I'll explore it later.

  upload_job_configs = validate_csv_response.json()["data"]["jobs"]

  # Can optionally add user email and fine tune hpc configs

  # The default hpc configs should suffice for most jobs

  hpc_settings = json.dumps({})

  submit_jobs_request = {"jobs": [{"hpc_settings": hpc_settings, "upload_job_settings": j} for j in upload_job_configs]}

  # It looks like this does an additional validation check, so it might be possible to skip the csv file creation

  submit_job_response = requests.post(

    url=f"{domain}/api/submit_hpc_jobs", json=submit_jobs_request

  )

  if submit_job_response.status_code != 200:

    # An error occurred sending the request to the HPC

    print(submit_job_response.json())

  else:

    # The request was sent to the HPC. It's possible the hpc

    # may run into errors. For now, the logs get stored in

    # /allen/aind/scratch/svc_aind_upload/logs/prod/{asset_name}.out and

    # /allen/aind/scratch/svc_aind_upload/logs/prod/{asset_name}_error.out

    # Future work will be to make the logs more accessible, at least the

    # errors. The status of the job can also be viewed at:

    # f"{domain}/jobs"

    print(submit_job_response.json())

Looking to implement this in the next few days. Is there any way to bypass the csv altogether? Ie add the information necessary to copy a single session assets to the rest api call? Thanks!

jtyoung84 commented 4 months ago

Looking to implement this in the next few days. Is there any way to bypass the csv altogether? Ie add the information necessary to copy a single session assets to the rest api call? Thanks!

The "validate_csv" converts the csv file into this model: BasicJobConfigs The web app also attaches some additional settings based on the form data. When you're ready to start looking into it, I can help write the request directly.

alexpiet commented 3 months ago

Notes from meeting with @dyf @hanhou @hagikent @saskiad

Behavior software saves a session.json file, then neural data modalities add information as needed
This happens before data modality

hanhou commented 3 months ago

Notes from physiology pipeline hackathon:

macarenasa commented 2 months ago

4/30 -

@alexpiet and @bruno-f-cruz - provide session data/metadata to @dyf
Metdata definition is the main blocker.

XX-Yin commented 2 months ago

@dyf The pull request is https://github.com/AllenNeuralDynamics/dynamic-foraging-task/pull/405.

An example behavior json: 715083_2024-04-22_14-32-07.json An example rig metadata: rig323_EPHYS3_2024-03-30.json An example session metadata generated from the GUI: session.json

alexpiet commented 2 months ago

Notes from discussion with @bruno-f-cruz @dyf @JeremiahYCohen @cindypoo

Acquisition computer creates:

data
session.json
rig.json (?)

Then acquisition computer uses the aind-watchdog-service that:

copies data to VAST
provides a check when data transfer is complete/accurate
triggers upload to AWS using aind-data-transfer-service
repo: https://github.com/AllenNeuralDynamics/aind-watchdog-service

aind-data-transfer-service

copy data to AWS
fetch non-session metadata, including subject.json, and surgical information

macarenasa commented 2 months ago

[like] Macarena Aloi reacted to your message:

From: Alex Piet @.> Sent: Friday, May 3, 2024 8:09:19 PM To: AllenNeuralDynamics/dynamic-foraging-task @.> Cc: Macarena Aloi @.>; Comment @.> Subject: Re: [AllenNeuralDynamics/dynamic-foraging-task] Data upload notes (meeting with David and Jon) (Issue #209)

Notes from discussion with @bruno-f-cruzhttps://github.com/bruno-f-cruz @dyfhttps://github.com/dyf @JeremiahYCohenhttps://github.com/JeremiahYCohen @cindypoohttps://github.com/cindypoo

Acquisition computer creates:

data
session.json
rig.json (?)

Then acquisition computer uses the aind-watchdog-service that:

copies data to VAST
provides a check when data transfer is complete/accurate
triggers upload to AWS using aind-data-transfer-service
repo: https://github.com/AllenNeuralDynamics/aind-watchdog-service

aind-data-transfer-service

copy data to AWS
fetch non-session metadata, including subject.json, and surgical information

— Reply to this email directly, view it on GitHubhttps://github.com/AllenNeuralDynamics/dynamic-foraging-task/issues/209#issuecomment-2093691613, or unsubscribehttps://github.com/notifications/unsubscribe-auth/A237N6DQYXCRETSD5LF6AJLZAPVG7AVCNFSM6AAAAABCSDQS22VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJTGY4TCNRRGM. You are receiving this because you commented.Message ID: @.***>

hagikent commented 2 months ago

@alexpiet bumped into Jon and he says aind-watchdog-service is not yet fully deployed/ready. We should talk to Arielle(?), who develops this, about timeline. For now, 446 PCs might want to start with the current robocopy transfer.

alexpiet commented 2 months ago

@hagikent I just asked Arielle about timeline, will keep you posted

AllenNeuralDynamics / dynamic-foraging-task

Data upload notes (meeting with David and Jon) #209