alexpiet opened this issue 5 months ago (status: Open)
Example csv file for upload definitions: test_for_data_upload.csv
We need a way to specify all modalities used in a session so that the uploading script knows what modalities are supposed to be in VAST before triggering the upload. This is also a part of generating `session.json` metadata.
One possibility may be: the behavior GUI specifies the modalities used in the session (behavior only on the training rigs). This will be reflected in the field `data_stream` of `session.json`. The GUI generates `session.json` and copies it together with the behavioral data to VAST. A background script then checks whether all expected `session.json` files are ready in VAST. If yes, it triggers the upload.

Remaining problems: How can the behavior GUI alone generate a full `session.json` that includes metadata from other modalities? @jtyoung84 @dyf any ideas?
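A minimal sketch of such a readiness check, assuming a hypothetical VAST staging root and one `session_<modality>.json` file dropped per modality (both the root path and the naming convention here are assumptions, not settled conventions):

```python
from pathlib import Path

# Hypothetical VAST staging root; the real path depends on the rig's share.
VAST_ROOT = Path("/vast/staging")

def ready_to_upload(session_folder: Path, expected_modalities: list[str]) -> bool:
    """Return True once every expected modality has landed its partial
    session metadata file in the session folder on VAST."""
    expected = [session_folder / f"session_{m}.json" for m in expected_modalities]
    return all(p.exists() for p in expected)

# Example: only trigger the upload once behavior and FIP are both present.
session = VAST_ROOT / "000000_2024-01-01"
if ready_to_upload(session, ["behavior", "FIP"]):
    print("all modalities ready; trigger upload")
```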
- Need to agree upon the controlled vocabulary. If we have behavior and ephys, is this "ecephys" or "behavior"? Either is fine, but we should be consistent. These labels are just used to search for data.
How about this?
I propose giving `behavior_` a higher priority because of the logical continuity from "behavior" to "behavior + ephys/ophys" sessions (e.g., tracked by the same behavior curriculum manager), while the connection between "ephys" and "behavior + ephys" sessions is somewhat weaker.
@alexpiet Here is a small script that can be used to hit the api directly without using the webapp. I think it might be possible to avoid creating the csv file and just use the BasicUploadJobConfigs class directly, but I'll look into it later. For now, this script assumes the csv file exists and can be read.
```python
import requests
import json

# the dev domain is useful for testing things
# domain = "http://aind-data-transfer-service-dev"
# The following is the prod endpoint
domain = "http://aind-data-transfer-service"

# Example file. Can be created and stored locally
# without copying to VAST, as long as it is accessible
# from wherever the automated script is running.
# I think an additional validation check is done, so it may
# be possible to skip this step and create the json
# object directly.
csv_file_path = "./tests/resources/sample.csv"

with open(csv_file_path, "rb") as f:
    files = {"file": f}
    validate_csv_response = requests.post(
        url=f"{domain}/api/validate_csv", files=files
    )

if validate_csv_response.status_code != 200:
    # There was an error validating data (406) or a server error (500)
    print(validate_csv_response.json())
else:
    # The response validates the contents and returns the parsed contents
    # as json. It's possible to skip the validate_csv step and create the
    # json object directly using the BasicUploadJobConfigs class.
    # I'll explore it later.
    upload_job_configs = validate_csv_response.json()["data"]["jobs"]
    # Can optionally add user email and fine-tune hpc configs.
    # The default hpc configs should suffice for most jobs.
    hpc_settings = json.dumps({})
    submit_jobs_request = {
        "jobs": [
            {"hpc_settings": hpc_settings, "upload_job_settings": j}
            for j in upload_job_configs
        ]
    }
    # It looks like this does an additional validation check, so it might
    # be possible to skip the csv file creation.
    submit_job_response = requests.post(
        url=f"{domain}/api/submit_hpc_jobs", json=submit_jobs_request
    )
    if submit_job_response.status_code != 200:
        # An error occurred sending the request to the HPC
        print(submit_job_response.json())
    else:
        # The request was sent to the HPC. It's possible the hpc
        # may run into errors. For now, the logs get stored in
        # /allen/aind/scratch/svc_aind_upload/logs/prod/{asset_name}.out and
        # /allen/aind/scratch/svc_aind_upload/logs/prod/{asset_name}_error.out
        # Future work will be to make the logs more accessible, at least the
        # errors. The status of the job can also be viewed at:
        # f"{domain}/jobs"
        print(submit_job_response.json())
```
Where does the background script run?
- Maybe we don't need it at all? If I understand correctly, the upload job currently fails if any folder specified in the .csv is empty. Can we make it retry instead of failing in this case?
For 2, the aind-data-transfer-service does some validation checks and creates a job request that can be sent to the Slurm cluster running on the HPC. Currently, each team handles submitting the upload job request. Some have automated the process, I believe, and others are doing it manually via the aind-data-transfer-service web app. We are looking into setting up a file watcher that can trigger automatically whenever files land in VAST, but that feature might not be available for a while.
Also for 2, I think it's the ephys folder in particular that has issues since there is an automated compression done on ephys modalities. @dyf I don't think we should proceed with an upload if there is an error compressing the ephys folder. It's possible to turn the compression step off via the upload job configs, though. From your end, you could add a check: if the folder is empty, turn `compress_raw_data` off for that modality. However, we want to avoid uploading uncompressed data.
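That check could be sketched as below. Note the dict shape (`modality`, `source`, `compress_raw_data` keys) is an assumption modeled on this discussion, not the actual BasicUploadJobConfigs schema:

```python
import os

def adjust_modality_configs(modality_configs: list[dict]) -> list[dict]:
    """Turn compression off for any modality whose source folder is missing
    or empty. Each config is assumed to look like
    {"modality": "ephys", "source": "/path", "compress_raw_data": True};
    these field names are placeholders, not the verified schema."""
    for cfg in modality_configs:
        src = cfg.get("source", "")
        if not (os.path.isdir(src) and os.listdir(src)):
            cfg["compress_raw_data"] = False
    return modality_configs
```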
One issue is that all data modalities should be uploaded together. Proposal: If it's behavior ONLY, not FIP or ephys, then the behavior code triggers the upload. Otherwise, the neural data code needs to trigger the upload for the behavior and neural data.
@alexpiet replying to the original notes; I'm not sure if it's the best idea to transfer data at the end of each session. In particular with photometry, and soon with high-speed videography, the data size of a single session won't be very light. This means that DAQ->VAST transfer could happen simultaneously with the execution of the following session. Considering this, currently I transfer data at the end of the day (23:59) so DAQ-PCs are not used for anything else.
Remaining problems: How can the behavior GUI alone generate a full session.json that includes metadata from other modalities?
@hanhou
I was thinking of this in the context of Ephys+Ophys (+behavior). We probably need a 2-step approach: each system produces a part of `session.json`, and a concatenator script aggregates them ad hoc into a master `session.json` file.
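A naive sketch of such a concatenator, assuming each system drops a `session_<modality>.json` in the session folder and that partial files only need a shallow merge (conflicting keys would need real reconciliation; the `data_streams` field name is borrowed from `session.json`):

```python
import json
from pathlib import Path

def concatenate_session_jsons(session_folder: Path) -> dict:
    """Merge per-system partial files (session_behavior.json,
    session_ephys.json, ...) into one master session.json.
    Shallow dict-update merge is a placeholder, not a real schema merge."""
    master: dict = {}
    data_streams: list = []
    for part in sorted(session_folder.glob("session_*.json")):
        partial = json.loads(part.read_text())
        # Collect data_stream entries from each system; merge the rest shallowly.
        data_streams.extend(partial.pop("data_streams", []))
        master.update(partial)
    master["data_streams"] = data_streams
    (session_folder / "session.json").write_text(json.dumps(master, indent=2))
    return master
```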
> One issue is that all data modalities should be uploaded together. Proposal: If it's behavior ONLY, not FIP or ephys, then the behavior code triggers the upload. Otherwise, the neural data code needs to trigger the upload for the behavior and neural data.

> @alexpiet replying to the original notes; I'm not sure if it's the best idea to transfer data at the end of each session. In particular with photometry, and soon with high-speed videography, the data size of a single session won't be very light. This means that DAQ->VAST transfer could happen simultaneously with the execution of the following session. Considering this, currently I transfer data at the end of the day (23:59) so DAQ-PCs are not used for anything else.
For photometry or ephys data, it makes total sense to do the data upload at the end of the day. For pure behavior, the data transfer should be very fast and always finish before the next session, right?
might not be the case with "soon with high-speed videography" even for pure behavior sessions
> might not be the case with "soon with high-speed videography" even for pure behavior sessions
Hmm, good point. @hanhou What do you think? Should we do all data uploads at some time in the evening?
Yes, good point. Looks like uploading data in the evening at once may be easier.
So an updated proposal:
- Each system produces its own partial metadata: `session_behavior.json`, `session_FIP.json`, and `session_ephys.json`.
- A `data_upload_queue.json` specifies `subject_id` and `session_date`, together with `paths_to_raw_data` for all modalities of this session. `paths_to_raw_data` is automatically generated for each modality by predefined conventions. Assuming our current data structure, an example entry could be like:
```json
{
    "subject_id": "00000",
    "session_date": "2024-01-01",
    "data_upload_status": "pending",
    "data_upload_date": "",
    "data_upload_log": "",
    "paths_to_raw_data": [
        {
            "modality": "behavior",
            "paths": [
                "//{Behavior_PC}/{behavior_root}/{subject_id}_{session_date}/TrainingFolder/",
                "//{Behavior_PC}/{behavior_root}/{subject_id}_{session_date}/HarpFolder/"
            ]
        },
        {
            "modality": "video",
            "paths": [
                "//{Behavior_PC}/{behavior_root}/{subject_id}_{session_date}/VideoFolder/"
            ]
        },
        {
            "modality": "FIP",
            "paths": [
                "//{Behavior_PC}/{behavior_root}/{subject_id}_{session_date}/PhotometryFolder/"
            ]
        },
        {
            "modality": "ephys",
            "paths": [
                "//{Ephys_PC}/{ephys_root}/{subject_id}_{session_date}/"
            ]
        }
    ]
}
```
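The "predefined conventions" could be encoded in a small helper that builds such an entry. The path templates below just mirror the example entry and are placeholders, not a finalized convention:

```python
def make_queue_entry(subject_id: str, session_date: str,
                     modalities: list[str]) -> dict:
    """Build one data_upload_queue.json entry; path templates are
    placeholders copied from the example above."""
    base = f"//{{Behavior_PC}}/{{behavior_root}}/{subject_id}_{session_date}"
    templates = {
        "behavior": [f"{base}/TrainingFolder/", f"{base}/HarpFolder/"],
        "video": [f"{base}/VideoFolder/"],
        "FIP": [f"{base}/PhotometryFolder/"],
        "ephys": [f"//{{Ephys_PC}}/{{ephys_root}}/{subject_id}_{session_date}/"],
    }
    return {
        "subject_id": subject_id,
        "session_date": session_date,
        "data_upload_status": "pending",
        "data_upload_date": "",
        "data_upload_log": "",
        "paths_to_raw_data": [
            {"modality": m, "paths": templates[m]} for m in modalities
        ],
    }
```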
A scheduled job in the evening then:
- Reads `data_upload_queue.json` and filters out all new sessions of the day (`data_upload_status == "pending"`).
- If any raw data is missing, sets `data_upload_status` to `"data missing"` and aborts for that session.
- Otherwise, loops over `paths_to_raw_data` and copies each modality's data to VAST.
- Aggregates the partial metadata into a master `session.json` and copies it to VAST.
- Sets `data_upload_status` to `"done"`.
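The scheduled job could be sketched roughly as follows; the statuses and file layout follow the proposal above, while `copy_to_vast` is a hypothetical placeholder for the actual transfer mechanism:

```python
import json
from pathlib import Path

def copy_to_vast(path: str) -> None:
    """Placeholder for the actual transfer (network-share copy, ssh, robocopy)."""
    raise NotImplementedError

def run_nightly_upload(queue_path: Path, copy_fn=copy_to_vast) -> None:
    """Process pending sessions from data_upload_queue.json: mark sessions
    with missing raw data, copy complete ones to VAST, and update statuses.
    copy_fn is injectable so the transfer mechanism can be swapped."""
    queue = json.loads(queue_path.read_text())
    for entry in queue:
        if entry["data_upload_status"] != "pending":
            continue
        paths = [p for mod in entry["paths_to_raw_data"] for p in mod["paths"]]
        if not all(Path(p).exists() for p in paths):
            entry["data_upload_status"] = "data missing"
            continue
        for p in paths:
            copy_fn(p)
        entry["data_upload_status"] = "done"
    queue_path.write_text(json.dumps(queue, indent=2))
```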
This workflow assumes a central role for the behavior PC (specifying modalities and running the scheduled job). It kind of makes sense if we decide to use `behavior_` as the `platform` for all data that contains behavior.
Any thoughts? @dyf @jtyoung84 @hagikent @alexpiet @XX-Yin @jsiegle
Thanks Han for summarizing the proposal. It looks great. An expected tricky edge condition would be when data from a single session are recorded by multiple PCs, where the behavior PC cannot "see" some data and the session folder has to be composed at the VAST level, after each PC individually uploads its data.
Otherwise, I foresee that the suggested workflow would work.
Thanks @hanhou, this approach makes sense to me.
> behavior PC cannot "see" some data, and the session folder has to be composed at VAST level
In my example `data_upload_queue.json` file, I allowed raw data to be on different PCs (see `path` for ephys). The behavior PC should be able to copy data from any remote machine to VAST (through a network share or ssh).
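For UNC network-share paths, the copy step could be as simple as the sketch below (paths are placeholders; a production job would likely prefer robocopy/rsync for retries and verification):

```python
import shutil
from pathlib import Path

def copy_modality_to_vast(src: str, vast_session_folder: str) -> Path:
    """Copy one modality folder (possibly a UNC path on another PC, e.g.
    //Ephys_PC/...) into the session folder on VAST. Plain copytree sketch."""
    dest = Path(vast_session_folder) / Path(src.rstrip("/")).name
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```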
> - how often do we delete the data from the computer that collected it?
Let the experimenter decide? Or automatically delete when: 1. data is successfully uploaded, and 2. free disk space is less than some threshold?
> `//{Ephys_PC}/`

Ah, I see.
I think this plan sounds good. Ideally for any modalities that are selected before the start of the experiment, the behavior GUI will send triggers to stop and start recording at the appropriate times. This is easy to do for ephys, I'm assuming it will be for photometry as well.
> - how often do we delete the data from the computer that collected it?
>
> Let the experimenter decide? Or automatically delete when: 1. data is successfully uploaded, and 2. free disk space is less than some threshold?
I think we should give a buffer after data upload in case some error is discovered during processing in CO. So I would lean closer to option 2. Maybe something like delete after one week?
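A one-week-buffer cleanup could look like this sketch; it assumes the `data_upload_queue.json` entry shape proposed above, with `data_upload_date` filled in as an ISO date when the upload finishes:

```python
import datetime
import shutil

BUFFER_DAYS = 7  # buffer so errors found during processing in CO can be caught

def cleanup_uploaded(entries: list[dict], today: datetime.date = None) -> None:
    """Delete local raw data for sessions whose upload finished more than
    BUFFER_DAYS ago. Entries follow the data_upload_queue.json sketch above."""
    today = today or datetime.date.today()
    for e in entries:
        if e["data_upload_status"] != "done" or not e["data_upload_date"]:
            continue
        uploaded = datetime.date.fromisoformat(e["data_upload_date"])
        if (today - uploaded).days >= BUFFER_DAYS:
            for mod in e["paths_to_raw_data"]:
                for p in mod["paths"]:
                    shutil.rmtree(p, ignore_errors=True)
```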
Looking to implement this in the next few days. Is there any way to bypass the csv altogether? I.e., add the information necessary to copy a single session's assets to the REST API call? Thanks!
> Looking to implement this in the next few days. Is there any way to bypass the csv altogether? I.e., add the information necessary to copy a single session's assets to the REST API call? Thanks!
The "validate_csv" endpoint converts the csv file into this model: BasicJobConfigs. The web app also attaches some additional settings based on the form data. When you're ready to start looking into it, I can help write the request directly.
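As a rough sketch, the request body for `/api/submit_hpc_jobs` could be built directly in Python, skipping the csv. The `upload_job_settings` field names below are assumptions modeled on the csv columns and the script earlier in the thread, not the verified BasicUploadJobConfigs schema, so they should be checked against aind-data-transfer-service before use:

```python
import json

# Hypothetical job settings; field names are assumptions, not the real schema.
upload_job_settings = {
    "s3_bucket": "private",
    "platform": {"abbreviation": "behavior"},
    "subject_id": "00000",
    "acq_datetime": "2024-01-01 12:00:00",
    "modalities": [
        {
            "modality": {"abbreviation": "behavior"},
            "source": "//{Behavior_PC}/{behavior_root}/00000_2024-01-01/",
        }
    ],
}

# Same envelope the validated-csv flow uses in the script above.
submit_jobs_request = {
    "jobs": [
        {
            "hpc_settings": json.dumps({}),
            "upload_job_settings": upload_job_settings,
        }
    ]
}
# POST submit_jobs_request as json to f"{domain}/api/submit_hpc_jobs",
# exactly as in the earlier script.
```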
Notes from meeting with @dyf @hanhou @hagikent @saskiad
Notes from physiology pipeline hackathon, 4/30:
@dyf The pull request is https://github.com/AllenNeuralDynamics/dynamic-foraging-task/pull/405.
- An example behavior json: 715083_2024-04-22_14-32-07.json
- An example rig metadata: rig323_EPHYS3_2024-03-30.json
- An example session metadata generated from the GUI: session.json
Notes from discussion with @bruno-f-cruz @dyf @JeremiahYCohen @cindypoo
Acquisition computer creates:
Then the acquisition computer uses the `aind-watchdog-service`, which triggers the `aind-data-transfer-service`.
@alexpiet bumped into Jon and he says `aind-watchdog-service` is not yet fully deployed/ready.
We should talk to Arielle(?), who develops this, about the timeline. For now, 446 PCs might want to start with the current robocopy transfer.
@hagikent I just asked Arielle about timeline, will keep you posted
We can use the Sci. Comp data transfer service to automate data uploading to AWS.
If you want to manually upload some data, you can create a `.csv` file that contains the upload job definitions and submit it via the web app. To automate this, use the REST API (https://github.com/AllenNeuralDynamics/aind-data-transfer-service).
Behavior-specific notes