NHMDenmark / DaSSCo-Integration

This Repo will include integration of dassco storage from northtec

MongoDB track structure #24

Open Baeist opened 7 months ago

Baeist commented 7 months ago

One database called FirstTest. It has 4 collections: Metadata, track, batch_list, slurm_list.

Metadata contains each saved metadata.json, with an added _id field set to {guid}.

track contains one entry for each asset and has the following possible fields:

_id: {guid}
pipeline: {pipeline}
job_list: Array(x)
    { name: {job name}, status: {job status}, priority: {job priority}, timestamp: {last status update}, slurm_job_id: {default = -1} },
    { etc ... }
transfer_link: {ARS link to img file}
img_checksum: {calculated checksum for img from ndrive, optional}
is_on_slurm: {bool}
batch_list_name: {workstation + _ + date asset created}
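
A minimal Python sketch of how a track entry with the fields listed above might be assembled as a plain dictionary before it is inserted into MongoDB. Only the field names come from the description; the helper name `new_track_entry`, the initial status "WAITING", the job name "ocr", the rule that priorities count up from 1, and the underscore separator in the batch name (taken from the example "ti-ws-01_2022-08-02" later in this thread) are assumptions for illustration.

```python
from datetime import datetime, timezone


def new_track_entry(guid, pipeline, job_names, transfer_link, workstation, date_created):
    """Build a track document for one asset (hypothetical helper)."""
    now = datetime.now(timezone.utc)
    job_list = [
        {
            "name": name,
            "status": "WAITING",   # assumed initial status value
            "priority": priority,  # assumed: 1 is the first job in the list
            "timestamp": now,      # last status update
            "slurm_job_id": -1,    # default from the description above
        }
        for priority, name in enumerate(job_names, start=1)
    ]
    return {
        "_id": guid,
        "pipeline": pipeline,
        "job_list": job_list,
        "transfer_link": transfer_link,
        # img_checksum is optional and left out until it has been calculated
        "is_on_slurm": False,
        "batch_list_name": f"{workstation}_{date_created}",
    }


entry = new_track_entry(
    guid="7e6-8-02-0d-1b-15-0-000-00-000-053ea4-00000",
    pipeline="PIPEHERB0001",
    job_names=["label", "ocr"],  # "ocr" is a made-up second job
    transfer_link="https://link/to/fileproxy.tif",
    workstation="ti-ws-01",
    date_created="2022-08-02",
)
```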

batch_list contains one entry per batch, each with a list of the assets belonging to that batch.

_id: {workstation + _ + date asset created}
guids: Array(x)
    0: {guid}
    1: etc
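
As a rough sketch of the batch_list bookkeeping described above, the following Python keeps one entry per batch and appends each guid once. The dictionary `batch_lists` only stands in for the collection, and `add_to_batch` is a hypothetical helper; the underscore in the batch id is assumed from the example "ti-ws-01_2022-08-02" later in this thread.

```python
def add_to_batch(batch_lists, workstation, date_created, guid):
    """Append guid to its batch entry, creating the entry if it is missing."""
    batch_id = f"{workstation}_{date_created}"
    entry = batch_lists.setdefault(batch_id, {"_id": batch_id, "guids": []})
    if guid not in entry["guids"]:
        entry["guids"].append(guid)
    return entry


batch_lists = {}  # stands in for the batch_list collection, keyed by _id
add_to_batch(batch_lists, "ti-ws-01", "2022-08-02", "guid-a")
add_to_batch(batch_lists, "ti-ws-01", "2022-08-02", "guid-b")
add_to_batch(batch_lists, "ti-ws-01", "2022-08-02", "guid-a")  # duplicate, ignored
```

Against a real collection, the same effect could probably be had with a single pymongo `update_one` call using `$addToSet` and `upsert=True`, though that is not confirmed as the approach used in this repo.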

slurm_list contains a single entry listing every asset that currently has its metadata.json on slurm.

_id: is_on_slurm
guids: Array(x)
    0: {guid}
    1: etc
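
Because the same fact is stored twice (the is_on_slurm flag on the track entry and the guid in the slurm_list entry), the two need to change together. A small Python sketch of that idea, using plain dictionaries in place of the collections; `set_on_slurm` is a hypothetical helper, not a function from this repo.

```python
def set_on_slurm(track_entry, slurm_entry, on_slurm):
    """Update the track flag and the slurm_list entry in one place."""
    guid = track_entry["_id"]
    track_entry["is_on_slurm"] = on_slurm
    if on_slurm and guid not in slurm_entry["guids"]:
        slurm_entry["guids"].append(guid)
    elif not on_slurm and guid in slurm_entry["guids"]:
        slurm_entry["guids"].remove(guid)


slurm_entry = {"_id": "is_on_slurm", "guids": []}
asset_a = {"_id": "guid-a", "is_on_slurm": False}
asset_b = {"_id": "guid-b", "is_on_slurm": False}

set_on_slurm(asset_a, slurm_entry, True)
set_on_slurm(asset_b, slurm_entry, True)
set_on_slurm(asset_a, slurm_entry, False)  # guid-a has left slurm again
```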

Baeist commented 7 months ago

Track entry:

| Field | Type | Example | Comment |
| --- | --- | --- | --- |
| _id | string | "7e6-8-02-0d-1b-15-0-000-00-000-053ea4-00000" | The same as the asset's guid. |
| pipeline | string | "PIPEHERB0001" | The same as the asset's pipeline name. |
| created_timestamp | date | 2024-02-26T08:36:11.527+00:00 | The timestamp for when the entry was created. |
| batch_list_name | string | "ti-ws-01_2022-08-02" | Composed of the workstation name and the date the asset was taken. Used to help with mos/mso. |
| job_list | array | Array(2) | Contains the information pertaining to individual jobs. |
| is_in_ars | string | "YES" | Uses the validate enum to show if an asset has been fully readied through the NT api. |
| is_on_hpc | string | "NO" | Uses the validate enum to show if an asset has been fully transferred to the hpc server. |
| jobs_status | string | "WAITING" | Uses the status enum to show the overall status for the job flow. The most active job defines the status. |
| ars_file_link | string | https://storage.test.dassco.dk/file_proxy/api/assetfiles/test-institution/test-collection/asset_18/test.tif | Points to a file that is in an open share. This should be changed to a list of links. |
| image_check_sum | integer | 4032106524 | CRC checksum for the file in the link. This should be updated to a list when the links are. |

Job objects in the job_list:

| Field | Type | Example | Comment |
| --- | --- | --- | --- |
| name | string | "label" | The name of the job. |
| status | string | "DONE" | Uses the status enum to show the status of an individual job. |
| priority | integer | 1 | Used for knowing in which order job scripts are called. Job order is set in the pipeline_job_config file. |
| job_started_time | date | null | Datetime stamp for when a job was queued. |
| hpc_job_id | integer | -9 | Default value is -9 before a job has been queued on the hpc cluster. |

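
The comments on jobs_status and priority can be read as two small rules: the overall status is whichever job status counts as most active, and jobs are called in priority order. A hedged Python sketch of both follows. The thread does not list the status enum or say which status is "most active", so the ranking in `ACTIVITY_ORDER`, the reading that a lower priority number runs first, and the job name "ocr" are all assumptions.

```python
def next_job(job_list):
    """Return the not-yet-done job with the lowest priority number, or None."""
    pending = [job for job in job_list if job["status"] != "DONE"]
    return min(pending, key=lambda job: job["priority"], default=None)


def overall_status(job_list, activity_order):
    """Return the first status in activity_order that any job currently has."""
    statuses = {job["status"] for job in job_list}
    for status in activity_order:
        if status in statuses:
            return status
    return None


# Assumed ranking, most active first; only these two values appear in the thread.
ACTIVITY_ORDER = ["WAITING", "DONE"]

job_list = [
    {"name": "label", "status": "DONE", "priority": 1,
     "job_started_time": None, "hpc_job_id": -9},
    {"name": "ocr", "status": "WAITING", "priority": 2,  # made-up second job
     "job_started_time": None, "hpc_job_id": -9},
]

jobs_status = overall_status(job_list, ACTIVITY_ORDER)
upcoming = next_job(job_list)
```

With one job "DONE" and one "WAITING", this gives an overall "WAITING", which matches the example values in the table.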
Baeist commented 6 months ago

Track entry:

| Field | Type | Example | Comment |
| --- | --- | --- | --- |
| _id | string | "7e6-8-02-0d-1b-15-0-000-00-000-053ea4-00000" | The same as the asset's guid. |
| pipeline | string | "PIPEHERB0001" | The same as the asset's pipeline name. |
| created_timestamp | date | 2024-02-26T08:36:11.527+00:00 | The timestamp for when the entry was created. |
| batch_list_name | string | "ti-ws-01_2022-08-02" | Composed of the workstation name and the date the asset was taken. Used to help with mos/mso. |
| job_list | array | Array(2) | Contains the information pertaining to individual jobs. |
| is_in_ars | string | "YES" | Uses the validate enum to show if an asset has been fully readied through the NT api. |
| is_on_hpc | string | "NO" | Uses the validate enum to show if an asset has been fully transferred to the hpc server. |
| jobs_status | string | "WAITING" | Uses the status enum to show the overall status for the job flow. The most active job defines the status. |
| ars_file_link | string | https://storage.test.dassco.dk/file_proxy/api/assetfiles/test-institution/test-collection/asset_18/test.tif | Points to a file that is in an open share. This should be changed to a list of links. |
| image_check_sum | integer | 4032106524 | CRC checksum for the file in the link. This should be updated to a list when the links are. |
| erda_sync | string | "YES" | Uses the validate enum to show an asset's current sync status with erda. |
| update_metadata | string | "NO" | Uses the validate enum to show if an asset needs to have its metadata updated via the storage api. |

Job objects in the job_list:

| Field | Type | Example | Comment |
| --- | --- | --- | --- |
| name | string | "label" | The name of the job. |
| status | string | "DONE" | Uses the status enum to show the status of an individual job. |
| priority | integer | 1 | Used for knowing in which order job scripts are called. Job order is set in the pipeline_job_config file. |
| job_started_time | date | null | Datetime stamp for when a job was queued. |
| hpc_job_id | integer | -9 | Default value is -9 before a job has been queued on the hpc cluster. |

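
The erda_sync and update_metadata flags lend themselves to simple equality filters for picking out assets that need work. The sketch below mimics that with plain Python so it runs without a database; `matches` is a hypothetical stand-in that supports only equality, and reading erda_sync "NO" as "not yet synced" and update_metadata "YES" as "needs updating" is an assumption based on the comments in the table.

```python
def matches(document, query):
    """Equality-only stand-in for a MongoDB filter: every queried field must match."""
    return all(document.get(field) == value for field, value in query.items())


# Hypothetical filters built from the flag fields in the track entry.
NEEDS_METADATA_UPDATE = {"is_in_ars": "YES", "update_metadata": "YES"}
NOT_SYNCED_WITH_ERDA = {"is_in_ars": "YES", "erda_sync": "NO"}

track = [  # stands in for the track collection
    {"_id": "guid-a", "is_in_ars": "YES", "erda_sync": "YES", "update_metadata": "NO"},
    {"_id": "guid-b", "is_in_ars": "YES", "erda_sync": "NO", "update_metadata": "YES"},
    {"_id": "guid-c", "is_in_ars": "NO", "erda_sync": "NO", "update_metadata": "NO"},
]

to_update = [doc["_id"] for doc in track if matches(doc, NEEDS_METADATA_UPDATE)]
to_sync = [doc["_id"] for doc in track if matches(doc, NOT_SYNCED_WITH_ERDA)]
```

Dictionaries of this shape are also what pymongo's `find` accepts as a filter, so the same two constants could in principle be passed to the real collection.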
Baeist commented 6 months ago

Track entry:

| Field | Type | Example | Comment |
| --- | --- | --- | --- |
| _id | string | "7e6-8-02-0d-1b-15-0-000-00-000-053ea4-00000" | The same as the asset's guid. |
| created_timestamp | date | 2024-02-26T08:36:11.527+00:00 | The timestamp for when the entry was created. |
| pipeline | string | "PIPEHERB0001" | The same as the asset's pipeline name. |
| batch_list_name | string | "ti-ws-01_2022-08-02" | Composed of the workstation name and the date the asset was taken. Used to help with mos/mso. |
| job_list | array | Array(2) | Contains the information pertaining to individual jobs. |
| jobs_status | string | "WAITING" | Uses the status enum to show the overall status for the job flow. The most active job defines the status. |
| file_list | array | Array(1) | Contains information for each file in an asset except the metadata file. |
| files_status | string | "NONE" | Uses the status enum to give an overall status for the files. |
| asset_size | integer | 610 | The total size in MB (note: not MiB) of all the files the asset comprises. |
| hpc_ready | string | "NO" | Uses the validate enum to show if an asset has been fully transferred to the hpc server. |
| is_in_ars | string | "YES" | Uses the validate enum to show if an asset has been fully readied through the NT api. |
| has_new_file | string | "YES" | Uses the validate enum to show if new files have been added to the asset. |
| has_open_share | string | "YES" | Uses the validate enum to show if an asset has an open share. |
| erda_sync | string | "YES" | Uses the validate enum to show an asset's current sync status with erda. |
| update_metadata | string | "NO" | Uses the validate enum to show if an asset needs to have its metadata updated via the storage api. |

Job objects in the job_list:

| Field | Type | Example | Comment |
| --- | --- | --- | --- |
| name | string | "label" | The name of the job. |
| status | string | "DONE" | Uses the status enum to show the status of an individual job. |
| priority | integer | 1 | Used for knowing in which order job scripts are called. Job order is set in the pipeline_job_config file. |
| job_started_time | date | null | Datetime stamp for when a job was queued. |
| hpc_job_id | integer | -9 | Default value is -9 before a job has been queued on the hpc cluster. |

File objects in the file_list:

| Field | Type | Example | Comment |
| --- | --- | --- | --- |
| name | string | "4324-0973-34.jpg" | The file name. |
| type | string | "txt" | The file extension (tif, jpg etc). |
| time_added | date | null | Datetime stamp for when a file was added to the asset, as seen by the integration server. |
| check_sum | integer | 1408123129 | The CRC checksum of the file. |
| file_size | integer | 610 | File size in MB. |
| ars_link | string | "https://link/to/fileproxy.tif" | When available, holds the link for the file on the file proxy server. |
| erda_sync | string | "YES" | Uses the validate enum to show if the file has been synced with erda. |
| deleted | string | "NO" | Uses the validate enum to show if the file has been deleted. |

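
To make the file_list fields concrete, here is a hedged Python sketch that builds a file object from raw bytes and sums asset_size over the list. The thread only says "CRC checksum", so using CRC-32 via `zlib.crc32` is an assumption, as are rounding sizes to whole MB, leaving deleted files out of the total, and the helper names `file_entry` and `asset_size`.

```python
import zlib

BYTES_PER_MB = 1_000_000  # MB as 10**6 bytes; a MiB would be 1_048_576 bytes


def file_entry(name, data):
    """Build a file object for the file_list from a file name and its bytes."""
    return {
        "name": name,
        "type": name.rsplit(".", 1)[-1] if "." in name else "",
        "time_added": None,
        "check_sum": zlib.crc32(data),  # assumed CRC-32, unsigned in Python 3
        "file_size": round(len(data) / BYTES_PER_MB),
        "ars_link": "",
        "erda_sync": "NO",
        "deleted": "NO",
    }


def asset_size(file_list):
    """Total size in MB of the files that are not marked as deleted (assumption)."""
    return sum(f["file_size"] for f in file_list if f["deleted"] != "YES")


file_list = [
    file_entry("4324-0973-34.jpg", bytes(3 * BYTES_PER_MB)),
    file_entry("4324-0973-34.tif", bytes(7 * BYTES_PER_MB)),
]
total_mb = asset_size(file_list)
```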
Baeist commented 6 months ago

Name changed to track structures to reflect that there are other databases.

Baeist commented 5 months ago

Add proxy_path.

Track entry:

| Field | Type | Example | Comment |
| --- | --- | --- | --- |
| _id | string | "7e6-8-02-0d-1b-15-0-000-00-000-053ea4-00000" | The same as the asset's guid. |
| created_timestamp | date | 2024-02-26T08:36:11.527+00:00 | The timestamp for when the entry was created. |
| pipeline | string | "PIPEHERB0001" | The same as the asset's pipeline name. |
| batch_list_name | string | "ti-ws-01_2022-08-02" | Composed of the workstation name and the date the asset was taken. Used to help with mos/mso. |
| job_list | array | Array(2) | Contains the information pertaining to individual jobs. |
| jobs_status | string | "WAITING" | Uses the status enum to show the overall status for the job flow. The most active job defines the status. |
| file_list | array | Array(1) | Contains information for each file in an asset except the metadata file. |
| files_status | string | "NONE" | Uses the status enum to give an overall status for the files. |
| asset_size | integer | 610 | The total size in MB (note: not MiB) of all the files the asset comprises. |
| proxy_path | string | https://fileproxy/api/path/assetguid/something | The path to the current open fileshare for an asset. |
| hpc_ready | string | "NO" | Uses the validate enum to show if an asset has been fully transferred to the hpc server. |
| is_in_ars | string | "YES" | Uses the validate enum to show if an asset has been fully readied through the NT api. |
| has_new_file | string | "YES" | Uses the validate enum to show if new files have been added to the asset. |
| has_open_share | string | "YES" | Uses the validate enum to show if an asset has an open share. |
| erda_sync | string | "YES" | Uses the validate enum to show an asset's current sync status with erda. |
| update_metadata | string | "NO" | Uses the validate enum to show if an asset needs to have its metadata updated via the storage api. |

Job objects in the job_list:

| Field | Type | Example | Comment |
| --- | --- | --- | --- |
| name | string | "label" | The name of the job. |
| status | string | "DONE" | Uses the status enum to show the status of an individual job. |
| priority | integer | 1 | Used for knowing in which order job scripts are called. Job order is set in the pipeline_job_config file. |
| job_started_time | date | null | Datetime stamp for when a job was queued. |
| hpc_job_id | integer | -9 | Default value is -9 before a job has been queued on the hpc cluster. |

File objects in the file_list:

| Field | Type | Example | Comment |
| --- | --- | --- | --- |
| name | string | "4324-0973-34.jpg" | The file name. |
| type | string | "txt" | The file extension (tif, jpg etc). |
| time_added | date | null | Datetime stamp for when a file was added to the asset, as seen by the integration server. |
| check_sum | integer | 1408123129 | The CRC checksum of the file. |
| file_size | integer | 610 | File size in MB. |
| ars_link | string | "https://link/to/fileproxy.tif" | When available, holds the link for the file on the file proxy server. |
| erda_sync | string | "YES" | Uses the validate enum to show if the file has been synced with erda. |
| deleted | string | "NO" | Uses the validate enum to show if the file has been deleted. |

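
Since proxy_path describes the current open fileshare and has_open_share says whether one exists, the two fields presumably change together. A minimal Python sketch of that pairing on a plain dictionary; the helper names and the choice of an empty string for "no share" are assumptions, as the thread does not say what proxy_path holds when no share is open.

```python
def open_share(track_entry, proxy_path):
    """Record a newly opened fileshare on the track entry."""
    track_entry["proxy_path"] = proxy_path
    track_entry["has_open_share"] = "YES"
    return track_entry


def close_share(track_entry):
    """Clear the share information once the fileshare is closed."""
    track_entry["proxy_path"] = ""  # assumed placeholder for "no open share"
    track_entry["has_open_share"] = "NO"
    return track_entry


asset = {"_id": "guid-a", "proxy_path": "", "has_open_share": "NO"}
open_share(asset, "https://fileproxy/api/path/assetguid/something")
opened = dict(asset)  # snapshot while the share is open
close_share(asset)
```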