Geo-omics / GLAMR_omics_pipelines

Multi-omics pipeline for the GLAMR database

"checkout" for pipeline output files to support downstream file consumption #8

Open robert102 opened 1 month ago

robert102 commented 1 month ago

Whenever the pipeline finishes successfully writing a file intended for downstream consumption, I'd like it to append a line to a special "checkout" file giving the file's last modification time and its relative path, resulting in a file such as:

2023-11-04 06:03:40.380944-04:00 metagenomes/samp_4452/assembly/megahit_noNORM/final.contigs.renamed.fa
2024-03-01 23:29:51.935972-05:00 metagenomes/samp_4452/samp_4452_lca_abund_summarized.tsv
2023-11-03 17:59:00.514943-04:00 metagenomes/samp_4453/assembly/megahit_noNORM/final.contigs.renamed.fa
2024-03-04 20:52:55.342476-05:00 metagenomes/samp_4453/samp_4453_lca_abund_summarized.tsv
2023-11-04 07:02:04.006604-04:00 metagenomes/samp_4454/assembly/megahit_noNORM/final.contigs.renamed.fa
2024-03-04 17:52:41.123889-05:00 metagenomes/samp_4454/samp_4454_lca_abund_summarized.tsv
2023-11-04 00:01:04.829823-04:00 metagenomes/samp_4455/assembly/megahit_noNORM/final.contigs.renamed.fa
2024-03-05 19:19:19.675924-05:00 metagenomes/samp_4455/samp_4455_lca_abund_summarized.tsv
2023-11-04 00:01:09.835734-04:00 metagenomes/samp_4456/assembly/megahit_noNORM/final.contigs.renamed.fa
2024-03-04 21:53:16.752722-05:00 metagenomes/samp_4456/samp_4456_lca_abund_summarized.tsv

The timestamp should be formatted (with python) via: str(datetime.fromtimestamp(Path('/path/to/file').stat().st_mtime).astimezone())

The path is relative to .../data/omics/
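
To make the request concrete, here is a rough sketch of what the append step could look like. The names `format_mtime`, `append_checkout`, `data_root` and the file name `checkout` are made up for illustration, and the tab between timestamp and path is an assumption (the separator isn't pinned down above). It also does no locking, so how concurrent appends from several jobs should be handled is left open.

```python
from datetime import datetime
from pathlib import Path


def format_mtime(path: Path) -> str:
    """Format a file's mtime as requested: local time with UTC offset."""
    return str(datetime.fromtimestamp(path.stat().st_mtime).astimezone())


def append_checkout(data_root: Path, output_file: Path,
                    checkout_name: str = "checkout") -> str:
    """Append '<mtime><TAB><relative path>' for output_file to the checkout file.

    data_root stands in for .../data/omics/; checkout_name is a hypothetical
    file name. Returns the line that was written.
    """
    # Path relative to the data root, always with forward slashes.
    rel = output_file.relative_to(data_root).as_posix()
    # Assumption: a tab separates the timestamp from the path.
    line = f"{format_mtime(output_file)}\t{rel}\n"
    # Append mode, so existing lines are never rewritten.
    with open(data_root / checkout_name, "a") as fh:
        fh.write(line)
    return line
```

The pipeline would call something like `append_checkout(data_root, finished_file)` only after the file has been completely written.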

If the pipeline later overwrites a file, a new line with the new modtime is appended. If a file needs to be withdrawn, this can be communicated by appending a line with an empty timestamp for that path.
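
On the consuming side, these rules mean the last line for a given path wins. A minimal sketch of how a reader might resolve the file into its current state, assuming a tab separates timestamp and path (not specified above) and using a hypothetical helper name `current_state`:

```python
from datetime import datetime
from typing import Dict, Iterable


def current_state(lines: Iterable[str]) -> Dict[str, datetime]:
    """Map each available relative path to its most recently recorded mtime.

    Later lines supersede earlier ones; a line with an empty timestamp
    withdraws the path.
    """
    state: Dict[str, datetime] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        # Assumption: '<timestamp><TAB><relative path>' on every line.
        stamp, sep, rel_path = line.partition("\t")
        if not sep:
            raise ValueError(f"malformed checkout line: {line!r}")
        if stamp:
            # str(datetime) output is parseable by fromisoformat (Python 3.7+).
            state[rel_path] = datetime.fromisoformat(stamp)
        else:
            # Empty timestamp: the file has been withdrawn.
            state.pop(rel_path, None)
    return state
```

A downstream job could call this on the open checkout file and compare the returned timestamps with what it has already imported to decide which files to (re)load.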

Motivation: allows concurrent writing/reading of the omics data

akiledal commented 1 month ago

Looks good, I'll work on adding this