Closed vortexing closed 6 years ago
OK, I will look into this. Might ping you if I have questions.
So, there are a couple of things about the way this works that might be helpful to explain.
Currently (as you figured out) the program doesn't know what to do with arbitrary columns that you add at the end. It only understands the ones that tell it what to do (seq_dir, s3transferbucket, s3_prefix, and data_type) and the hardcoded tag values (molecular_id, assay_material_id, stage, and omics_sample_name). I think allowing arbitrary columns/tag names would be a good idea. So keeping the set of columns that tell the program what to do, but the rest could be anything. And allowing the columns to be in any order would also probably be good (but for tidiness to humans it would probably be good to have all the ones that tell the program what to do first, and then the tag name ones after). This feature seems sensible and not too hard to implement.
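For illustration, here is a rough sketch of how a manifest reader could keep the four control columns fixed and treat every other column as a tag, regardless of order. This is not the tool's actual code: the function names (split_row, read_manifest) and the sample values are made up for the example; only the four control column names come from this thread.

```python
import csv
import io

# Control columns named in this thread; every other column is treated as a tag.
CONTROL_COLUMNS = ("seq_dir", "s3transferbucket", "s3_prefix", "data_type")


def split_row(row):
    """Split one manifest row into (control values, tag dict)."""
    control = {name: row[name] for name in CONTROL_COLUMNS}
    tags = {name: value for name, value in row.items()
            if name not in CONTROL_COLUMNS}
    return control, tags


def read_manifest(handle):
    """Read a manifest, allowing any number of tag columns in any order."""
    reader = csv.DictReader(handle)
    headers = reader.fieldnames or []
    missing = [c for c in CONTROL_COLUMNS if c not in headers]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    return [split_row(row) for row in reader]


# Hypothetical manifest with two arbitrary tag columns after the control ones.
example = io.StringIO(
    "seq_dir,s3transferbucket,s3_prefix,data_type,stage,workflowID\n"
    "run1,my-bucket,some/prefix,fastq,raw,abc123\n"
)
rows = read_manifest(example)
```

Because columns are looked up by header name rather than position, a misspelled header such as s3_transferbucket would be reported as a missing required column instead of silently becoming a tag.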
The way it currently tags things is "write-only"; that is, it does not look at existing values at all. Whatever is in the CSV will always completely clobber the existing tags. Part of this is because AWS does not provide a way to change just a single tag at a time without affecting the others. Tagging always overwrites all of the existing tags. So making a change like "only overwrite tags that have a new value, leave blank ones alone" will impose a cost (in both performance and complexity). You would need to:
This feature seems more complicated and I am not sure if it should be done but I'll think about it some more and I could maybe be persuaded to change my mind.
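To make the cost concrete, a merge-style update would presumably look something like the sketch below: one extra read per object, a merge, and then the usual full overwrite. This is only an illustration and not what the tool does. The helper names are invented, the "blank cell means leave the tag alone" rule is an assumption, and `client` is assumed to behave like a boto3 S3 client (get_object_tagging and put_object_tagging are the calls used).

```python
def merge_tags(existing, updates):
    """Return existing tags overlaid with the non-blank values from updates."""
    merged = dict(existing)
    for key, value in updates.items():
        # Assumption: a blank (or absent) CSV cell means "leave this tag alone".
        if value not in ("", None):
            merged[key] = value
    return merged


def update_object_tags(client, bucket, key, updates):
    """Read, merge, and rewrite the full tag set for one object.

    `client` is expected to behave like a boto3 S3 client.
    """
    # Extra round trip: fetch whatever tags the object has now.
    response = client.get_object_tagging(Bucket=bucket, Key=key)
    existing = {tag["Key"]: tag["Value"] for tag in response["TagSet"]}

    merged = merge_tags(existing, updates)

    # The write still replaces the whole tag set, so send the merged result.
    tag_set = [{"Key": k, "Value": v} for k, v in merged.items()]
    client.put_object_tagging(
        Bucket=bucket, Key=key, Tagging={"TagSet": tag_set}
    )
    return merged
```

With a real client this would be called as update_object_tags(boto3.client("s3"), bucket, key, tags_from_csv_row), which doubles the number of requests per object compared with the write-only approach.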
If you need to update just certain tags, you could always run the get-s3-tags tool and create a new csv from its output, then just modify the cells that need changing.
So if I implement the first change (allowing arbitrary column/tag names) but not the second (reading existing tags and merging with what's in the CSV) would that give you what you need?
I wholeheartedly agree with the first change (esp re: making all the essential columns first, then however many tags you want to show up after that).
And the overwrite-all-tags thing was how I'd assumed S3 tagging worked but didn't know if that was just naive. Currently I DO have the output from the get-s3-tags tool so as long as I know whatever I put up there will set the tags to only what I include, then that is 100% fine and the rest of the work to get editing specific tags to work is totally not worth your time.
COOL!!!
Fixed. Please (re-)read the README as the usage instructions have changed.
Looks great! Question for the README: when only re-tagging and not uploading, you use the flag, but what do you put in the first column (seq_dir, or the directory/file in fast)? Just leave it blank? Can you leave it out? I did this a couple of times but I forget what I did.
Tried it with this csv (and tried it once with the seq_dir empty b/c these are already in S3 and no longer have a copy in fast), but it said I had missing columns. I triple-checked but can't see a problem. Am I misinterpreting the instructions?
Change s3_transferbucket to s3transferbucket. No underscore.
OMG, where's the eyeroll button?
However, I fixed it, tried it again and it gave me the same error. Manifest I'm attempting to use is in my fast (/paguirigan_a/trgen/s3-archive-txfer/18-07-27-tags2edit.csv).
Took me a while to figure this out but there are some weird invisible Unicode characters at the very beginning of your CSV, which meant that seq_dir did not match seq_dir in the check for required columns. I will email you a fixed version of the CSV. Not sure how that got in there.
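For anyone who hits the same thing: the thread doesn't say exactly which characters these were, but a UTF-8 byte order mark (BOM) is a common culprit, and it produces exactly this symptom. The sketch below assumes that was the cause; the sample manifest content is made up.

```python
import csv
import io

# A made-up manifest whose bytes start with a UTF-8 BOM (EF BB BF).
raw = "\ufeffseq_dir,s3transferbucket\nrun1,my-bucket\n".encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM glued to the first header name.
plain = csv.DictReader(io.StringIO(raw.decode("utf-8")))
plain_headers = plain.fieldnames

# Decoding as utf-8-sig drops a leading BOM if one is present.
clean = csv.DictReader(io.StringIO(raw.decode("utf-8-sig")))
clean_headers = clean.fieldnames

# The first header only matches "seq_dir" in the BOM-stripped version.
bom_breaks_match = plain_headers[0] != "seq_dir"
clean_matches = clean_headers[0] == "seq_dir"
```

When reading from a file, opening it with open(path, newline="", encoding="utf-8-sig") has the same effect and is harmless for files that have no BOM.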
I notice that the values under seq_dir are relative paths. This will work as long as you run the code from the right directory. Or maybe you are going to run with -t, in which case the values in that column do not matter.
Upon further review I have realized that when Globus writes files back to S3 it can add a tag called "workflowID". But now all the processed files in S3 that WEREN'T from Globus don't have any tag or tag value.
I have a list of keys that are already in the bucket, and I just need to either append a key-value pair as a tag OR overwrite all the tags. I gave it a shot just now, adding the new tag as an additional column on the off chance it might work, and it returned "Number of ops: 0".
I wonder about a couple of things for updating tags in bulk:
stage, do I need to still have the column in a certain place in the csv? Perhaps referring to specific columns by name might be better for modifying and correcting tags? It would also be useful to figure out whether, to update any of the tags, you realistically have to know ALL the tags and reset the entire set of tags at the same time.