adding a tag to existing entities in S3

vortexing commented 6 years ago

Upon further review I have realized that when Globus writes files back to S3 it can add a tag called "workflowID". But now all the processed files in S3 that WEREN'T from Globus don't have any tag or tag value.

I have a list of keys for objects that are already in the bucket, and I just need to either append a key-value pair as a tag OR overwrite all the tags. I gave it a shot just now and appended the new tag as an additional column, on the off chance it might work, and it returned "Number of ops: 0".
[Screenshot: 2018-07-21, 3:24:10 PM]

I wonder about a couple of things for updating tags in bulk:

- If I want to add a tag and value that isn't there yet on entities for which I have the key, can I append it to the existing tags?
- If I realize one of the existing 4 tags is wrong on a subset of entities, and I leave the csv values blank for the tags that are correct, will it only overwrite the cells that have a new value?
- If I want to update just a single tag, like stage, and leave any existing tags unchanged, does the column still need to be in a certain place in the csv?

Perhaps referring to specific columns by name might be better for modifying and correcting tags? It would also be useful to know whether, to update any one tag, you realistically have to know ALL the tags and reset the entire set at the same time.

dtenenba commented 6 years ago

OK, I will look into this. Might ping you if I have questions.

dtenenba commented 6 years ago

So, there's a couple things about the way this works that might be helpful to explain.

Currently (as you figured out) the program doesn't know what to do with arbitrary columns that you add at the end. It only understands the ones that tell it what to do (seq_dir, s3transferbucket, s3_prefix, and data_type) and the hardcoded tag names (molecular_id, assay_material_id, stage, and omics_sample_name). I think allowing arbitrary columns/tag names would be a good idea: keep the set of columns that tell the program what to do, and let the rest be anything. Allowing the columns to be in any order would also probably be good (though for tidiness to humans it would probably be good to have all the ones that tell the program what to do first, and then the tag name ones after). This feature seems sensible and not too hard to implement.
The way it currently tags things is "write-only"; that is, it does not look at existing values at all. Whatever is in the CSV will always completely clobber the existing tags. Part of this is because AWS does not provide a way to change just a single tag at a time without affecting the others. Tagging always overwrites all of the existing tags. So making a change like "only overwrite tags that have a new value, leave blank ones alone" will impose a cost (in both performance and complexity). You would need to:
- Get the existing tags on the object (this is not currently done)
- Use some logic/rules to merge the existing tags with the new ones from the CSV. This is where we could say that blank cells mean not to overwrite the existing tags, and if the value is different in the CSV, to update the tags. Anything that is the same in both the csv and the existing tags would need to be included in the merged tags.
- Finally, re-tag the object with the merged tags.
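The three steps above could be sketched roughly like this. This is not what the tool does today; it is a minimal sketch assuming `s3` is a boto3 S3 client (or anything with the same `get_object_tagging` and `put_object_tagging` methods). The function names `merge_tags` and `retag_object` are made up for illustration, and the rule that a blank cell means "keep the existing value" is the one proposed above.

```python
def merge_tags(existing, new):
    """Merge tags from a CSV row into an object's existing tags.

    Assumed rule: a blank (or missing) value in `new` means
    "leave the existing tag alone"; any other value wins.
    """
    merged = dict(existing)
    for name, value in new.items():
        if value is not None and value.strip() != "":
            merged[name] = value
    return merged


def retag_object(s3, bucket, key, new_tags):
    """Read, merge, and rewrite the full tag set for one object."""
    # 1. Get the existing tags on the object.
    response = s3.get_object_tagging(Bucket=bucket, Key=key)
    existing = {t["Key"]: t["Value"] for t in response["TagSet"]}

    # 2. Merge them with the values from the CSV row.
    merged = merge_tags(existing, new_tags)

    # 3. Re-tag the object; this call replaces the whole tag set.
    tag_set = [{"Key": k, "Value": v} for k, v in merged.items()]
    s3.put_object_tagging(Bucket=bucket, Key=key, Tagging={"TagSet": tag_set})
    return merged
```

With a real client this would be called as something like `retag_object(boto3.client("s3"), bucket, key, row_tags)`. Note the cost mentioned above: every object now needs two requests instead of one.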

This feature seems more complicated and I am not sure if it should be done but I'll think about it some more and I could maybe be persuaded to change my mind.

If you need to update just certain tags, you could always run the get-s3-tags tool and create a new csv from its output, then just modify the cells that need changing.

So if I implement the first change (allowing arbitrary column/tag names) but not the second (reading existing tags and merging with what's in the CSV) would that give you what you need?
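To make the first change concrete, here is a rough sketch of how a manifest row might be split into the columns that tell the program what to do and the columns treated as tag names, looked up by name so that column order doesn't matter. The four control column names come from this thread; `read_manifest` and `split_row` are hypothetical helpers, not the tool's actual code.

```python
import csv
import io

# Columns that tell the program what to do (names taken from this thread).
CONTROL_COLUMNS = ("seq_dir", "s3transferbucket", "s3_prefix", "data_type")


def split_row(row):
    """Split one CSV row (a dict) into control values and tag values."""
    control = {name: row[name] for name in CONTROL_COLUMNS}
    # Every other column is treated as a tag name, whatever it is called.
    tags = {name: value for name, value in row.items()
            if name not in CONTROL_COLUMNS}
    return control, tags


def read_manifest(text):
    """Parse manifest text into a list of (control, tags) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in CONTROL_COLUMNS if c not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return [split_row(row) for row in reader]
```

Because the lookup is by column name, a new tag such as workflowID can be added as an extra column anywhere in the file.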

vortexing commented 6 years ago

I wholeheartedly agree with the first change (esp re: making all the essential columns first, then however many tags you want to show up after that).

And the overwrite-all-tags thing was how I'd assumed S3 tagging worked but didn't know if that was just naive. Currently I DO have the output from the get-s3-tags tool so as long as I know whatever I put up there will set the tags to only what I include, then that is 100% fine and the rest of the work to get editing specific tags to work is totally not worth your time.

COOL!!!

dtenenba commented 6 years ago

Fixed. Please (re-)read the README as the usage instructions have changed.

vortexing commented 6 years ago

Looks great! Question for the readme: when only re-tagging and not uploading, you use the flag, but what do you put in the first column (seq_dir, or the directory/file in fast)? Just leave it blank? Can you leave it out? I did this a couple of times but I forget what I did.

vortexing commented 6 years ago

Tried it with this csv (and tried it once with seq_dir empty, b/c these are already in S3 and I no longer have a copy in fast), but it said I had missing columns. I triple checked, but can't see a problem. Am I misinterpreting the instructions?

[Screenshot: 2018-07-27, 6:57:24 AM]

dtenenba commented 6 years ago

Change s3_transferbucket to s3transferbucket. No underscore.

vortexing commented 6 years ago

OMG, where's the eyeroll button?

However, I fixed it, tried it again and it gave me the same error. Manifest I'm attempting to use is in my fast (/paguirigan_a/trgen/s3-archive-txfer/18-07-27-tags2edit.csv).

[Screenshot: 2018-07-27, 8:58:03 AM]

dtenenba commented 6 years ago

Took me a while to figure this out but there are some weird invisible unicode characters at the very beginning of your CSV which meant that seq_dir did not match seq_dir in the check for required columns. I will email you a fixed version of the csv. Not sure how that got in there.
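For anyone who hits the same thing, here is a small sketch of a header check that tolerates this kind of problem. It assumes the invisible characters are a UTF-8 byte order mark (a guess; I didn't pin down exactly what they were), and `missing_columns` is a hypothetical helper, not a function in the tool.

```python
import csv
import io

REQUIRED_COLUMNS = ("seq_dir", "s3transferbucket", "s3_prefix", "data_type")


def missing_columns(raw_bytes):
    """Return the required column names absent from a CSV's header row."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    text = raw_bytes.decode("utf-8-sig")
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, [])
    # Strip stray whitespace around header names before comparing.
    names = [h.strip() for h in header]
    return [c for c in REQUIRED_COLUMNS if c not in names]
```

Usage would be along the lines of `missing_columns(open(path, "rb").read())`. Reporting the names of the missing columns, rather than just saying some are missing, would also have surfaced the s3_transferbucket typo right away.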

I notice that the values under seq_dir are relative paths. This will work as long as you run the code from the right directory. Or maybe you are going to run with -t in which case the values in that column do not matter.

FredHutch / s3tagcrawler

adding a tag to existing entities in S3 #3