emmahodcroft / hello-world

GitHub tutorial
0 stars 0 forks source link

command line args & interplay with config #5

Open jameshadfield opened 5 years ago

jameshadfield commented 5 years ago

(i) --title -- as of the "title" branch, it overwrites any config-set title.

(ii) --maintainers and --maintainer-urls -- I've just updated the v6 branch so that now the maintainers overwrite the config file. Also ignores the URLs if they aren't the same length as the names.

(iii) --geography-traits -- in v6 this now overwrites anything specified in the config file. I think that's better than trying to merge with the config file options, as there aren't many of these and if you try to merge I don't know how you remove a trait that's in the config file?

(iv) --extra-traits it's a bit confusing in that traits found in node-data JSONs (usually via augur traits) will be exported by default unless you specify a config file. And that metadata TSV fields can be included, but you need to ask for them. As pointed out somewhere else, you can't exclude traits -- but why would you want to? Maybe we should call this --extra-colorings. If a config file with colorings exists what happens?

Writing this has made me think that we shouldn't have called augur traits traits... or that we shouldn't use "traits" as exported node properties.

(v) --panels -- should overwrite anything in the config file (TODO - it's currently the other way around)

(vi) What about filters? or just set to all the non-continuous colorings if a config is not set?

(vii) We should probably set the defaults to be the first one specified by command line flags where possible

emmahodcroft commented 5 years ago

So for everything other than traits/colorbys, I don't really have much issue with CL overriding config.

For filters - we don't have any CL argument for this and I think that's fine, personally. I would say if you don't specify in config then you get all non-continuous colourings. (But it is stuff like this that makes the CL-override-config thing weird to me - there are more options available in config so CL can never 'override' them... but anyway.)

For panels - I think this is fine in theory, but --panels has default values, so won't this always override config file - even if --panels isn't given in CL? (How would we tell the default values from the user having specified values in CL?)


For colouring traits - yes, this is where I think it gets more confusing. Currently this is how it works, I believe:

CL: Everything from node-data (augur traits & augur clades) + --geography-traits + --extra-traits is a colorby

config: Only things in config are colorby. (Kind of overrides CL, which is backwards from anything else.) (Exception is 'clades', which will be a colorby if passed in via node-data from augur clades, regardless of whether in config.) However, stuff from CL --extra-traits not in config will appear in export JSON innards. Similarly, stuff from augur traits passed in that's not in --geography-traits, --extra-traits, or config will also appear in export JSON innards.

As for why to exclude traits, I see two situations:

  1. In a very big run, I run augur traits on a bunch of stuff, say including 'host'. Once I take a look at the final result, I realise sampling is bad and the inference is bad, so I don't want to show 'host' to my collaborators who will misinterpret it. In the long run, I need to re-run augur traits without 'host', but in the moment, I want to send this off to them so I can get their thoughts on all the other stuff. Exporting without 'host' as a colorby is a quick way to get to this end without having to re-run augur traits.

  2. In a long-term project, I run augur traits on clinically sensitive data. I send these JSONs to my collaborators at the clinic with all data attached, but now and then I want to put the latest version up via community to share our work - without the sensitive data. Rather than have two projects with two versions of augur traits, I could easily modify my augur export to use one of two different config files. (In this scenario, it would work even better than currently, in that if a trait was not in config it is not exported at all, anywhere in the export JSON.)


Perhaps the easiest way to address this is to abandon my early idea where all traits in from augur traits are automatically colorbys, get rid of --extra-traits, and instead have something like --traits-to-color (avoiding names that are too similar to --colors which is how color files are passed), where you just have to specify, explicitly, what you want to colorby. (Excepting num-date and gt which I think we handle well.)*

Use might be something like this: CL: Only traits in --traits-to-color are color-by options and are written on nodes. 1,2 config: If --traits-to-color is specified, this overrides config. 3 Otherwise, only traits in config are color-by options and are written on nodes. 2

1 How do we handle clades? I personally think it's weird to make people have to write clade_membership as a coloring option when they have never seen this variable name. And when we'll be auto-detecting clade_annotation as a label and sticking that on, regardless. That's why I made this all automatic if the user passes in the output of augur clades. Ok to stick with that?

2 How do we handle --geography-traits? Currently in CL anything here is auto-added as a colorby (if not already auto-added by being from augur traits), so users don't have to write it twice - but this isn't end of the world. Do we want to allow something to be a geo-trait but not a colorby? In that case, values would be written out into the JSON even if not a colorby - but I think this is ok. (Just trying to avoid cases where something the user never mentions and isn't a colorby is 'secretly' written out in the JSON, and then they could share unknowingly.)

3 Should this override be complete? As in, if the user has given title and type in config but has specified the same --traits-to-color in CL, is the config file completely ignored, or do we still take the trait and type info?

* This is another case where CL>config acts a bit confusingly - I think it is 100% sensible to never ask the user to specify num-date or gt to be included because we can detect them, and like clades, they should never have to see the variable names (and shouldn't have to look up ours). However, if we want to allow users to change the trait name of these (which we added), then as raised in 3, config is going to kindof override CL here.


The big downside I see of getting rid of the auto-detecting colorbys from augur traits is just that I really like the idea that users can build a really simple run and they never have to think about what they want to colorby - anything they ran in traits will appear magically (if not beautifully). But then that's why to me the config override makes more sense - it allows more complicated options to be built on top of 'the basics' provided by CL. But the above idea, making users specify --traits-to-color explicitly, is maybe the best way to have everything else we want.


(vii) We should probably set the defaults to be the first one specified by command line flags where possible

Sorry, not 100% sure I understand this!

jameshadfield commented 5 years ago

Re: colours, I think the two aims of the CL args are: "easy & not-confusing to get up & running quickly" and "don't do things that could get me into trouble -- e.g. all metadata exported but hidden as it's not a color-by". On those notes it seems:

Option 1:

Option 2: Throw everything into the JSON. If you want to customize, then use a config file. Perhaps we don't even have any CL options to control this. We can auto-detect the appropriate geo-traits as any trait defined in the (default or CL provided) lat long files.


Exception is 'clades', which will be a colorby if passed in via node-data from augur clades, regardless of whether in config.

Have I got this right: (a) augur clades will export two "traits" - "clade_annotation" (on the CA of a clade) and "clade_membership" (a "normal" trait in that it's set on all nodes). (b) Currently, regardless of CL args and/or config, augur export v2 will convert any and all "clade_annotation" to node labels. (c) It will also create aa labels if aa mutations are present (i.e. most datasets) (d) export v2 will create a "clade_membership" color-by (title: "Clade") if it sees it in the nodes, and this can't be controlled by CL args nor configs.

emmahodcroft commented 5 years ago

Option 2: Throw everything into the JSON. If you want to customize, then use a config file. Perhaps we don't even have any CL options to control this. We can auto-detect the appropriate geo-traits as any trait defined in the (default or CL provided) lat long files.

I'm tempted by the simplicity of this option, but in reality every single metadata file I get (and even those I make myself from scraping GenBank) has way way way more things in it than you would ever want as colorby options 😕 . And I don't think we should encourage people to run augur traits on stuff just to get them as a colorby (if we used this instead of taking everything from metadata file).

So I'd go for Option 1. I think we could avoid some confusion by writing in the help/docs the inverse way that we tend to think about it: "Column names of traits in your metadata that you would like to colour by. If you have run augur traits on any of these, internal nodes will be coloured as well."

This then encourages users to open up their metadata and take a look at what they want. If they can't manage to remember to include the columns they were apparently interested enough in to run traits on, I'm not sure we can help! 🤷‍♀


RE clades - Yes, you have it right. My thinking here is two things:

Basically, my thinking here is to treat Clade like AA muts, nt muts, date, etc - we detect if it's there and include if it is, to avoid users having to remember to list these things (most of which have variable names they'll never have seen). Users can easily exclude by leaving it out. We could make this more explicit in the --help message. Something like:

"If the files you provide contain nucleotide or amino acid mutation information, date information, or clade information, they will automatically be included as coloring options by export. If you don't want to include this information, simply don't pass in the corresponding file with --node-data"

(I am unsure if this would actually work with date (branch_lengths), but we could hammer the details out.)

emmahodcroft commented 5 years ago

I'm not sure how best to do this, but would be good to get this sorted so I can get docs in a good state explaining this :)

I vote we move to --traits-to-color as above (must specify everything, even if already in geo traits), and treat Clades as I describe above.

Then need to settle:

jameshadfield commented 5 years ago

Sounds good, agree with most of it 😉

emmahodcroft commented 5 years ago

I just another thought that's left me uncertain again - about the output of seqtraits (in all of my use, this is for DRM detection).

I just had a look at the tb_drm test build, and this works sooooo nicely here. In v1, you had to add to the config each specific drug that you wanted to colorby. I always had to start by adding all of them (there are 10!) because you don't know what your dataset has - then removing those that weren't in a given dataset.

How export v2 is working now, you simply import the seqtraits output in --node-data, and it takes care of everything. Any drugs that have resistance data are auto-included, and any that don't, are auto-excluded.

I'm not working so much with TB data ATM, which is why I think I completely missed this until now, but I believe this was a big factor in why I came up with the 'auto use everything passed in in node-data' originally.

If we move to --traits-to-color, would we be ok with treating seqtraits output differently, like Clades?

emmahodcroft commented 5 years ago

WRT your comment on Clades. In theory I agree, but in practice, having augur clades start having two types of output (and possibly augur seqtraits too) is going to really throw some rocks in the path for users trying to get things working. They'll have to also now specify the type of output for these functions - and it'll need to match their export version.

IMO, this may exceed some users' frustration limit...

Really, though to us the internal construction of the file makes it seem more intuitive that it's just added automatically, I'm not sure most users will notice or appreciate this fact. I think what's most important is that we can explain it, and that we can be consistent.

Could having this general rule work?:

I think the only thing that violates this somewhat is date info - and this is because we internally manipulate it and it has rather a special role. Here, even though it's a column in metadata, it's not that column that's used for everything, it's the info in branch_lengths.json, so it works like the second category.

emmahodcroft commented 5 years ago
  • Previously if there was a config file you wouldn't get filters unless you specified them. Your way is consistent, so as long as it's documented it's all good.
  • Maybe remove the defaults from --panels and export JSONs without them if they're not specified anywhere. Auspice has it's own defaults.

Both of these now addressed. Filters works as I specified above. Panels - defaults removed from the arg, but they are somewhat reinstated later in process_panels - but this does not impeded functionality, detects what data is present and what isn't, and is ok, I think.

emmahodcroft commented 5 years ago

So here's how I'm kinda going with this in the config part of the docs. I'm happy to present this for discussion to the larger group if you think it's an ok direction:

Outline of what we want:

1 User can put them in config if they want to change title (can't change type)

How determine what is a colorby

General proposal:

2 Or config if using only config - override rules addressed below.

Caveats:

3 If you include date (as in, from metadata) in --extra-traits currently, things get weird. Dunno how we want to handle this.

Going forward:

4 I am not super keen on doing this now, as I think it will make the version change even more confusing for users, but I would be in favour of trying to do it in the next few months or so (while having a think about how to do it least-disturbingly). I'd also like to prioritise doing all the stuff that needs doing and getting this v6/v2 out the door, rather than taking on more work that pushes this back!


CL and config for Colorby Traits: What Overrides What

Examples of various combinations of normal metadata traits, excluding complex things like gt, numdate, Clades etc.

In general, CL overrides config except for pulling title and type info from config files, if available. If you are using CL, what's in config kind of doesn't matter (doesn't have to be everything given in CL), except if providing title and type to stuff already mentioned in CL. If just using config, then must list everything you want to colorby.

CL config Displays
--traits-to-color age host (none) "age" (type: guess) "host" (type: guess)
--traits-to-color age host "colorings:" {"age":{}, "host":{}} "age" (type: guess) "host" (type: guess)
--traits-to-color age host "colorings:" {"age":{}} or "colorings:" {"age":{}, "host":{}, "symptom":{}} "age" (type: guess) "host" (type: guess)$
--traits-to-color age "colorings:" {"age":{}, "host":{}} "age" (type: guess)$
--traits-to-color age host "colorings:" {"age":{"title":"Age in Days", "type":"continuous"}, "host":{}, "symptom":{"title":"Symptom"}} "Age in Days" (type: continuous) "host" (type: guess)%
(none) "colorings:" {"age":{}, "host":{}} "age" (type: guess) "host" (type: guess)
(none) "colorings:" {"age":{"title":"Age in Days", "type":"continuous"}, "host":{}} "Age in Days" (type: continuous) "host" (type: guess)

$ CL overrides config % CL overrides config, but draws title/type information from config traits if they are present in CL

jameshadfield commented 5 years ago

Emma I think this is great. Let's just do it, and if people have strong feeling's they'll complain during the testing phase. Thanks for all your work here, it's great.

jameshadfield commented 5 years ago

My only point: if traits are automatically included (via CL), then perhaps use --metadata-fields or something instead of --traits-to-color as lots of people don't understand what traits really means, and we've overused the term!

emmahodcroft commented 5 years ago

😊 Ok, I'll go for it!

And yes, happy to change --traits-to-color - totally pulled that out of the air somewhere upstream. I think maybe putting 'color' in there somewhere would be good as it makes it really explicit. Maybe --metadata-to-color? Or --color-by? (ex: --color-by country region age reads quite nicely and inuitively?). Or combined the two: --metadata-to-color-by. No strong feelings though.

I'm at Roslin tomorrow so may not be able to tackle this immediately - but I will!

emmahodcroft commented 5 years ago

'Tis done.......! 💀 👩‍💻

jameshadfield commented 5 years ago

🙌