biocore / empress

A fast and scalable phylogenetic tree viewer for microbiome data analysis
BSD 3-Clause "New" or "Revised" License

Adding info on feature metadata format in README #548

Open Robaina opened 2 years ago

Robaina commented 2 years ago

Hi there,

thanks for developing EMPress, it's great! However, I can't find how to format the feature metadata TSV file required by the --feature-metadata parameter. I imagine that node labels and color codes should be included somehow, but I don't know the right format.

Would be great to provide this information in README.md, or, at least, include a tsv file with a minimal example.

Thanks!

kwcantrell commented 2 years ago

@Robaina thanks for the suggestion! I'll try to put together a mini tutorial for setting up the metadata file.

But in the meantime, EMPress follows the QIIME 2 standard for metadata. Essentially, the .tsv should be a tab-delimited file where the first column is labeled Feature ID.

In terms of node color, EMPress has built-in coloring utilities that color nodes based on feature metadata values (and provides multiple color maps you can choose from), so you do not need to specify node color info in the feature metadata file.

If it helps, here is the feature metadata file used in the tutorial. taxonomy.tsv.zip

fedarko commented 2 years ago

Update: @kwcantrell got to this before I could finish writing my response out ;) In case it's helpful, I'm enclosing my response below to complement Kalen's:


Thanks @Robaina! This is a great point -- it'd be good to have detailed information on this in the README. You're right, we don't yet have very clear documentation of this anywhere.

In the meantime, for reference (or for anyone who winds up at this issue from searching, etc.): feature metadata TSV files for the standalone CLI version of EMPress should usually be formatted something like this example (this is a subset of the "moving pictures" tutorial data):

Feature ID	Taxon	Confidence
4b5eeb300368260019c1fbc7a3c718fc	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__	0.9972511412166732
fe30ff0f71a38a39cf1717ec2be3a2fc	k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; f__Neisseriaceae; g__Neisseria	0.9799426564410937
d29fe3c70564fc0f69f2c03e0d1e5561	k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus	0.9999999999714458
868528ca947bc57b69ffdf83e6b73bae	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__	0.995585946749247

I believe the only requirements are that the file is tab-separated and that its first column contains the names of the nodes being described.

These can be tip nodes or they can be internal nodes in the Newick file that you pass to EMPress. In the standalone version of EMPress I don't think this first column needs to be named Feature ID, but I think it does need to be named that (or something like that) if using this file with QIIME 2.

The remaining columns of the feature metadata TSV file can contain any node metadata (categorical or quantitative) that you want. The example here uses two metadata columns, Taxon (indicating the taxonomic annotations assigned to these features) and Confidence (corresponding to the confidence of these taxonomic annotations -- details here). Both columns can be used to assign colors to these nodes, shear the tree, etc. in EMPress' interface.
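For anyone generating this file from a script, here is a minimal sketch in Python using only the standard library. The helper names (write_feature_metadata, read_feature_metadata) are made up for illustration and are not part of EMPress or QIIME 2. The rows are two of the example rows above, and the Feature ID header follows the QIIME 2 convention mentioned earlier in this thread.

```python
import csv
import io

# Two rows taken from the example above: node name, taxonomy string, confidence.
ROWS = [
    ("4b5eeb300368260019c1fbc7a3c718fc",
     "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; "
     "f__Bacteroidaceae; g__Bacteroides; s__",
     "0.9972511412166732"),
    ("fe30ff0f71a38a39cf1717ec2be3a2fc",
     "k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Neisseriales; "
     "f__Neisseriaceae; g__Neisseria",
     "0.9799426564410937"),
]


def write_feature_metadata(handle, rows, columns=("Taxon", "Confidence")):
    """Write a tab-separated table whose first column is 'Feature ID'.

    Hypothetical helper: `columns` names the metadata columns that follow
    the ID column, and every row must supply one value per column.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(("Feature ID", *columns))
    for row in rows:
        if len(row) != len(columns) + 1:
            raise ValueError(f"expected {len(columns) + 1} fields, got {len(row)}")
        writer.writerow(row)


def read_feature_metadata(handle):
    """Read the table back as (header, {feature_id: {column: value}})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    table = {row[0]: dict(zip(header[1:], row[1:])) for row in reader}
    return header, table


# Round-trip through an in-memory buffer; swap in open("feature-metadata.tsv", "w",
# newline="") to write a real file (the filename is only an example).
buffer = io.StringIO()
write_feature_metadata(buffer, ROWS)
buffer.seek(0)
header, metadata = read_feature_metadata(buffer)
```

Reading the file back like this is a cheap way to confirm that every line has the same number of tab-separated fields before handing it to EMPress.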

One special thing worth noting is that EMPress makes note of feature metadata columns labelled Taxon or Taxonomy (ignoring case). If one of these columns exists in your dataset, EMPress will assume its entries contain taxonomic annotations (like the ones shown above), and will split these annotations on semicolons (;) so that you can view the tree using different levels of taxonomy. In the EMPress interface these different taxonomy levels will be labeled Level 1, Level 2, Level 3, etc., going from less to more specific (e.g. Kingdom, Phylum, Class, ...).
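To make the splitting concrete, here is a rough Python sketch of the behavior described above. It is only an illustration, not EMPress's own code: the function names are hypothetical, and how EMPress actually stores each level or handles features with missing ranks may differ.

```python
def find_taxonomy_column(columns):
    """Return the first column named 'Taxon' or 'Taxonomy' (ignoring case), else None."""
    for name in columns:
        if name.lower() in ("taxon", "taxonomy"):
            return name
    return None


def split_taxonomy(taxon):
    """Split a taxonomy string on ';' into 'Level 1', 'Level 2', ... entries.

    Surrounding whitespace is stripped from each piece; the pieces are
    otherwise kept as-is (including empty ranks such as 's__').
    """
    parts = [part.strip() for part in taxon.split(";")]
    return {f"Level {i}": part for i, part in enumerate(parts, start=1)}


column = find_taxonomy_column(["Feature ID", "Taxon", "Confidence"])
levels = split_taxonomy(
    "k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; "
    "o__Neisseriales; f__Neisseriaceae; g__Neisseria"
)
```

With the Neisseria example row, Level 1 holds k__Bacteria and Level 6 holds g__Neisseria. That row has no seventh level because its annotation stops at genus.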

Robaina commented 2 years ago

Thanks for the great replies @kwcantrell and @fedarko! I'll look into these examples. Perhaps just adding to the README that feature metadata follows the QIIME 2 standard will suffice.

So far I'm using standalone EMPress within my Python pipeline, so the tree is automatically generated and displayed in the browser when done. I'll take a look at QIIME 2 too.

Thanks!