epruesse / SINA

SINA - Reference based multiple sequence alignment
https://sina.readthedocs.io
GNU General Public License v3.0

necessary fields for "-f" not stated correctly in documentation, and also not given by "--arb-list-fields"? #96

Open jvollme opened 3 years ago

jvollme commented 3 years ago

Until now, I have been struggling to get the LCA classification when running sina locally. Even with the newest version on bioconda (v1.7.1), I could not get it to work based on the documentation at "readthedocs" or the help output of sina itself.

All of these seemed to suggest that the field to specify would be -f tax_slv. However, this just results in an empty column without information. Some further trial and error, based on information in the verbose output, finally led me to try -f lca_tax_slv. This seems to be the correct field, giving me the LCA classification.

However, this field is not mentioned anywhere in the documentation, and it is not even listed when asking for the fields actually available in the reference database itself (using "--arb-list-fields").

Is this perhaps a specific error of this particular sina version? Or should the documentation be corrected?

epruesse commented 3 years ago

Hi @jvollme,

--arb-list-fields is new - I put it in exactly for your case. Knowing where the taxonomy might be stored in the reference database was a little too esoteric.

The reference database, e.g. the SILVA one, needs to contain a taxonomy classification in "materialized path" format, i.e. some field that says "Bacteria; Proteobacteria; Gammaproteobacteria; ...". When you use --lca-fields <field>[:<field>,...], SINA will do an LCA-style classification based on the input fields specified. It will put the output classification into lca_<field>. So if you say --lca-fields tax_slv, it will generate lca_tax_slv. The -f flag is also new. It allows reducing the number of output fields so that the CSVs become more usable. So technically, it should be --lca-fields tax_slv together with -f lca_tax_slv.
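
To make the naming convention concrete, here is a minimal Python sketch of the idea. It is not SINA's implementation: the function names (lca_of_paths, classify) and the example neighbor records are hypothetical, and the sketch simply takes the strict common leading path of the input strings, whereas SINA's actual classification may apply further rules. What it shows is that the result of classifying on an input field such as tax_slv lands in a separate output field named lca_tax_slv. Since that field is computed at run time rather than stored in the reference database, this would also explain why a listing of the database fields does not include it.

```python
def lca_of_paths(paths):
    """Return the shared leading part of ';'-separated taxonomy paths."""
    # Split each materialized path into its ranks, dropping empty pieces
    # such as the one left behind by a trailing ';'.
    split = [[rank.strip() for rank in p.split(";") if rank.strip()]
             for p in paths]
    if not split:
        return ""
    common = []
    # Walk the ranks in parallel and stop at the first disagreement.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return "; ".join(common)


def classify(neighbors, lca_fields):
    """Build output fields named lca_<field> from the given input fields."""
    result = {}
    for field in lca_fields:
        # The input field is left untouched; the call goes into lca_<field>.
        result["lca_" + field] = lca_of_paths([n[field] for n in neighbors])
    return result


# Hypothetical reference sequences found for one query.
neighbors = [
    {"tax_slv": "Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales"},
    {"tax_slv": "Bacteria; Proteobacteria; Alphaproteobacteria; Rhizobiales"},
]

output = classify(neighbors, ["tax_slv"])
print(output)
```

In this sketch, asking for the output field tax_slv would find nothing, because only lca_tax_slv is written. This mirrors the empty column seen with -f tax_slv and the working result with -f lca_tax_slv.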

Not as straightforward as I had thought. I wasn't using both at the same time.

It sounds like perhaps --lca-fields tax_slv should generate its output in tax_slv? Or at least have an option to do that? The original thought was to make it clear which fields were input data and which were calls made by a different method.