epruesse / SINA

SINA - Reference based multiple sequence alignment
https://sina.readthedocs.io
GNU General Public License v3.0
40 stars 4 forks source link

align_ident_slv field blank if not 100% identity #49

Closed KasperSkytte closed 5 years ago

KasperSkytte commented 5 years ago

Hi!

I have used SINA 1.4.0 to align and assign taxonomy to some full length 16S sequences. When inspecting the output, the field align_ident_slv in the output CSV file only shows sequences with an identity of 100, everything else is just blank. I ran the same data on SINA 1.3.5 last week and all fields in the column had a value. This is the command I ran:

sina -i fssu.fa --intype fasta -o output.fa \
  --meta-fmt csv nearest nuc achieved_idty name \
  -r refdatabases/SILVA_132_SSURef_NR99_13_12_17_opt.arb --search \
  --search-db refdatabases/SILVA_132_SSURef_NR99_13_12_17_opt.arb \
  --lca-fields tax_slv --search-min-sim 0.5 --search-max-result 1 \
  --num-pts $((THREADS / 2)) --threads $THREADS

I run Ubuntu 18.04 LTS on a Dell 7920 with 40 threads + 128GB RAM.

Thanks in advance Kasper

epruesse commented 5 years ago
  sina -i fssu.fa --intype fasta -o output.fa \

You don't have to specify intype or outtype - default is fasta anyway.

  --meta-fmt csv nearest nuc achieved_idty name \

The arguments nearest nuc achieved_idty name won't do anything. They are not preceded by any option and ignored by boost program options. I guess I need to figure out how to get those from boost so I can throw an error.

  -r refdatabases/SILVA_132_SSURef_NR99_13_12_17_opt.arb --search \
  --search-db refdatabases/SILVA_132_SSURef_NR99_13_12_17_opt.arb \

You don't have to specify --search-db if it is the same as --db. That's the default anyway.

  --lca-fields tax_slv --search-min-sim 0.5 --search-max-result 1 \

Ok so best match, no matter how far away. Got it ... you really want everything to have some classification. ;)

  --num-pts $((THREADS / 2)) --threads $THREADS

Threads will default to the number of threads the computer has minus the number currently occupied by other processes. Usually works pretty well. I think I might increase the default for num-pts - memory use isn't that bad. For the SILVA NR, it's mostly opening the database that eats some 5GB inside of SINA itself.

epruesse commented 5 years ago

When inspecting the output, the field align_ident_slv in the output CSV file only shows sequences with an identity of 100, everything else is just blank. I ran the same data on SINA 1.3.5 last week and all fields in the column had a value

Did you use --calc-idty or --min-idty when you ran it last time? Without those, the column shouldn't be there at all actually.

KasperSkytte commented 5 years ago

Hi again

I didn't use --calc-idty before and the column shows anyways. But now that I do it has a value in all fields, so I guess that fixes it. I'll reduce the number of arguments a bit when they are superfluous anyways. I am trying to generate denovo taxonomy based on full length exact sequence variants (based on this work) and then merge with SILVA. So yes, everything and best matches only :) Regarding threads, unfortunately it only runs on full power initially, and then ends up with only 1-2 threads, because I use --search as mentioned in #32. It only uses 1 PT server then.

Thanks for your prompt response -Kasper

epruesse commented 5 years ago

I didn't use --calc-idty before and the column shows anyways. But now that I do it has a value in all fields, so I guess that fixes it.

Hmm. I think I did do a little cleanup in that section, it's possible I "fixed" that. I guess I could make it a default, it doesn't take that long to compute after all. I added it at some point because the alignment score (which says something similar) is somewhat dependent on the alignment parameters.

I'll reduce the number of arguments a bit when they are superfluous anyways. I am trying to generate denovo taxonomy based on full length exact sequence variants (based on this work) and then merge with SILVA. So yes, everything and best matches only :)

To be honest, the whole LCA classification was just added because people were using the alignment reference to classify their sequences, and that's just a heuristic. The search module does a pairwise comparison of the top 1000 hits (default) based on the already existing alignment. You can configure how that identity is calculated - there are quite a few different ways, that can make large differences in practice (Do you want to count gaps as differences? Only in the reference, only in the query, or both? Do you want to apply a Jukes Cantor correction? Do you want to count parts of the sequence that "overhang"?).

One thing to be aware of is that this uses the MSA and does not calculate the minmal pairwise distance. Combined with the default to count gaps, the identity calculated by SINA is usually lower than what you would normally see.

The LCA has one uncommon parameter - the quorum. By default, it would consider the top 10 matches and assign the deepest classification level shared by at least 70% of those ten. It's primitive. I designed it back then to match what I figured the Biologist would do - BLAST it and if 7 out of the top 10 agree on something, that should be what it is.

The best-match-at-all-cost configuration you chose will "achieve" maximum depth classifications at the cost of accuracy. It'll be wrong plenty of times - it's basically a best-blast approach.

Regarding threads, unfortunately it only runs on full power initially, and then ends up with only 1-2 threads, because I use --search as mentioned in #32. It only uses 1 PT server then.

Yes ... I know, that's a little unfortunate. In particular since you use the full NR for both alignment and search with the same parameters, so the second search is actually completely redundant. The problem is that you can in theory use two different databases for alignment and classification and two different sets of parameters, so there'd be a lot of (error prone) plumbing to get this to work properly.

I'm working on refactoring that whole part at the moment so that I can make --fs-engine internal work correctly. Once that's done, I might fix #32, but then hopefully it won't matter any more. I'm expecting to be quite a bit faster than the PT server.