Closed KasperSkytte closed 5 years ago
sina -i fssu.fa --intype fasta -o output.fa \
You don't have to specify intype or outtype - default is fasta anyway.
--meta-fmt csv nearest nuc achieved_idty name \
The arguments nearest nuc achieved_idty name
won't do anything. They are not preceded by any option and ignored by boost program options. I guess I need to figure out how to get those from boost so I can throw an error.
-r refdatabases/SILVA_132_SSURef_NR99_13_12_17_opt.arb --search \
--search-db refdatabases/SILVA_132_SSURef_NR99_13_12_17_opt.arb \
You don't have to specify --search-db
if it is the same as --db
. That's the default anyway.
--lca-fields tax_slv --search-min-sim 0.5 --search-max-result 1 \
Ok so best match, no matter how far away. Got it ... you really want everything to have some classification. ;)
--num-pts $((THREADS / 2)) --threads $THREADS
Threads will default to the number of threads the computer has minus the number currently occupied by other processes. Usually works pretty well. I think I might increase the default for num-pts
- memory use isn't that bad. For the SILVA NR, it's mostly opening the database that eats some 5GB inside of SINA itself.
When inspecting the output, the field align_ident_slv in the output CSV file only shows sequences with an identity of 100, everything else is just blank. I ran the same data on SINA 1.3.5 last week and all fields in the column had a value
Did you use --calc-idty
or --min-idty
when you ran it last time? Without those, the column shouldn't be there at all actually.
Hi again
I didn't use --calc-idty
before and the column shows anyways. But now that I do it has a value in all fields, so I guess that fixes it.
I'll reduce the number of arguments a bit when they are superfluous anyways. I am trying to generate denovo taxonomy based on full length exact sequence variants (based on this work) and then merge with SILVA. So yes, everything and best matches only :)
Regarding threads, unfortunately it only runs on full power initially, and then ends up with only 1-2 threads, because I use --search
as mentioned in #32. It only uses 1 PT server then.
Thanks for your prompt response -Kasper
I didn't use
--calc-idty
before and the column shows anyways. But now that I do it has a value in all fields, so I guess that fixes it.
Hmm. I think I did do a little cleanup in that section, it's possible I "fixed" that. I guess I could make it a default, it doesn't take that long to compute after all. I added it at some point because the alignment score (which says something similar) is somewhat dependent on the alignment parameters.
I'll reduce the number of arguments a bit when they are superfluous anyways. I am trying to generate denovo taxonomy based on full length exact sequence variants (based on this work) and then merge with SILVA. So yes, everything and best matches only :)
To be honest, the whole LCA classification was just added because people were using the alignment reference to classify their sequences, and that's just a heuristic. The search module does a pairwise comparison of the top 1000 hits (default) based on the already existing alignment. You can configure how that identity is calculated - there are quite a few different ways, that can make large differences in practice (Do you want to count gaps as differences? Only in the reference, only in the query, or both? Do you want to apply a Jukes Cantor correction? Do you want to count parts of the sequence that "overhang"?).
One thing to be aware of is that this uses the MSA and does not calculate the minmal pairwise distance. Combined with the default to count gaps, the identity calculated by SINA is usually lower than what you would normally see.
The LCA has one uncommon parameter - the quorum. By default, it would consider the top 10 matches and assign the deepest classification level shared by at least 70% of those ten. It's primitive. I designed it back then to match what I figured the Biologist would do - BLAST it and if 7 out of the top 10 agree on something, that should be what it is.
The best-match-at-all-cost configuration you chose will "achieve" maximum depth classifications at the cost of accuracy. It'll be wrong plenty of times - it's basically a best-blast approach.
Regarding threads, unfortunately it only runs on full power initially, and then ends up with only 1-2 threads, because I use
--search
as mentioned in #32. It only uses 1 PT server then.
Yes ... I know, that's a little unfortunate. In particular since you use the full NR for both alignment and search with the same parameters, so the second search is actually completely redundant. The problem is that you can in theory use two different databases for alignment and classification and two different sets of parameters, so there'd be a lot of (error prone) plumbing to get this to work properly.
I'm working on refactoring that whole part at the moment so that I can make --fs-engine internal
work correctly. Once that's done, I might fix #32, but then hopefully it won't matter any more. I'm expecting to be quite a bit faster than the PT server.
Hi!
I have used SINA 1.4.0 to align and assign taxonomy to some full length 16S sequences. When inspecting the output, the field
align_ident_slv
in the output CSV file only shows sequences with an identity of 100, everything else is just blank. I ran the same data on SINA 1.3.5 last week and all fields in the column had a value. This is the command I ran:I run Ubuntu 18.04 LTS on a Dell 7920 with 40 threads + 128GB RAM.
Thanks in advance Kasper