ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Serraplace validation #173

Closed by rcedgar 4 years ago

rcedgar commented 4 years ago

I posted a tsv with a comparison of Serraplace and Serratax on contigs from the first 1k assembly test (#162) here:

serratax_vs_serraplace_1k.tsv

Fields in the tsv file are 1. SRA accession, 2. Serraplace taxonomy, 3. Serratax taxonomy.
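
For anyone who wants to tally the comparison, here is a minimal sketch that reads a three-column tsv with the fields listed above and counts how often each tool reports "-". It assumes the file has no header row and that "-" is the only marker for a missing classification; the function name and the sample rows are made up for illustration and are not taken from the real file.

```python
import csv
import io

MISSING = "-"  # marker used in the tsv for "no classification available"


def count_missing(tsv_text):
    """Count rows where Serraplace, Serratax, or both report no classification.

    Assumes columns: 1. SRA accession, 2. Serraplace taxonomy, 3. Serratax taxonomy,
    with no header row (an assumption, adjust if the file has one).
    """
    counts = {
        "total": 0,
        "serraplace_missing": 0,
        "serratax_missing": 0,
        "both_missing": 0,
    }
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        serraplace = row[1].strip()
        serratax = row[2].strip()
        counts["total"] += 1
        if serraplace == MISSING:
            counts["serraplace_missing"] += 1
        if serratax == MISSING:
            counts["serratax_missing"] += 1
        if serraplace == MISSING and serratax == MISSING:
            counts["both_missing"] += 1
    return counts


# Placeholder rows, not real data from serratax_vs_serraplace_1k.tsv.
sample = "ACC_1\ttaxon_A\ttaxon_A\nACC_2\t-\ttaxon_B\nACC_3\t-\t-\n"
print(count_missing(sample))
```

To run it on the real file, read the file contents into a string and pass that to count_missing.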

Questions / feature requests:

  1. The Serraplace output only has taxonomy AFAICS. Can we get the placement in the tree, e.g. the best fit edge and bootstrap or something like that?

  2. The taxonomies from Serraplace all terminate at sub-genus or higher, while Serratax often gives species and sometimes strain. Reviewing some examples of these cases, I think Serraplace is too conservative, it should be possible to resolve at least species. Is this a limitation of the method, or can this be improved?

  3. Can the Serraplace taxonomy be reported as an NCBI taxonomy id?

ababaian commented 4 years ago

What do the "-" entries mean: was classification not performed, or did the software return no answer?

rcedgar commented 4 years ago

"-" is no classification available. With Serratax, it was attempted but no classification reported, I think the same for Serraplace but not sure.

pierrebarbera commented 4 years ago

  1. The Serraplace output only has taxonomy AFAICS. Can we get the placement in the tree, e.g. the best fit edge and bootstrap or something like that?

The pipeline produces all intermediate files, including the .jplace file, which holds the detailed per-query placements.
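
As a rough sketch of how the best-fit edge could be pulled out of a .jplace file: jplace is a JSON format with a "fields" list naming the columns and a "placements" list with one entry per query. The code below assumes the common field names edge_num and like_weight_ratio are present; the function name and the sample document are invented for illustration and are not Serraplace output. Note that the file reports a likelihood weight ratio (LWR) per edge rather than a bootstrap value.

```python
import json


def best_placements(jplace_text):
    """Return {query_name: best placement as a dict} from jplace JSON text.

    "Best" here means the placement with the highest like_weight_ratio.
    """
    doc = json.loads(jplace_text)
    fields = doc["fields"]
    best = {}
    for entry in doc["placements"]:
        # Each row in "p" is one candidate edge, with columns ordered as in "fields".
        rows = [dict(zip(fields, row)) for row in entry["p"]]
        top = max(rows, key=lambda r: r["like_weight_ratio"])
        # Query names are stored under "n", or under "nm" as [name, multiplicity] pairs.
        names = entry.get("n") or [pair[0] for pair in entry.get("nm", [])]
        for name in names:
            best[name] = top
    return best


# Hypothetical jplace content with made-up numbers.
SAMPLE = """
{
  "version": 3,
  "tree": "((A:0.1{0},B:0.2{1}):0.05{2},C:0.3{3}){4};",
  "fields": ["edge_num", "likelihood", "like_weight_ratio",
             "distal_length", "pendant_length"],
  "placements": [
    {"p": [[0, -1234.5, 0.8, 0.02, 0.01],
           [2, -1236.0, 0.2, 0.01, 0.03]],
     "n": ["query_1"]}
  ],
  "metadata": {"invocation": "hypothetical example"}
}
"""

for name, placement in best_placements(SAMPLE).items():
    print(name, placement["edge_num"], placement["like_weight_ratio"])
```

The edge numbers refer to the {n} labels embedded in the "tree" string, so mapping an edge back to a clade needs the tree from the same file.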

  2. The taxonomies from Serraplace all terminate at sub-genus or higher, while Serratax often gives species and sometimes strain. Reviewing some examples of these cases, I think Serraplace is too conservative, it should be possible to resolve at least species. Is this a limitation of the method, or can this be improved?

The assignment depends in part on a "distribution ratio". If a placement has an LWR (likelihood weight ratio) of 1.0 on an edge, the algorithm by default distributes that weight according to how close the placement is to either end of the edge. So if the query's attachment point is 90% of the way toward one end of the edge, 90% of that LWR is assigned to the taxonomic label at that node of the tree (which should be strain level if the edge leads to a leaf of the tree).

What we can try is to always assign the entire weight to the distal node, which shifts more weight toward the leaves. If assignments are still non-specific, that means the placement itself was non-specific.
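
To make the two behaviours concrete, here is a small sketch of the weight split as described above. It is an illustration under stated assumptions, not the code the pipeline actually runs: the function name, the toward_distal parameter (fraction of the edge length between the proximal node and the attachment point), and the all_to_distal switch are all hypothetical.

```python
def split_lwr(lwr, toward_distal, all_to_distal=False):
    """Split a placement's LWR between the two ends of its edge.

    lwr            -- likelihood weight ratio of the placement on this edge
    toward_distal  -- how far along the edge, from the proximal node toward
                      the distal node, the query attaches (0.0 to 1.0)
    all_to_distal  -- if True, give the whole weight to the distal node

    Returns (proximal_weight, distal_weight).
    """
    if not 0.0 <= toward_distal <= 1.0:
        raise ValueError("toward_distal must be between 0 and 1")
    if all_to_distal:
        return 0.0, lwr
    distal_weight = lwr * toward_distal
    return lwr - distal_weight, distal_weight


# Default behaviour: attachment 90% toward the distal node, so roughly
# 90% of the weight goes to the distal label and 10% to the proximal one.
print(split_lwr(1.0, 0.9))

# Proposed change: the distal node gets everything.
print(split_lwr(1.0, 0.9, all_to_distal=True))
```

With the second setting, a query placed on a leaf edge contributes its full weight to the leaf's label, so any remaining non-specific assignments would come from the LWR being spread over several edges.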

  3. Can the Serraplace taxonomy be reported as an NCBI taxonomy id?

This is not currently implemented.

What do the "-" entries mean: was classification not performed, or did the software return no answer?

So far I've only included the 136 cat-A assemblies, so this is probably the cause of the non-assignments. If a contig aligns, it will be placed in some way. The worst case for things that align is assignment at the highest taxonomic level.