Classify Phage - Genome Fractions

ManuelSokolov commented 7 months ago

Hi,

I am classifying phages according to their taxonomy. My issue is that my FASTA files contain fragments of the genome instead of the whole genome.

My FASTA files have the following format (Genome1.fasta):

>k141_291006
TCG...
>k141_386008
TCG....

PhaGCN classifies this phage genome as:

k141_291006,19687,Casjensviridae,0.17071722 
k141_386008,108404,Herelleviridae,1.0

So it gives two different classifications. Here one of them has probability 1.0, but in other cases no classification has a clearly higher probability. The examples for this tool use whole genomes for classification.

Can I just concatenate the sequences and classify them as an entire genome?
Should I align them against each other using multiple sequence alignment? I thought about aligning against a reference, but since I do not know which family they belong to, finding an accurate reference genome is hard and not a robust method.
Should I classify according to the most probable classification and, if the probabilities are all very similar (e.g. 0.5 and 0.4), take a consensus?

Best Regards and Thank You

KennthShang commented 7 months ago

Hi,

Thanks for using our tools.

Of course, you can concatenate the sequences and classify them as an entire genome. Based on the algorithm design, I suppose this should not affect the prediction much (but we have not tested it).
Rather than using multiple sequence alignment, you could run CD-HIT or MMseqs2 to check whether the sequences are redundant. Then you can choose the representative sequences as your final genome.
If you have multiple sequences for one phage, you can also run the program on all of them and use weighted majority voting for the final prediction. Specifically, use each sequence's prediction as its vote and the provided score as the weight of that vote.
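
The weighted majority voting described above could be sketched roughly as follows. This is only an illustration, not part of PhaGCN: the function names `parse_predictions` and `weighted_majority_vote` are made up here, and the parsing assumes each prediction line has four comma-separated fields as in the output quoted earlier in this thread (sequence name, a second field that is ignored, family, score) with no header row.

```python
import csv
from collections import defaultdict


def parse_predictions(lines):
    """Yield (sequence_name, family, score) from prediction lines.

    Assumption: four comma-separated fields per line, as in the output
    quoted in this thread. The second field is not used for voting.
    """
    for fields in csv.reader(lines):
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, _unused, family, score = (f.strip() for f in fields[:4])
        yield name, family, float(score)


def weighted_majority_vote(predictions):
    """Sum the scores per family and return (winning_family, totals)."""
    totals = defaultdict(float)
    for _name, family, score in predictions:
        totals[family] += score  # each sequence votes with its score as weight
    if not totals:
        raise ValueError("no predictions to vote on")
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


# The two predictions reported for Genome1.fasta in this issue.
example = [
    "k141_291006,19687,Casjensviridae,0.17071722 ",
    "k141_386008,108404,Herelleviridae,1.0",
]
family, totals = weighted_majority_vote(parse_predictions(example))
print(family, totals)
```

With the two predictions from this issue, Herelleviridae collects a total weight of 1.0 against about 0.17 for Casjensviridae, so it would be the final call for the genome. When the totals are close, it may be safer to report the genome as ambiguous than to pick a winner.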

Hope this information will help.

Best, Jiayu

ManuelSokolov commented 7 months ago

Hi Kennth, thank you for creating the tool and for the response.

I will test the options and let you know the result.

Best Regards,

Manuel

KennthShang / PhaGCN

Classify Phage - Genome Fractions #12