KennthShang / PhaGCN2.0


Recommendation for preparing viral bins for compatibility with PhaGCN input format #1

Closed liangjinsong closed 2 months ago

liangjinsong commented 1 year ago

Hi, PhaGCN seems to take each contig in a FASTA file as a viral genome and outputs a classification result for each contig. An issue therefore occurs when the input file is a viral bin, i.e. a FASTA file containing several contigs of one viral genome. Deleting both the lines starting with '>' and the line breaks (\n) in a viral bin seems like a simple solution. Do you think this method is reasonable? Do you have any recommendations?

Thank you in advance.

yuanwenguang666 commented 1 year ago

It looks like your problem can be viewed as the difference between segmented and non-segmented viruses. In our tests on segmented viruses, the results show that whether a virus is segmented or non-segmented does not have a significant impact on accuracy. If you are sure that these contigs belong to one virus, we support combining them. After all, the longer the genome, the more information it covers and the more accurate the predictions will be.
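For anyone who wants to try the merge discussed above, here is a minimal sketch in Python. It is not part of PhaGCN. The function name `merge_bin` and the record ID `bin_1` are made up for illustration. It assumes every contig in the input really belongs to the same virus, and it keeps one header line, since a FASTA record with no header at all would not be a valid input record.

```python
def merge_bin(fasta_text: str, merged_id: str) -> str:
    """Join every contig in a FASTA string into one record named merged_id.

    Hypothetical helper, not a PhaGCN function. Contigs are concatenated
    in file order with nothing inserted between them.
    """
    pieces = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the per-contig header lines
        pieces.append(line)
    sequence = "".join(pieces)
    # Re-wrap the merged sequence at 60 characters per line.
    wrapped = "\n".join(sequence[i:i + 60] for i in range(0, len(sequence), 60))
    return f">{merged_id}\n{wrapped}\n"


if __name__ == "__main__":
    example = ">contig_1\nACGT\nAC\n>contig_2\nGGTT\n"
    print(merge_bin(example, "bin_1"), end="")
```

Note that contig order and orientation inside a bin are usually unknown, so the junctions between contigs in the merged record are artificial.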

Thank you for your question.

Song-Yutong commented 1 year ago

I noticed that PhaGCN, as integrated into the online pipeline PhaBOX, can pass contigs with non-ATCG characters (>k142_test_contig4 within example file clear sequence).
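To see in advance which contigs in a file contain characters other than A, C, G and T, a small scan like the one below may help. It is only a sketch: the function name `non_acgt_contigs` is hypothetical, and it does not reproduce how PhaBOX or PhaGCN treat such contigs internally.

```python
def non_acgt_contigs(fasta_text: str) -> dict:
    """Map contig ID -> sorted list of non-ACGT characters found in it.

    Hypothetical helper, not a PhaGCN or PhaBOX function. Contigs made up
    only of A, C, G and T (case-insensitive) are left out of the result.
    """
    allowed = set("ACGT")
    found = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""  # ID is the first token of the header
            continue
        if current is None:
            continue  # sequence text before any header is ignored
        extra = set(line.upper()) - allowed
        if extra:
            found.setdefault(current, set()).update(extra)
    return {name: sorted(chars) for name, chars in found.items()}


if __name__ == "__main__":
    example = ">c1\nACGT\n>c2 some description\nACNNT\nacgy\n"
    print(non_acgt_contigs(example))
```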