basehc / IPEV

Software tool to identify prokaryotic and eukaryotic virus-derived sequences in virome using deep learning. Used to calculate a set of scores that reflect the probability that input sequence fragments are prokaryotic and eukaryotic viral sequences.
GNU General Public License v3.0

About Non-Virus Removal Feature in IPEV Program #15

Closed shengxingou closed 6 months ago

shengxingou commented 6 months ago

I am very grateful to you for developing this excellent software. I would like to know how the software's function for eliminating false-positive non-viral components (bacteria and fungi) is implemented. Is this feature implemented with convolutional neural networks (CNNs)? If so, how effective is it?

Thank you for your valuable time and energy!

basehc commented 6 months ago

Yes, this function uses a convolutional neural network to distinguish viruses from non-viral components (bacteria and fungi), built on the sequence pattern matrix and neural network framework within IPEV. To build the data sets, we searched the NCBI Assembly database with the queries “(Fungi[orgn]) AND (representative genome[filter])” and “(Bacteria[orgn]) AND (representative genome[filter])”. We downloaded 1,428 bacterial genome sequences (plasmids excluded, only those labeled “complete” included) and 99 fungal sequences (macrofungi excluded). Using MetaSim, we simulated contigs to create negative sample groups for the bacterial and fungal data sets separately: 1,000,000 for Group A (100-400 bp), 900,000 for Group B (400-800 bp), 800,000 for Group C (800-1,200 bp), and 700,000 for Group D (1,200-1,800 bp). Following the sequence pattern matrix and neural network framework outlined in our manuscript, the model was trained, validated, and tested on Groups A to D using an 8:1:1 split.
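The length groups and the 8:1:1 split described above can be sketched roughly as follows. This is only an illustration, not the code used in IPEV: the function names `assign_group` and `split_8_1_1`, the fixed seed, and the handling of boundary lengths (lower bound inclusive, upper bound exclusive except for Group D) are assumptions made for the example.

```python
import random

# Length ranges (bp) for Groups A to D, as listed in the comment above.
GROUPS = {"A": (100, 400), "B": (400, 800), "C": (800, 1200), "D": (1200, 1800)}


def assign_group(length):
    """Return the group label for a contig length, or None if out of range.

    Assumption: each range is [lo, hi), except Group D, which also
    includes its upper bound. The thread does not specify this.
    """
    for label, (lo, hi) in GROUPS.items():
        upper_ok = length <= hi if label == "D" else length < hi
        if lo <= length and upper_ok:
            return label
    return None


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/validation/test at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)  # hypothetical fixed seed for reproducibility
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

For example, `split_8_1_1(range(1000))` yields 800 training, 100 validation, and 100 test items, and `assign_group(950)` returns `"C"`.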

In this experiment, the test set for each of Groups A to D contained equal numbers of viral and non-viral (bacterial and fungal) sequences. Sensitivity (Sn) and specificity (Sp) were then calculated for each group. For viral identification, Sn was 73.1% for Group A, 83.3% for Group B, 90.5% for Group C, and 93.1% for Group D. Correspondingly, for non-viral identification, Sp was 84.5% for Group A, 93.7% for Group B, 95.7% for Group C, and 97.3% for Group D. This functionality, aimed at reducing false-positive (non-viral) samples, has been integrated into the IPEV tool and is controlled by a user-activated switch.
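For reference, Sn and Sp follow the standard definitions Sn = TP / (TP + FN) and Sp = TN / (TN + FP), with viral sequences as the positive class. A minimal sketch of the calculation is below; the function name `sensitivity_specificity` and the toy labels are hypothetical and are not taken from the IPEV code or its test data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (Sn, Sp) from binary labels, where 1 = viral and 0 = non-viral."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # viral, called viral
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # viral, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-viral, rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-viral, called viral
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one viral and one non-viral sample")
    return tp / (tp + fn), tn / (tn + fp)


# Toy balanced test set: 4 viral and 4 non-viral sequences (made-up labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
sn, sp = sensitivity_specificity(y_true, y_pred)
print(f"Sn = {sn:.3f}, Sp = {sp:.3f}")  # Sn = 0.750, Sp = 0.750
```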

shengxingou commented 6 months ago

Thank you for your reply.