How to run data - GitHub issues

xiaozhangzhang123 commented 2 years ago

In the verify section, each virus FASTA file downloaded from the RefSeq dataset contains multiple sequences, each introduced by a ">" line. Are these sequences contigs? During naive Bayesian classifier training and sequence prediction, is the whole FASTA file treated as one processing unit, or is each ">" record the basic processing unit?
In the second part, can the 'viralverify' and 'training_script' scripts take only one sequence as input? How should a whole dataset be processed? Do I need to batch the runs with a shell script under Linux?

mikeraiko commented 2 years ago

I'm not sure I follow you completely. The tool takes FASTA format as input, where each contig is represented by a description line starting with ">" followed by the contig name, and then the sequence itself on the following lines.
You can provide a FASTA file with one or multiple sequences as input - each sequence will be analysed independently, and you'll get a single output file for all input sequences.
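
To make the "one record per ">" line" point concrete, here is a minimal Python sketch of how a multi-sequence FASTA file breaks down into independent records. It is only an illustration of the file format, not the tool's actual parser; the function name `read_fasta` and the example contig names are made up for this example.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for every '>' record in a FASTA stream."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""  # name = first word of the header
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence lines may wrap over several lines
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical two-contig file held in memory for the demo.
example = io.StringIO(">contig_1 demo\nACGT\nACGT\n>contig_2\nGGCC\n")
records = list(read_fasta(example))
print(records)  # [('contig_1', 'ACGTACGT'), ('contig_2', 'GGCC')]
```

Each tuple here corresponds to one sequence that would be analysed on its own, regardless of how many records share the file.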

xiaozhangzhang123 commented 2 years ago

Thank you very much for your reply. I still have some questions:

In training_script, when I use my own database to train the probability table, should the input be a single virus / bacterium / plasmid, or a virus library / bacteria library / plasmid library?
In the benchmark stage, I get prediction results for all the contigs in a FASTA file. How do I get a prediction for the FASTA file as a whole, and how do I use it for benchmarking?

mikeraiko commented 2 years ago

It is supposed to be a library. Technically, you can train it on a single virus/bacterium/plasmid, but that is meaningless - you will get statistics for only a few proteins, and the results will be totally unreliable.
You can split your dataset into two parts - train and test. Then you can use the train dataset to build your own probability table, and use the test dataset for verification.
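
A train/test split like this could be sketched in Python as below. This is only one possible approach, not something taken from the repository: the helper names `split_train_test` and `to_fasta`, the 20% test fraction and the fixed seed are all assumptions chosen for illustration. Records are (name, sequence) pairs.

```python
import random
from typing import List, Tuple

Record = Tuple[str, str]  # (name, sequence)


def split_train_test(
    records: List[Record], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Record], List[Record]]:
    """Shuffle the records reproducibly and return (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def to_fasta(records: List[Record]) -> str:
    """Serialise records back to FASTA text, one sequence per line."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Hypothetical library of ten genomes.
demo = [(f"genome_{i}", "ACGT" * (i + 1)) for i in range(10)]
train, test = split_train_test(demo, test_fraction=0.2, seed=42)
print(len(train), len(test))  # 8 2
```

You would run this once per class (viruses, bacteria, plasmids) and write each half out with `to_fasta`. If a genome is stored as several contigs, it is probably safer to split by genome rather than by contig, so that pieces of the same genome do not end up in both sets.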

xiaozhangzhang123 commented 2 years ago

How do I use a virus library / bacteria library / plasmid library for training? What should the arguments after -v, -nv and -p be? Can each be a folder path containing all the FASTA files, or should I put all viruses into one FASTA file, all plasmids into one FASTA file, and all bacteria into one FASTA file?
Would it be possible to share the training and test datasets mentioned in your paper? Where can I get the plasmid dataset?

Thank you very much!

mikeraiko commented 2 years ago

You can run the training script without any parameters or with the "-h" key to see usage help. You should provide separate files with viruses, bacterial contigs and plasmids (using the keys -v, -nv and -p, respectively).
I can probably dig the datasets up and upload them somewhere, but it will take some time. In any case, they are what produced the provided probability table. If you want to do some benchmarking, I suggest using your own dataset - the number of known plasmids and viruses has grown significantly since the paper was published.
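
If your sequences are currently spread over many FASTA files in a folder, one way to get a single file per class is to concatenate them first. The sketch below is only an illustration: the helper name `merge_fasta_files`, the accepted file extensions and the demo file names are assumptions, and the exact training command is deliberately left out (check the "-h" output for that).

```python
import tempfile
from pathlib import Path
from typing import Tuple


def merge_fasta_files(
    input_dir: Path,
    output_path: Path,
    suffixes: Tuple[str, ...] = (".fasta", ".fa", ".fna"),
) -> int:
    """Concatenate all FASTA files in input_dir into output_path.

    Returns the number of files that were merged.
    """
    merged = 0
    with output_path.open("w") as out:
        for path in sorted(input_dir.iterdir()):
            if not path.is_file() or path.suffix.lower() not in suffixes:
                continue
            if path.resolve() == output_path.resolve():
                continue  # never read the output file back into itself
            text = path.read_text()
            if not text:
                continue
            # Make sure the next file's '>' header starts on its own line.
            out.write(text if text.endswith("\n") else text + "\n")
            merged += 1
    return merged


# Demo on a throwaway directory with two hypothetical virus files.
with tempfile.TemporaryDirectory() as tmp:
    viruses = Path(tmp) / "viruses"
    viruses.mkdir()
    (viruses / "a.fasta").write_text(">virus_a\nACGT")
    (viruses / "b.fna").write_text(">virus_b\nGGCC\n")
    merged_file = Path(tmp) / "all_viruses.fasta"
    n_merged = merge_fasta_files(viruses, merged_file)
    merged_text = merged_file.read_text()

print(n_merged, merged_text.count(">"))  # 2 2
```

Repeating this for the bacterial and plasmid folders gives the three separate files that the -v, -nv and -p keys expect.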

ablab / viralVerify

How to run data #14