biocom-uib / vpf-tools

Virus Protein Family tools
BSD 3-Clause "New" or "Revised" License

vpf-class --chunk-size parameter #29

Open wangyaxiang008 opened 2 years ago

wangyaxiang008 commented 2 years ago

Hi, thank you very much for providing such a useful virus classification and host prediction tool. I have a question about the --chunk-size parameter. Your GitHub page says that adding --chunk-size improves speed, so I compared the results with and without it. Adding the parameter does improve speed, but the results are different: with --chunk-size set, fewer results are produced. I now need to run a dataset of millions of contigs, so I have to use --chunk-size, but the reduced output is really bothering me. I hope you can help, thanks.

command one:
vpf-class-x86_64-linux@dd88a543f28eb339cf0dcb89f34479c84b3a8056 -i ../DNA.DB.retain.contig.fa --workers 50 --data-index /hl/YaxiangWang/Soft/vpf-tools/data/index.yaml -o ./

command two:
vpf-class-x86_64-linux@dd88a543f28eb339cf0dcb89f34479c84b3a8056 -i ../DNA.DB.retain.contig.fa --workers 50 --chunk-size 1000 --data-index /hl/YaxiangWang/Soft/vpf-tools/data/index.yaml -o ./

result one: [screenshot]
result two: [screenshot]

bielr commented 2 years ago

Hi,

You're right, --chunk-size affects the number of results. This is because --chunk-size (which defaults to 1; the default clearly needs to be improved) determines the input of each execution of Prodigal, and Prodigal gives a different number of results depending on its input size, since that is what it uses to train itself. My guess is that the larger the input, the better the results, which also means less noise and fewer false positives. The speed improvement comes from fewer executions (again, the default is to split the input into individual sequences, which means one execution per input sequence).
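To make that concrete, here is a minimal Python sketch (not part of vpf-tools, just an illustration of the behaviour described above): the input is split into groups of --chunk-size sequences, and Prodigal runs once, and trains once, per group.

```python
def chunks(seqs, chunk_size):
    # Split the input into groups of chunk_size sequences; one Prodigal
    # execution (and one training pass) per group.
    for i in range(0, len(seqs), chunk_size):
        yield seqs[i:i + chunk_size]

seqs = [f"contig_{i}" for i in range(10)]
print(len(list(chunks(seqs, 1))))   # 10 executions (the current default)
print(len(list(chunks(seqs, 5))))   # 2 executions
```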

wangyaxiang008 commented 2 years ago

Thank you for your reply. If I input a dataset with 1,000,000 contigs, or one with 100,000 contigs, how should I choose this parameter to get the lowest error rate? Do you have any good advice?

thanks

bielr commented 2 years ago

I'm not an expert on Prodigal, but maximizing --chunk-size is probably your best bet. The point of this parameter is to decide how granular the parallelism is, so the largest value I would suggest is (number of input sequences) / (number of workers).
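As a rough sketch of that rule of thumb (a hypothetical helper, not a vpf-tools API):

```python
def suggested_chunk_size(n_sequences: int, workers: int) -> int:
    # Largest chunk size that still leaves at least one chunk per worker,
    # per the suggestion above; floor division, with a minimum of 1.
    return max(1, n_sequences // workers)

print(suggested_chunk_size(1_000_000, 50))  # 20000
print(suggested_chunk_size(100_000, 50))    # 2000
```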

wangyaxiang008 commented 2 years ago

Thanks for your help, I will test it as you suggested and let you know my results later.

wangyaxiang008 commented 2 years ago

When I set the --chunk-size parameter to the number of contigs in my input dataset, I still get fewer results than when the parameter is set to 1. [screenshot] Also, when the task reached the second step, hmmsearch, I found that the command used only one thread, even though I set the --workers parameter to 40.

bielr commented 2 years ago

That makes sense. For the second part, the number of workers can't be greater than the number of chunks into which the input is split, and when chunk-size equals the sample size there is exactly one chunk (we disable multithreading in hmmsearch because it is quite limited compared to manually splitting the input).
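A small sketch of why only one thread was used (illustrative only, based on the explanation above): effective parallelism is capped by the number of chunks, not by --workers.

```python
import math

def effective_workers(n_sequences: int, chunk_size: int, workers: int) -> int:
    # No more workers can be busy than there are chunks to process.
    n_chunks = math.ceil(n_sequences / chunk_size)
    return min(workers, n_chunks)

# chunk-size equal to the sample size -> exactly one chunk -> one worker,
# regardless of --workers 40.
print(effective_workers(100_000, 100_000, 40))  # 1
print(effective_workers(100_000, 1_000, 40))    # 40
```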

One possibility would be to split the input again after running Prodigal. What do you think?