ComputationalAgronomy / biopathway-prediction


Parallel running of blast #3

Closed yehzx closed 2 years ago

yehzx commented 2 years ago

Unexpected output when using parallel

Environment

machine: AWS EC2 t3a.small
version: zye21_3

Description

I want to run blastp on multiple CPUs. Prokka uses the following command to split the protein sequences in a .faa file into individual records and pipe them to blastp through parallel.

cat NP1\/NP1\.sprot\.tmp\.12329\.faa | parallel --gnu --plain -j 2 --block 225323 --recstart '>' --pipe blastp -query - -db /home/zye21/.conda/envs/microbe/db/kingdom/Bacteria/sprot -evalue 1e-09 -qcov_hsp_perc 80 -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no > NP1\/NP1\.sprot\.tmp\.12329\.blast 2> /dev/null

Based on this, I ran the following command (you can copy it and run it on my current instance):

cat test.faa | parallel --gnu --plain -j 2 --recstart '>' --pipe blastp -query - -db /home/zye21/blast_test/uniprot_sprot.fasta -evalue 1e-10 -num_threads 1 -num_alignments 3 -outfmt 5 -out result.xml

I think the main problem is that --recstart doesn't work as I expected.

Expected

>Sequence_1
protein sequences...        -> job 1 to blastp
----------------------
>Sequence_2
protein sequences...        -> job 2 to blastp

Actual (all together)

>Sequence_1
protein sequences...
>Sequence_2
protein sequences...        -> job 1 to blastp

I referred to the GNU parallel documentation and followed its instructions. Am I misunderstanding anything?

stevenhwu commented 2 years ago

--block 225323
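
The hint above points at the one flag the test command dropped. Without --block, GNU parallel falls back to its default block size of 1M, so a small test.faa most likely fits in a single block and both records reach the same blastp job. A minimal sketch of picking a block size from the input, assuming the goal is roughly one block per job (the two-record file and the variable names are hypothetical, and this is not necessarily how Prokka derives its value):

```shell
# Hypothetical two-record FASTA file standing in for test.faa.
faa=$(mktemp)
printf '>Sequence_1\nMKTAYIAKQR\n>Sequence_2\nMSDNELLKQA\n' > "$faa"

jobs=2
bytes=$(( $(wc -c < "$faa") ))   # total input size in bytes
records=$(grep -c '^>' "$faa")   # number of FASTA records

# Ceiling division: aim for about one block per job.
block=$(( (bytes + jobs - 1) / jobs ))

echo "bytes=$bytes records=$records block=$block"
rm -f "$faa"
```

The resulting value could then be passed as --block "$block" in the test command. GNU parallel may still adjust it, since the block size can be off by the length of one record.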

stevenhwu commented 2 years ago

 --block-size size
                Size of block in bytes to read at a time. The size can be postfixed with K, M, G, T, P, k, m, g, t, or p which
                would multiply the size with 1024, 1048576, 1073741824, 1099511627776, 1125899906842624, 1000, 1000000,
                1000000000, 1000000000000, or 1000000000000000, respectively.

                GNU parallel tries to meet the block size but can be off by the length of one record. For performance reasons
                size should be bigger than a two records. GNU parallel will warn you and automatically increase the size if you
                choose a size that is too small.

                If you use -N, --block-size should be bigger than N+1 records.

                size defaults to 1M.

                See --pipe and --pipepart for use of this.
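
The note on -N in the excerpt above suggests another option: with --pipe, -N counts records, so -N 1 should hand each job one record regardless of the block size. A rough sketch, assuming GNU parallel is installed; grep -c stands in for blastp so the per-job record count is visible, and the input file is hypothetical:

```shell
# Hypothetical two-record FASTA file standing in for test.faa.
faa=$(mktemp)
printf '>Sequence_1\nMKTAYIAKQR\n>Sequence_2\nMSDNELLKQA\n' > "$faa"
records=$(grep -c '^>' "$faa")   # records in the whole file

# -N 1 with --pipe requests one record per job; --recstart '>' marks
# where each FASTA record begins. Each job prints how many records it got.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel --gnu --plain -j 2 -N 1 --recstart '>' --pipe "grep -c '^>'" < "$faa"
else
    echo "GNU parallel not found; skipping the split" >&2
fi

echo "records in file: $records"
rm -f "$faa"
```

Swapping the stand-in for the blastp command would give one query per job. Note that Prokka's command redirects stdout once, whereas -out result.xml makes every job write to the same path.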