NDBL / HECIL

Hybrid Error Correction of Long Reads using Iterative Learning

Parallelization of task #7

Closed arunprasanna83 closed 5 years ago

arunprasanna83 commented 5 years ago

Hi,

Quoting from your article: "The current version of HECIL allows decomposition of the workload into independent data-parallel tasks that can be executed simultaneously. A natural extension of the tool will be to implement multi-threading to achieve speedup on traditional machines."

How do I set up independent data-parallel tasks? Can you share an example?

Thanks in advance. AP

ochoudhu commented 5 years ago

You can split the file containing long reads into multiple smaller files and run HECIL on each. For example, if you split a file LR.fa into 3 subsets (LR1.fa, LR2.fa, and LR3.fa), you can run the following 3 tasks in parallel:

python HECIL.py -l LR1.fa -s ShortRead.fq -len 202 -o Out1
python HECIL.py -l LR2.fa -s ShortRead.fq -len 202 -o Out2
python HECIL.py -l LR3.fa -s ShortRead.fq -len 202 -o Out3

After the completion of the tasks, you would have to merge the output files.
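
The split and merge steps could be scripted roughly as follows. This is only a minimal sketch, assuming the long reads are in plain FASTA format (header lines starting with ">") and that the per-task outputs are text files that can simply be concatenated, which is not confirmed here for HECIL's output. The helper names split_fasta and merge_files are illustrative and not part of HECIL.

```python
from pathlib import Path


def split_fasta(path, n_parts, prefix="LR"):
    """Distribute FASTA records round-robin over n_parts files.

    Writes <prefix>1.fa ... <prefix>N.fa and returns their paths.
    """
    out_paths = [Path(f"{prefix}{i + 1}.fa") for i in range(n_parts)]
    handles = [p.open("w") for p in out_paths]
    try:
        record = -1  # index of the record currently being copied
        with open(path) as fin:
            for line in fin:
                if line.startswith(">"):
                    record += 1  # a header line starts a new record
                if record >= 0:  # skip anything before the first header
                    handles[record % n_parts].write(line)
    finally:
        for handle in handles:
            handle.close()
    return out_paths


def merge_files(parts, merged_path):
    """Concatenate the per-task files, in order, into one file."""
    with open(merged_path, "w") as fout:
        for part in parts:
            last = "\n"
            with open(part) as fin:
                for line in fin:
                    fout.write(line)
                    last = line
            if not last.endswith("\n"):
                fout.write("\n")  # keep the next file's first line separate
```

With the example above, split_fasta("LR.fa", 3) would write LR1.fa, LR2.fa, and LR3.fa to the current directory. Round-robin assignment keeps the number of reads per subset balanced, although not necessarily the total number of bases.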

apn83 commented 5 years ago

I am not sure this is possible without modifying 'Align_Corr.sh'. In its present form it runs ./bwa mem -t 12 $1 $2 > Out.sam 2>stdout.txt. When I split the long reads and run multiple instances, every instance overwrites the same Out.sam and the jobs keep failing. I guess the line has to be changed to ./bwa mem -t 12 $1 $2 >> Out.sam 2>stdout.txt so that the alignments are appended instead. Am I right?

ochoudhu commented 5 years ago

If you append the SAM files, you may end up with a single Pileup file, which you would need to split again. In that case, you can run the script Correction.py for each Pileup subset. Otherwise, you could rename (using command line arguments) the intermediate files generated by Align_Corr.sh.
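
To illustrate the renaming option, here is a minimal sketch of the alignment step with task-specific intermediate files, so that parallel runs do not write to the same Out.sam. The names task_files, bwa_command, and run_alignment, as well as the Out_<id>.sam and stdout_<id>.txt naming scheme, are hypothetical. Only the bwa mem invocation is taken from the line quoted from Align_Corr.sh; the other steps of that script are not shown and would need the same treatment.

```python
import subprocess


def task_files(task_id):
    """Hypothetical per-task names for the files the alignment step writes."""
    return {"sam": f"Out_{task_id}.sam", "log": f"stdout_{task_id}.txt"}


def bwa_command(reference, reads, threads=12):
    """Same invocation as in Align_Corr.sh ($1 and $2), as an argument list."""
    return ["./bwa", "mem", "-t", str(threads), reference, reads]


def run_alignment(reference, reads, task_id):
    """Run bwa mem for one task and return its exit code."""
    files = task_files(task_id)
    with open(files["sam"], "w") as sam, open(files["log"], "w") as log:
        # stdout goes to the SAM file and stderr to the log,
        # mirroring "> Out.sam 2>stdout.txt" in the shell script.
        result = subprocess.run(
            bwa_command(reference, reads), stdout=sam, stderr=log
        )
    return result.returncode
```

Each task then reads and writes only its own files, for example Out_1.sam for task 1 and Out_2.sam for task 2. Note that every instance still requests 12 threads, so the number of simultaneous tasks should be chosen with the available cores in mind.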