NDBL / HECIL

Hybrid Error Correction of Long Reads using Iterative Learning
10 stars 2 forks source link

Heterozygosity awareness ! #8

Open jnarayan81 opened 5 years ago

jnarayan81 commented 5 years ago

Hi @ochoudhu I did not find any proper answer on Heterozygosity awareness in the paper ! Does your tools take care of heterozygosity? How does it deal with heterozygous region?

ochoudhu commented 5 years ago

Hi @jnarayan81: Here is the paragraph where we discuss heterozygous samples (p. 3) "For highly heterozygous samples such as insects like mosquitoes, low frequency bases in aligned short reads may indicate inherent variation that are not necessarily sequencing errors. Correction algorithms that solely rely on a consensus call or majority vote often discard these heterogenous alleles. The optimization-based correction step of HECIL is not biased by bases which have high frequency, and hence, is better able to capture variation between similar individuals. This is corroborated by the results obtained from testing HECIL on the highly heterozygous mosquito data set of A. funestus."

jnarayan81 commented 5 years ago

Thanks. I run it on ONT reads, but it seems to run for infinite time on correction steps.

[jnarayan@hmem00 HECIL]$ squeue -u jnarayan
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1267206      High correcti jnarayan  R 6-19:33:02      1 hmem01

I will be killed after 7 days, due to requested time limit on my server :( .. It does not seems to be writing anything in outfile. Does it expected to run for that long ?

Note: I run alignment on my local workstation and uploaded the *.sam and Pileup.txt on server. Hence commented the alignment section in bash file.

#!/bin/bash

# Align short and long reads
#echo 'Started running alignment'
#./bwa index $1 2>stdout.txt
#./bwa mem -t 40 $1 $2 > Out.sam 2>stdout.txt

# Create Pileup File
#./samtools view -bS Out.sam | ./samtools sort - Out.sort 2>stdout.txt
#./samtools mpileup -s -f $1 Out.sort.bam  > Pileup.txt 2>stdout.txt

# Remove intermediate file
#rm $1.sa
#rm $1.pac
#rm $1.bwt
#rm $1.ann
#rm $1.amb
#rm Out.sort.bam

echo -e 'Finished running alignment\n'

#----------------------------------------------

# HECIL Core Algorithm

#Get no. of long reads
d=2
num_refline="$(wc -l "$1" | awk '{print $1}')"
num_ref=$(($num_refline / $d))

#'./Align.sh '+LR+' '+SR+' '+len_SR+' '+out+' '+lc+' '+c+' '+k
#$1=LR, $2=SR, $3=len_SR, $4=Output, $5=Uncorr (LowConf Corr), $6=cutoff_QC, $7=conf_threshold

echo 'Started running correction'
python2 Correction.py Pileup.txt $1 $5 $num_ref $3 $6 $7 > $4

echo -e 'Finished running correction\n'

num_correfline="$(wc -l "$4" | awk '{print $1}')"
num_corref=$(($num_correfline / $d))

echo -e 'Generating output file with all (corrected and uncorrected) reads\n'

python2 Create_Corrected_AllLRReads.py $1 $num_ref $4 $num_corref

rm $1.fai
rm Out.sam

echo -e "Finished running HECIL\n"
ochoudhu commented 5 years ago

Could you let me know the size of the Pileup.txt file? You can also consider splitting the Pileup.txt file into multiple smaller files and running the correction step in parallel.

jnarayan81 commented 5 years ago

It is around 160GB .. I will try splitting it, and let you know.

[jnarayan@hmem00 HECIL]$ ls -lh
total 259G
-rw-rw-r-- 1 jnarayan jnarayan 1,2K  5. Okt 12:48 Align_Corr_backup.sh
-rwxrwxr-x 1 jnarayan jnarayan  784  5. Okt 12:48 Align_Corr_modified.sh
-rwxrwxr-x 1 jnarayan jnarayan 1,2K  5. Okt 14:30 Align_Corr.sh
-rwxrwxr-x 1 jnarayan jnarayan 1,4M  5. Okt 13:40 bwa
-rw-rw-r-- 1 jnarayan jnarayan  18K  5. Okt 13:40 Correction.py
-rw-rw-r-- 1 jnarayan jnarayan 1,3K  5. Okt 10:37 Create_Corrected_AllLRReads.py
-rw-rw-r-- 1 jnarayan jnarayan 1,4K  5. Okt 10:37 HECIL.py
-rw-rw-r-- 1 jnarayan jnarayan  653  5. Okt 16:21 HECILsubmit.sh
-rw-rw-r-- 1 jnarayan jnarayan    0  5. Okt 10:37 LowConf.txt
-rw-rw-r-- 1 jnarayan jnarayan    0  5. Okt 13:40 ONTOut
-rw-rw-r-- 1 jnarayan jnarayan  99G  5. Okt 15:49 Out.sam
-rw-rw-r-- 1 jnarayan jnarayan 160G  5. Okt 12:48 Pileup.txt
-rw-rw-r-- 1 jnarayan jnarayan  788  5. Okt 13:40 README.md
-rwxrwxr-x 1 jnarayan jnarayan 1,3M  5. Okt 13:40 samtools
-rw-rw-r-- 1 jnarayan jnarayan 619K  5. Okt 10:37 ShortRead.fq
-rw-rw-r-- 1 jnarayan jnarayan  224  5. Okt 16:22 slurm-1267206.out
-rw-rw-r-- 1 jnarayan jnarayan    0  5. Okt 14:14 stdout.txt