Open jnarayan81 opened 5 years ago
Hi @jnarayan81: Here is the paragraph where we discuss heterozygous samples (p. 3) "For highly heterozygous samples such as insects like mosquitoes, low frequency bases in aligned short reads may indicate inherent variation that are not necessarily sequencing errors. Correction algorithms that solely rely on a consensus call or majority vote often discard these heterogenous alleles. The optimization-based correction step of HECIL is not biased by bases which have high frequency, and hence, is better able to capture variation between similar individuals. This is corroborated by the results obtained from testing HECIL on the highly heterozygous mosquito data set of A. funestus."
Thanks. I run it on ONT reads, but it seems to run for infinite time on correction steps.
[jnarayan@hmem00 HECIL]$ squeue -u jnarayan
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1267206 High correcti jnarayan R 6-19:33:02 1 hmem01
I will be killed after 7 days, due to requested time limit on my server :( .. It does not seems to be writing anything in outfile. Does it expected to run for that long ?
Note: I run alignment on my local workstation and uploaded the *.sam and Pileup.txt on server. Hence commented the alignment section in bash file.
#!/bin/bash
# Align short and long reads
#echo 'Started running alignment'
#./bwa index $1 2>stdout.txt
#./bwa mem -t 40 $1 $2 > Out.sam 2>stdout.txt
# Create Pileup File
#./samtools view -bS Out.sam | ./samtools sort - Out.sort 2>stdout.txt
#./samtools mpileup -s -f $1 Out.sort.bam > Pileup.txt 2>stdout.txt
# Remove intermediate file
#rm $1.sa
#rm $1.pac
#rm $1.bwt
#rm $1.ann
#rm $1.amb
#rm Out.sort.bam
echo -e 'Finished running alignment\n'
#----------------------------------------------
# HECIL Core Algorithm
#Get no. of long reads
d=2
num_refline="$(wc -l "$1" | awk '{print $1}')"
num_ref=$(($num_refline / $d))
#'./Align.sh '+LR+' '+SR+' '+len_SR+' '+out+' '+lc+' '+c+' '+k
#$1=LR, $2=SR, $3=len_SR, $4=Output, $5=Uncorr (LowConf Corr), $6=cutoff_QC, $7=conf_threshold
echo 'Started running correction'
python2 Correction.py Pileup.txt $1 $5 $num_ref $3 $6 $7 > $4
echo -e 'Finished running correction\n'
num_correfline="$(wc -l "$4" | awk '{print $1}')"
num_corref=$(($num_correfline / $d))
echo -e 'Generating output file with all (corrected and uncorrected) reads\n'
python2 Create_Corrected_AllLRReads.py $1 $num_ref $4 $num_corref
rm $1.fai
rm Out.sam
echo -e "Finished running HECIL\n"
Could you let me know the size of the Pileup.txt file? You can also consider splitting the Pileup.txt file into multiple smaller files and running the correction step in parallel.
It is around 160GB .. I will try splitting it, and let you know.
[jnarayan@hmem00 HECIL]$ ls -lh
total 259G
-rw-rw-r-- 1 jnarayan jnarayan 1,2K 5. Okt 12:48 Align_Corr_backup.sh
-rwxrwxr-x 1 jnarayan jnarayan 784 5. Okt 12:48 Align_Corr_modified.sh
-rwxrwxr-x 1 jnarayan jnarayan 1,2K 5. Okt 14:30 Align_Corr.sh
-rwxrwxr-x 1 jnarayan jnarayan 1,4M 5. Okt 13:40 bwa
-rw-rw-r-- 1 jnarayan jnarayan 18K 5. Okt 13:40 Correction.py
-rw-rw-r-- 1 jnarayan jnarayan 1,3K 5. Okt 10:37 Create_Corrected_AllLRReads.py
-rw-rw-r-- 1 jnarayan jnarayan 1,4K 5. Okt 10:37 HECIL.py
-rw-rw-r-- 1 jnarayan jnarayan 653 5. Okt 16:21 HECILsubmit.sh
-rw-rw-r-- 1 jnarayan jnarayan 0 5. Okt 10:37 LowConf.txt
-rw-rw-r-- 1 jnarayan jnarayan 0 5. Okt 13:40 ONTOut
-rw-rw-r-- 1 jnarayan jnarayan 99G 5. Okt 15:49 Out.sam
-rw-rw-r-- 1 jnarayan jnarayan 160G 5. Okt 12:48 Pileup.txt
-rw-rw-r-- 1 jnarayan jnarayan 788 5. Okt 13:40 README.md
-rwxrwxr-x 1 jnarayan jnarayan 1,3M 5. Okt 13:40 samtools
-rw-rw-r-- 1 jnarayan jnarayan 619K 5. Okt 10:37 ShortRead.fq
-rw-rw-r-- 1 jnarayan jnarayan 224 5. Okt 16:22 slurm-1267206.out
-rw-rw-r-- 1 jnarayan jnarayan 0 5. Okt 14:14 stdout.txt
Hi @ochoudhu I did not find any proper answer on Heterozygosity awareness in the paper ! Does your tools take care of heterozygosity? How does it deal with heterozygous region?