cbg-ethz / PredictHaplo

This software aims at reconstructing haplotypes from next-generation sequencing data.
GNU General Public License v3.0

Error: No valid reads were discovered. - Warning: 97.6% of the reads were discarded because they are unpaired. #27

Open Masterxilo opened 2 years ago

Masterxilo commented 2 years ago

Hi there, Lisa and I (Paul) are trying to run PredictHaplo on SAM/BAM files for NGS HIV data.

The aligned data was obtained using https://github.com/medvir/SmaltAlign/blob/handle_triplet_SNVs/smaltalign_indel.sh#L152

    log "Uncompressing gunzip (.gz) file"
    gzip -d /output/*_R1_*.fastq.gz
    gzip -d /output/*_R2_*.fastq.gz

    log "Concatenate forward and reverse reads"
    cat /output/*_R1_*.fastq /output/*_R2_*.fastq > /output/merged.fastq
#...

        smaltalign_indel.sh \
            -r "$REFERENCE_FILE" \
            -n $NUMBER_OF_READS \
            -i $ITERATION_COUNT \
            ./merged.fastq

# --> we take the file merged_ITERATION_COUNT_sorted.bam
# reference.fasta = HXB2, md5sum d79412993adaa28878de4d00f4d86cfe

We use

    function bam2sam() {
        samtools view -h -o "$2" "$1"
    }

    bam2sam merged_ITERATION_COUNT_sorted.bam merged_ITERATION_COUNT_sorted.sam

to get the SAM file. Could this be a problem?

We then invoke PredictHaplo as follows and get this output:

    + ./predicthaplo --sam ./example-inputs/merged_ITERATION_COUNT_sorted.sam --reference ./example-inputs/reference.fasta
    Configuration:
      prefix = predicthaplo_output/ph_
      cons = ./example-inputs/example.fasta
      visualization_level = 1
      FASTAreads = ./example-inputs/example.sam
      have_true_haplotypes = 1
      FASTAhaplos = 
      do_local_Analysis = 1
    Warning: 2.4% of the reads were discarded because they are unmapped.
    Warning: 97.6% of the reads were discarded because they are unpaired.
    Error: No valid reads were discovered.

Masterxilo commented 2 years ago

The FASTQ NGS files were obtained from a virus mix.

Why does it say that no reads are valid, even though according to the read filtering statistics some reads were not discarded?

kpj commented 2 years ago

Thanks a lot for submitting the issue!

The read filtering statistics say

Warning: 2.4% of the reads were discarded because they are unmapped.
Warning: 97.6% of the reads were discarded because they are unpaired.

This means that 2.4% + 97.6% = 100% of the input reads were discarded, so no reads were left to analyze.

At the moment, this version of PredictHaplo does not support unpaired reads. This seems to be the issue here.
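
To check an input file before running PredictHaplo, the two categories from the warnings can be counted directly from the FLAG column of the SAM file. The following is only a rough sketch: the function name `sam_flag_summary` is made up for illustration, and it assumes that "unpaired" corresponds to SAM flag bit 0x1 being unset and "unmapped" to bit 0x4 being set. PredictHaplo's actual filter may use different criteria.

```shell
# Hypothetical helper: summarize the SAM flags relevant to the warnings above.
# Assumption: "unmapped" means flag bit 0x4 is set, "unpaired" means flag
# bit 0x1 is unset. PredictHaplo's own filtering may differ from this.
sam_flag_summary() {
    # $1: path to a SAM file
    awk -F '\t' '
        /^@/   { next }    # skip header lines
        NF < 2 { next }    # skip blank or malformed lines
        {
            total++
            flag = $2
            if (int(flag / 4) % 2 == 1) unmapped++    # bit 0x4 set
            else if (flag % 2 == 0)     unpaired++    # bit 0x1 unset
            else                        paired++
        }
        END {
            if (total == 0) { print "no reads"; exit }
            printf "total=%d unmapped=%.1f%% unpaired=%.1f%% paired=%.1f%%\n",
                   total, 100 * unmapped / total,
                   100 * unpaired / total, 100 * paired / total
        }
    ' "$1"
}
```

It could be called as `sam_flag_summary merged_ITERATION_COUNT_sorted.sam`. If the forward and reverse reads were concatenated into a single FASTQ file before alignment, the aligner has likely treated them as single-end reads, in which case the `paired` share would be expected to be 0%. `samtools flagstat` reports similar counts for BAM and SAM files.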