McGranahanLab / mhc-hammer

Pipeline to detect HLA disruption from WES and RNAseq data
Other
15 stars 4 forks source link

Issue with contig '6' and when there are no secondary contigs #8

Open ilobon opened 2 months ago

ilobon commented 2 months ago

Hi,

I had a few errors subsetting bams, because my RNA bams reference has non-chr contigs (i.e. '6' , not 'chr6') and only the main chr contigs. My DNA bam has other contigs so I still want to run the pipeline with contig_reads=T. I am sharing the edits to bin/subset_bam_opt.sh that solved it for me.

1. non 'chr' chromosome 6

From line 144:

# first, check the chromosome name in hla_coords is present in the bam file
chr_name=$(echo ${hla_coords} | cut -d ":" -f 1)
if ! samtools view -H ${bam} | grep -q ${chr_name}; then
    echo "Chromosome name ${chr_name} not found in bam file ${bam}"
    # if not, try without "chr" prefix
    # if not in the bam either, exit with error
    chr_name=$(echo $chr_name | sed s/^chr//g)
    echo "Trying with ${chr_name} instead"
    if ! samtools view -H ${bam} | grep -q ${chr_name}; then
        echo "Error: chromosome name ${chr_name} not found in bam file ${bam} either - quitting!!"
        exit 1
    fi
fi

echo Extracting reads aligning to ${chr_name}
samtools view -@ ${task_cpus} -bh -o ${tmp_dir}/region.bam ${bam} ${chr_name}

and also the print statement at the end to have $chr_name instead of $hla_coords

printf " %-55s %10s\n" "Number of reads extracted from ${chr_name}:" "${region_read_count}"

2. manage empty grep statement when there are no secondary contigs and skip the read extraction

I added a true or statement because the empty grep made it crash

# Filter out standard chromosomes from the contig list, keeping only non-standard contigs in the output file
true || grep -v -E '^chr[1-9]{1}$|chr1[0-9]{1}$|chr2[0-2]{1}$|^[0-9]{1}$|^1[0-9]{1}$|^2[0-2]{1}$|^chr[XYMT]{1,2}$|^[XYMT]{1,2}$' ${tmp_dir}/contigs.txt > ${contig_file}

and moved the section between comments "# Iterate through the non-standard contigs in the ${contig_file} file" and "# Extract reads that contain HLA reference kmers if params == true" to inside an if $contig_file exists statement:


        # only if contig file exists
        if [ -s "$contig_file" ]; then

            # Iterate through the non-standard contigs in the ${contig_file} file
            cat ${contig_file} | while read line || [[ -n $line ]];
            do  
                # If the contig is a standard chromosome, skip extraction
                if [[ "${chroms[*]}\n" =~ "${line}" ]]; then
                    echo ${line} will not be extracted
                else
                    # Extract reads mapping to the non-standard contig and append them to the contig.bam file
                    echo Extracting reads mapping to ${line}
                    samtools view -@ ${task_cpus} -bh -o ${tmp_dir}/${line}_tmp_contig.bam ${bam} ${line}
                    # Count the number of reads in the temporary contig.bam file
                    contig_read_count=$(samtools view -@ ${task_cpus} -c ${tmp_dir}/${line}_tmp_contig.bam)
                    # Append the contig name and read count to the contig_read_counts.csv file
                    echo ${line},${contig_read_count} >> "${out_prefix}.read_counts.csv"
                fi
            done

            tmp_contig_bams=$(ls ${tmp_dir}/*tmp_contig.bam)

            # Merge input.bam and tmp_contig.bams
            samtools merge -c -p -@ ${task_cpus} ${tmp_dir}/contigs.bam ${tmp_contig_bams}

            # remove temporary contig.bam files
            rm ${tmp_contig_bams}

            # Count the number of reads in the contigs.bam file
            contig_read_count=$(samtools view -@ ${task_cpus} -c ${tmp_dir}/contigs.bam)

            # Merge input.bam and contigs.bam
            samtools merge -c -p -@ ${task_cpus} ${tmp_dir}/merged_input.bam ${tmp_dir}/input.bam ${tmp_dir}/contigs.bam

            # remove input.bam and contigs.bam
            rm ${tmp_dir}/contigs.bam ${tmp_dir}/input.bam

            # rename merged_input.bam to input.bam
            mv ${tmp_dir}/merged_input.bam ${tmp_dir}/input.bam
        fi

    fi

Best,

Irene

clareputtick commented 1 month ago

Hi Irene,

Thanks for sharing the edits - we'll work on incorporating them into our code.

Best, Clare