LarsNauheimer / HybPhaser

Workflow to detect and phase hybrids in target capture data
GNU General Public License v3.0
15 stars 5 forks source link

bash script 1a failing to generate consensus sequences #19

Open susanfawcett opened 1 month ago

susanfawcett commented 1 month ago

We're having a trouble with the bash script to for generating consensus sequences. It runs without errors, and generates a file for each locus, but each of those files is blank.

Does anyone have an examples of a script that used to obtain consensus sequences in first step of HybPhaser? The R scripts seem to run OK (output folders are generated, etc.).

The issue seems to be with this part of the script:

else
                    bwa index $CONTIG 2> /dev/null

                    if [ "$READTYPE" = "SE" ];then 
                        bwa mem $CONTIG $READS -t 1 -v 1 2> /dev/null | samtools sort > $BAM
                    else 
                        bwa mem -p $CONTIG $READS -t 1 -v 1 2> /dev/null | samtools sort > $BAM
                    fi
                fi

                bcftools mpileup -I -Ov -f $CONTIG $BAM 2> /dev/null | bcftools call -mv -A -Oz -o  $VCFZ 2> /dev/null
                bcftools index -f --threads 1 $VCFZ 2> /dev/null 
                bcftools consensus -I -i "(DP4[2]+DP4[3])/(DP4[0]+DP4[1]+DP4[2]+DP4[3]) >= $ALLELE_FREQ && (DP4[0]+DP4[1]+DP4[2]+DP4[3]) >= $READ_DEPTH && (DP4[2]+DP4[3]) >= $ALLELE_COUNT " -f $CONTIG $VCFZ 2> /dev/null | awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' > $CONSENSUS
            echo "" >> $CONSENSUS
LarsNauheimer commented 1 month ago

Hi Susan, The script does work for me. One thing that can lead to getting empty files is being in the conda environment for HybPiper. If you are still in the active conda environment, deactivate it and try again to run the script. Hopefully that resolves your issue.

susanfawcett commented 1 month ago

Thanks, Lars!

I appreciate your prompt response. We are running the script on Berkeley's Savio, and that could very well be our problem.

We'll let you know!

Cheers, Susan Fawcett

On Wed, May 22, 2024 at 12:24 PM Lars Nauheimer @.***> wrote:

Hi Susan, The script does work for me. One thing that can lead to getting empty files is being in the conda environment for HybPiper. If you are still in the active conda environment, deactivate it and try again to run the script. Hopefully that resolves your issue.

— Reply to this email directly, view it on GitHub https://github.com/LarsNauheimer/HybPhaser/issues/19#issuecomment-2125875267, or unsubscribe https://github.com/notifications/unsubscribe-auth/AW3QMJO47FNX2SMQ7U5YEMDZDULKNAVCNFSM6AAAAABIEOPSIKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRVHA3TKMRWG4 . You are receiving this because you authored the thread.Message ID: @.***>

ark224 commented 1 month ago

Hi Lars,

I am the person running these scripts for Susan, and I do not run them in a conda envronment. Two questions or you:

thanks, Andy

LarsNauheimer commented 1 month ago

Hi Andy, I do not know how the conda environment influences the script. Sorry to hear that the issue lies somewhere else. In the past I had a problem with the sequence names in the fasta files generated by HybPiper2, which in contrast to HybPiper1 added some more information after a space. But that got fixed.

Assuming you use a namelist.txt, can you try to use the script for a single sample? I believe that was an issue with another user.

spreinalesl commented 1 month ago

Hi all, I had the same issue and fixed it by installing the package BCFtools independently (http://www.htslib.org/download/). Prior to the introduction of HTSlib, SAMtools and BCFtools were distributed in a single samtools-0.1.x package, and that version of BCFtools does not work for Hybphaser. I am working on a conda environment, so that is not the problem, but exporting the path for all binaries (bwa, bcftools) prior to running the 1_generate_consensus_sequences.sh script, may ensure that HybPhaser finds them. In addition, using a list of samples or one sample at a time seems to have different behaviors, as mentioned by @LarsNauheimer. I run one sample per job in a slurm array.

Hopefully that solves your issue @ark224

ark224 commented 3 weeks ago

Thank you, @spreinalesl . We too are in a conda environment, but we did not activate the environment because that gave us problems, so we had to explicitly provide paths to those tools. We found that most of our problems were due to cockpit-error: using incorrect paths, etc.

However, a few were not:


In script 1_generate_consensus_sequences.sh, line 266 - Linux bash does not support the -n option to the wait command until bash v4.3, but the Berkeley HPC is on Red Hat bash v4.2.28. I realize that it might be unreasonable to expect a tool to support a shell that is that old. On the other hand, it is the Berkeley cluster, and as it is currently, this script cannot be run there.

In case anyone else finds themselves with a version of bash prior to v4.3, here is the workaround that we devised. Define a function anywait():

anywait(){

wait for any child process to end

for pid in "$@"; do
    while kill -0 "$pid"; do
        sleep 0.5
    done
done

}

Insert that function after the end of MakeCons() at line 260, and call anywait() instead of "wait -n" on line 266.


The script 1_generate_consensus_sequences_single-core.sh did not run at all for us due to a syntax error: Missing "fi" at line 235.

Once we fixed that, we used this script exclusively in place of 1_generate_consensus_sequences.sh, in the same manner that you did - in a slurm array with one sample per job - using 4 threads per sample to speed up the mapping. Our experiments indicated that on our system, using 4 cores was more efficient than either 2 or 8.


The script 4_merge_sequence_lists.R does not work correctly if you set a value for file_with_samples_included, because in that case samples_include_phased never gets set, and when it is needed at line 86, it is undefined. When we fixed that, then it still did not include the phased samples but did remove the normal sample corresponding to each phased sample. YYMV