cbirdlab / dDocentHPC

hard fork of dDocent, edited to run without interactive user input
2 stars 5 forks source link

mkVCF vcfcombine error #18

Open ARW-UBT opened 6 months ago

ARW-UBT commented 6 months ago

Hello, The dDocentHPC analyses completed almost to the end, but I get at the very end the following error message: vcfcombine: symbol lookup error: /home/bt140047/mambaforge/envs/ddocent/bin/../lib/libvcflib.so.1: undefined symbol: wavefront_align

The ddocent conda env contains that vcflib package:

Name Version Build Channel

vcflib 1.0.9 h146fbdb_4 bioconda

And it seems that freebayes needs this version: $ mamba repoquery whoneeds vcflib Executing the query vcflib Name Version Build Depends Channel ddocent 2.9.4 hdfd78af_1 vcflib >=0.1.11 bioconda freebayes 1.3.7 h1870644_0 vcflib >=1.0.9,<2.0a0 bioconda

Does anybody have a suggestion how to proceed? Best regards,

cbird808 commented 6 months ago

Can you provide the complete output? If you're running this on a SLURM hpc, then it would the "out" file. If you are running on a workstation in a bash terminal, then it would be what is output to the screen.

This will help me understand the settings you are using and also allow me to identify the offending line of code.

ARW-UBT commented 6 months ago

Hello here are some outputs that I have in this project. log-fltrBAM_out.txt log-fltrBAM_out-error.txt log-mkBAM_out.txt

Thanks for your help, Best,

cbird808 commented 6 months ago

These are the types of files that I was looking for.

the log-mkBAM_out.txt and log-fltrBAM_out.txt files don't indicate any fatal errors, but the log-fltrBAM_out-error.txt file does indicate that 4 filtered bam files have a formatting error. Perhaps those 4 have no reads left after filtering?

You should have at least another out file from running mkVCF that would help me with your original issue.

cbird808 commented 6 months ago

Further, dDocentHPC is compatible with the ddocent 2.7.8 dependencies and I do know there are unresolved issues with the 2.9.4 dependencies in dDocentHPC. That is probably why you're getting the mkVCF error. You can load the 2.7.8 dependencies here

conda install -c bioconda ddocent=2.7.8

ARW-UBT commented 6 months ago

You should have at least another out file from running mkVCF that would help me with your original issue.

Here are some output files from a test run, this has also the mkVCF output (the out files from above are from a run that did not reach mkVCF step). fltrBAM.out.txt mkBAM.out.txt mkVCF.out.txt

ARW-UBT commented 6 months ago

conda install -c bioconda ddocent=2.7.8

Thanks for this info, I will try v.2.7.8 now. (Sorry, I have overlooked this recommendation on Github)

ARW-UBT commented 6 months ago

Hi, I have installed ddocent 2.7.8 and started an new run. trimFQ completed successfully, creating mkREF and mkBAM folders mkREF does some calculations, but did not complete successfully. Here is the out file and a copy of additional screen output. out-mkREF.txt out-mkREF-screen.txt

Related to this step: I assume it should be possible to provide an existing reference genome file to dDocentHPC, such as 'reference.fasta' in the classical dDocent script. I have provided a 'reference.fasta' file (also in the dDocentHPC format as 'reference.2.2.fasta') in the mkBAM folder and started the mkBAM analysis. mkBAM completes without error messages, but the -bam files are empty. out-mkBAM.txt

How do I have to prepare an existing reference file for the mkBAM step.

cbird808 commented 6 months ago

Installing ddocent 2.7.8 and bwa-meme seemed to solve the newest mkBAM issue. Will leave this open a little longer to confirm that mkVCF is working now

ARW-UBT commented 6 months ago

Tried it till mkVCF and got this screen message. out-mkVCF.txt out-mkVCF-screen.txt

cbird808 commented 6 months ago

Getting back to the original error, which was not solved by the dDocent conda install package version,

vcfcombine is a command in the vcflib package that quite literally combines together the sub vcf files that were created by each parallelized thread. It's called in line 2247 of dDocentHPC.bash

vcfcombine ./raw.$CUTOFFS.vcf/raw.*.$CUTOFFS.vcf | sed -e 's/   \.\:/   \.\/\.\:/g' > TotalRawSNPs.$CUTOFFS.vcf

My working version is from vcflib v1.0.0-rc0-349-g45c6-dirty. But before you install older versions of vcflib, I don't see how this can be the problem.

I hypothesize that the error is being triggered because there are no vcf files to combine, or they are all 0 bytes. To confirm, I would check to see if the sub vcf files were created by the freebayes command and that their size is >0 bytes. They would be in your dir named raw.2.2.vcf.

solution a

If you do, indeed, have vcf files with data in raw.2.2.vcf, then I would troubleshoot by running the following from the parent dir of raw.2.2.vcf:

CUTOFFS=2.2  #because these are the cutoffs you used in naming your ref
vcfcombine ./raw.$CUTOFFS.vcf/raw.*.$CUTOFFS.vcf | sed -e 's/   \.\:/   \.\/\.\:/g' > TotalRawSNPs.$CUTOFFS.vcf

There are times where the sub vcf files are created in the dDocentHPC script, but they aren't combined into a single final file, but then the same line will run just fine on it's own.

solution b

if you don't have any vcf files or they are all empty, then the vcfcombine error is a symptom rather than the cause. Make sure that freebayes has the files it needs to run. Your settings cause the following line to run freebayes on dDocent line 2214:

ls mapped.*.$CUTOFFS.bed | sed 's/mapped.//g' | sed 's/.bed//g' | shuf | $PARALLEL "freebayes -b cat.$CUTOFFS-RRG.bam -t mapped.{}.bed -v raw.{}.vcf -f reference.$CUTOFFS.fasta -p $PLOIDY -n $BEST_N_ALLELES -m $MIN_MAPPING_QUAL -q $MIN_BASE_QUAL -E $HAPLOTYPE_LENGTH --min-repeat-entropy $MIN_REPEAT_ENTROPY --min-coverage $MIN_COVERAGE -F $MIN_ALT_FRACTION -C $FREEBAYES_C -G $FREEBAYES_G -3 $FREEBAYES_3 -e $FREEBAYES_e -z $FREEBAYES_z -Q $FREEBAYES_Q -U $FREEBAYES_U -$ $FREEBAYES_DOLLAR --populations popmap.$CUTOFFS ${FREEBAYES_r}${FREEBAYES_report_monomorphic}${FREEBAYES_w}${FREEBAYES_V}${FREEBAYES_a}${FREEBAYES_no_partial_observations}"

Check that you have a cat.2.2-RRG.bam file with >0 bytes.

Check that you have mapped.*.2.2.bed files with >0 bytes.

If you are missing either then let me know.

solution c

If you have the files necessary for freebayes to run, then I propose the most likely cause of no vcf or empty vcf files would be the freebayes settings. You've thinned your data for quick testing, and you consequently may need to adjust the settings to be more lenient. I'd look at these in particular:

# in the config file under mkVCF:

2       Minimum Mean Depth of Coverage Per Individual                   Limit analysis to contigs with at least the specified mean depth of coverage per individual

0.375   freebayes -F --min-alternate-fraction (decimal 0-1)             There must be at least 1 individual with this fraction of alt reads to evaluate the position. If your individuals are barcoded, then use 0.2. If your data is pooled, then set based upon ~ 1/(numIndivids * ploidy) and average depth of coverage

20      freebayes -C --min-alternate-count (integer)                    Require at least this count of observations supporting an alternate allele within a single individual in order to evaluate the position.  default: 2

20      freebayes -G --min-alternate-total (integer)                    Require at least this count of observations supporting an alternate allele within the total population in order to use the allele in analysis.  default: 1

# this one might take too long to run if you change it
no      freebayes    --report-monomorphic (no|yes)                      Report even loci which appear to be monomorphic, and report allconsidered alleles, even those which are not in called genotypes. Loci which do not have any potential alternates have '.' for ALT.

Let me know how this goes

cbird808 commented 6 months ago

And for what it's worth, here are the packages in the container I use to run dDocentHPC on an hpc:

cbird@e3-w6420b-20:~$ crun conda list
# packages in environment at /opt/ddocent:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       0_gnu    conda-forge
atk                       2.36.0                        0    conda-forge
atk-1.0                   2.36.0               haf93ef1_0    conda-forge
bedtools                  2.29.2               hc088bd4_0    bioconda
bwa                       0.7.17               hed695b0_7    bioconda
bzip2                     1.0.8                h516909a_2    conda-forge
ca-certificates           2020.4.5.1           hecc5488_0    conda-forge
cairo                     1.16.0            hcf35c78_1003    conda-forge
cd-hit                    4.8.1                hdbcaa40_0    bioconda
certifi                   2020.4.5.1       py37hc8dfbb8_0    conda-forge
chrpath                   0.16              h14c3975_1001    conda-forge
coreutils                 8.31                 h516909a_0    conda-forge
curl                      7.68.0               h33f0ec9_1    conda-forge
dbus                      1.13.6               he372182_0    conda-forge
ddocent                   2.7.8                         3    bioconda
expat                     2.2.9                he1b5a44_2    conda-forge
fastp                     0.20.0               hdbcaa40_0    bioconda
fontconfig                2.13.1            h86ecdb6_1001    conda-forge
freebayes                 1.3.2            py37hc088bd4_0    bioconda
freetype                  2.10.1               he06d7ca_0    conda-forge
fribidi                   1.0.9                h516909a_0    conda-forge
gdk-pixbuf                2.38.2               h3f25603_3    conda-forge
gettext                   0.19.8.1          hc5be6a0_1002    conda-forge
giflib                    5.2.1                h516909a_2    conda-forge
glib                      2.58.3          py37he00f558_1003    conda-forge
gnuplot                   5.2.7                h0fb2448_3    conda-forge
gobject-introspection     1.58.2          py37h619baee_1004    conda-forge
graphite2                 1.3.13            he1b5a44_1001    conda-forge
grep                      2.14                 h14c3975_3    bioconda
gst-plugins-base          1.14.5               h0935bb2_2    conda-forge
gstreamer                 1.14.5               h36ae1b5_2    conda-forge
gtk2                      2.24.32              h586f36d_1    conda-forge
harfbuzz                  2.4.0                h9f30f68_3    conda-forge
htslib                    1.9                  h4da6232_3    bioconda
icu                       64.2                 he1b5a44_1    conda-forge
jpeg                      9c                h14c3975_1001    conda-forge
krb5                      1.17.1               h2fd8d38_0    conda-forge
ld_impl_linux-64          2.34                 h53a641e_0    conda-forge
libclang                  9.0.1           default_hde54327_0    conda-forge
libcurl                   7.68.0               hf7181ac_1    conda-forge
libdeflate                1.5                  h516909a_0    conda-forge
libedit                   3.1.20170329      hf8c457e_1001    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgd                     2.2.5             h307a58e_1007    conda-forge
libgomp                   9.2.0                h24d8f2e_2    conda-forge
libiconv                  1.15              h516909a_1006    conda-forge
libllvm9                  9.0.1                hc9558a2_0    conda-forge
libpng                    1.6.37               hed695b0_1    conda-forge
libssh2                   1.8.2                h22169c7_2    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
libtiff                   4.1.0                hc3755c2_3    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libwebp                   1.0.2                h56121f0_5    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxkbcommon              0.10.0               he1b5a44_0    conda-forge
libxml2                   2.9.10               hee79883_0    conda-forge
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
mawk                      1.3.4                h14c3975_2    bioconda
ncurses                   6.1               hf484d3e_1002    conda-forge
nspr                      4.25                 he1b5a44_0    conda-forge
nss                       3.47                 he751ad9_0    conda-forge
openssl                   1.1.1f               h516909a_0    conda-forge
pango                     1.42.4               h7062337_4    conda-forge
parallel                  20200322                      0    conda-forge
pcre                      8.44                 he1b5a44_0    conda-forge
pear                      0.9.6                h98de208_5    bioconda
perl                      5.26.2            h516909a_1006    conda-forge
pip                       20.0.2                     py_2    conda-forge
pixman                    0.38.0            h516909a_1003    conda-forge
pthread-stubs             0.4               h14c3975_1001    conda-forge
python                    3.7.6           h8356626_5_cpython    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
qt                        5.12.5               hd8c4c69_1    conda-forge
rainbow                   2.0.4                h14c3975_3    bioconda
readline                  8.0                  hf8c457e_0    conda-forge
samtools                  1.9                 h10a08f8_12    bioconda
sed                       4.7               h1bed415_1000    conda-forge
seqtk                     1.3                  hed695b0_2    bioconda
setuptools                46.1.3           py37hc8dfbb8_0    conda-forge
sqlite                    3.30.1               hcee41ef_0    conda-forge
tk                        8.6.10               hed695b0_0    conda-forge
unzip                     6.0                  h516909a_0    conda-forge
vcflib                    1.0.0_rc3        py37hc088bd4_0    bioconda
vcftools                  0.1.16               he860b03_3    bioconda
wheel                     0.34.2                     py_1    conda-forge
xorg-kbproto              1.0.7             h14c3975_1002    conda-forge
xorg-libice               1.0.10               h516909a_0    conda-forge
xorg-libsm                1.2.3             h84519dc_1000    conda-forge
xorg-libx11               1.6.9                h516909a_0    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.3                h516909a_0    conda-forge
xorg-libxext              1.3.4                h516909a_0    conda-forge
xorg-libxrender           0.9.10            h516909a_1002    conda-forge
xorg-libxt                1.2.0                h516909a_0    conda-forge
xorg-renderproto          0.11.1            h14c3975_1002    conda-forge
xorg-xextproto            7.3.0             h14c3975_1002    conda-forge
xorg-xproto               7.0.31            h14c3975_1007    conda-forge
xz                        5.2.5                h516909a_0    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge
zstd                      1.4.4                h3b9ef0a_2    conda-forge
ARW-UBT commented 6 months ago

Check that you have a cat.2.2-RRG.bam file with >0 bytes.

Check that you have mapped.*.2.2.bed files with >0 bytes.

I do have the cat-file: 241M 9. Jan 16:13 cat.2.2-RRG.bam

I do have a mapped bed file: 362K 9. Jan 16:13 mapped.2.2.bed but I am not sure whether ther should me more of them according to your wildcard.

The raw.2.2.vcf folder is indeed empty.

cbird808 commented 6 months ago

There's a mawk script starting on line 2121 of dDocentHPC.bash that splits the bed file into N files based upon the number of threads available. If you don't have a set of bed files then I think that's where the issue is. In the present dDocentHPC.bash script, the line of code that deletes those bed files is commented out, so they should remain. You can try copying this mawk script on 2121 into a text editor, remove the tabs, then paste into your terminal and run

Get Outlook for Androidhttps://aka.ms/AAb9ysg


From: ARW-UBT @.> Sent: Wednesday, January 10, 2024 3:06:42 AM To: cbirdlab/dDocentHPC @.> Cc: Bird, Chris @.>; Comment @.> Subject: Re: [cbirdlab/dDocentHPC] mkVCF vcfcombine error (Issue #18)

Check that you have a cat.2.2-RRG.bam file with >0 bytes.

Check that you have mapped.*.2.2.bed files with >0 bytes.

I do have the cat-file: 241M 9. Jan 16:13 cat.2.2-RRG.bam

I do have a mapped bed file: 362K 9. Jan 16:13 mapped.2.2.bed but I am not sure whether ther should me more of them according to your wildcard.

The raw.2.2.vcf folder is indeed empty.

— Reply to this email directly, view it on GitHubhttps://github.com/cbirdlab/dDocentHPC/issues/18#issuecomment-1884364628, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADBV4S5Y2LCHQG6NEE35TETYNZDZFAVCNFSM6AAAAABA7TMZZOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBUGM3DINRSHA. You are receiving this because you commented.Message ID: @.***>

cbird808 commented 6 months ago

Additionally, to run the mawk script, you need to first set the variables

CUTOFFS=2.2 NUMProc=4 #this will cause 4 bed files to be made

and run the if then statement on 2111 (remove the tabs)

Get Outlook for Androidhttps://aka.ms/AAb9ysg


From: Bird, Chris @.> Sent: Wednesday, January 10, 2024 5:20:37 AM To: cbirdlab/dDocentHPC @.>; cbirdlab/dDocentHPC @.> Cc: Comment @.> Subject: Re: [cbirdlab/dDocentHPC] mkVCF vcfcombine error (Issue #18)

There's a mawk script starting on line 2121 of dDocentHPC.bash that splits the bed file into N files based upon the number of threads available. If you don't have a set of bed files then I think that's where the issue is. In the present dDocentHPC.bash script, the line of code that deletes those bed files is commented out, so they should remain. You can try copying this mawk script on 2121 into a text editor, remove the tabs, then paste into your terminal and run

Get Outlook for Androidhttps://aka.ms/AAb9ysg


From: ARW-UBT @.> Sent: Wednesday, January 10, 2024 3:06:42 AM To: cbirdlab/dDocentHPC @.> Cc: Bird, Chris @.>; Comment @.> Subject: Re: [cbirdlab/dDocentHPC] mkVCF vcfcombine error (Issue #18)

Check that you have a cat.2.2-RRG.bam file with >0 bytes.

Check that you have mapped.*.2.2.bed files with >0 bytes.

I do have the cat-file: 241M 9. Jan 16:13 cat.2.2-RRG.bam

I do have a mapped bed file: 362K 9. Jan 16:13 mapped.2.2.bed but I am not sure whether ther should me more of them according to your wildcard.

The raw.2.2.vcf folder is indeed empty.

— Reply to this email directly, view it on GitHubhttps://github.com/cbirdlab/dDocentHPC/issues/18#issuecomment-1884364628, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADBV4S5Y2LCHQG6NEE35TETYNZDZFAVCNFSM6AAAAABA7TMZZOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBUGM3DINRSHA. You are receiving this because you commented.Message ID: @.***>

ARW-UBT commented 6 months ago

In order to check if the thinned out data in my test run is causing some of the issues, I decided to use my old 150 bp paired end date from the previous dDocent analyses again (which were sucessfully analyzed by dDocent (and Stacks)) and to feed them into dDocentHPC: I started with the trimmed R1 and R2 reads:

The mkVCF step started and was qiute busy for > 20 minutes, but did stop with the error below. vcfcombine: symbol lookup error: /home/bt140047/miniconda3/envs/ddocent-2.7.8/bin/../lib/libvcflib.so.1: undefined symbol: wavefront_align

Now, however, without the messages related to the missing vcf or bed files (they are there). It seems to me that the vcfcombine step is causing the error message on the screen, as I can reproduce it by manually running the 'solution a' instruction mentioned above: CUTOFFS=2.2 vcfcombine ./raw.$CUTOFFS.vcf/raw.*.$CUTOFFS.vcf | sed -e 's/ \.\:/ \.\/\.\:/g' > TotalRawSNPs.$CUTOFFS.vcf

ARW-UBT commented 6 months ago

Oops! I just realized that the 33 vcf files in the raw folder were created, but are emty... Probably due to filtering settings... Sorry for not havin seen this before...

cbird808 commented 6 months ago

That's progress that you have the vcf files, troubleshooting combining them together should be straight forward.

I had a conversation with chatgpt and it's primary conclusion "that there's a mismatch between the vcfcombine executable and the version of the library (libvcflib.so.1) it is trying to use" seems reasonable. I suspect that vcfcombine is that installed by 2.7.8 (try vcfcombine -v in the two different environments) and the libvcflib.so.1 being accessed might be the newer version from vcflib (1.0.9 is the newest version on github)

As a quick test, you might try changing to the newer ddocent environment and try running the vcfcombine step again. If you look over the chatgpt transcript, it also suggests some solutions to resolve the issue in the older ddocent environment. I additionally tried to have chatgpt narrow down where the libvcflib.so.1 files are located in your dir.

ARW-UBT commented 6 months ago

The vcf files in the raw.2.2.vcf directory are empty (0 Bytes), I assume the vcfcombine will not work. Since the bed files are not empty, would it make sense to adjust config settings for mkVCF and run mkVCF in order to get anything written into the vcf files? If so, which parameters should be changed in order to avoid filtering out too much. Just to see, whether this works, and I could then iteratively try to adjust the mkVCF config settings.

After that I would rerun vcfcombine from the newest dDocent environment.

Does this make sense?