anands-repo / hello

DNN-based small variant caller
MIT License

Human WGS data sequenced by BGI and installation of HELLO #5

Closed robertzeibich closed 2 years ago

robertzeibich commented 2 years ago

I have human WGS data sequenced by BGI. Is my data appropriate for HELLO? If yes, which pipeline (Illumina, PacBio) should I use?

anands-repo commented 2 years ago

For shorter reads, the Illumina models may be used, and for longer reads, the PacBio HiFi models may be used. However, the models are trained on the Illumina and PacBio HiFi error profiles and are not optimized for other sequencing platforms. If another tool provides a model/algorithm trained on BGI reads, I would say it will likely perform better than HELLO.

robertzeibich commented 2 years ago

Thanks for your quick response. I want to run the Illumina variant calling option on our high-performance cluster (HPC) in a singularity container, for security reasons.

However, for now I am trying to understand how to get the docker container running. Steps taken so far:

  1. I pulled the docker container: docker pull oddjobs/hello_image.x86_64
  2. Looked inside the container (docker run -it oddjobs/hello_image.x86_64 bash). I found two .sh files and one directory: Miniconda3-latest-Linux-x86_64.sh, bazel-0.26.1-installer-linux-x86_64.sh, and miniconda3 (directory).
  3. I also looked for python/call.py inside the docker container, but without success.

Can you tell me how to use the docker container?

When do I need to build the tool? I git-cloned the repository and tried to build it (cmake . && make -j 12), but without success.

I am also a little confused by the note "To properly download models, git-lfs needs to be installed. Once installed, please do git lfs pull inside the git repo". Is git-lfs already installed in the docker container?

anands-repo commented 2 years ago

The docker image only contains the dependencies, not hello itself. Take a look at the following steps to build it:

# Launch container
docker run -it oddjobs/hello_image.x86_64 /bin/bash

# Inside container
apt-get -y update && apt-get -y install git && git clone https://github.com/anands-repo/hello.git
mkdir git-lfs
cd git-lfs/
wget https://github.com/git-lfs/git-lfs/releases/download/v3.1.2/git-lfs-linux-amd64-v3.1.2.tar.gz  # Or another version of git lfs
tar xvfz git-lfs-linux-amd64-v3.1.2.tar.gz
./install.sh
cd
git lfs install
cd hello/
git lfs pull
cmake . && make -j 12

robertzeibich commented 2 years ago

Thanks for your quick response again. I am getting closer, but I am not there yet.

I executed the steps you recommended. Then I ran docker commit to make the changes permanent, and afterwards I tried to convert the docker image into a singularity container using singularity 3.9.2.

Steps:

  1. docker commit c3f279d17e0a hello
  2. singularity build hello.sif docker-daemon://hello:latest

I noticed that in the docker container HELLO is stored under /root/hello, and in the singularity container I do not have access to /root. Do you have an idea what I could do? Due to security reasons, I can only run HELLO in a singularity container on our HPC.

anands-repo commented 2 years ago

This is a bit of a sticky problem. Does singularity only play well with certain paths like /opt? Currently some dependencies are in /usr etc., if I am not mistaken - I do not know whether those are also blocked by singularity.

I don't know whether the following will work, but I can think of three steps to try, assuming that /opt works fine - if not, replace it with a path that does work (steps 2 and 3 apply if /usr is not supported by singularity):

  1. Move miniconda3 to /opt (it is currently in /root). After the move, the PATH environment variable may need to be updated to include the new path of miniconda3/bin (see the sketch after this list). Refer: https://www.anaconda.com/blog/moving-conda-environments
  2. If BOOST is not installed in /opt, install (build from source) BOOST with boost::python support in /opt. Note that boost with boost::python support needs a few changes to the standard boost compilation flow. Change the CMakeLists.txt file in hello (see the commented lines at the start of the file) accordingly. I do not remember off the top of my head, but I believe I used boost 1.68. This is the step that can lead to some difficulties - it may take some trial and error to get everything working together (boost installation, CMakeLists.txt, etc.), especially given that a boost installation already exists.
  3. Install GNU parallel in /opt, and update PATH appropriately
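
A minimal sketch of steps 1 and 3 inside the container, assuming /opt is visible to singularity and reusing paths from earlier in this thread (relocating a conda install can break hard-coded prefixes, so treat this as a starting point rather than a verified recipe; the profile.d file name is made up):

# Step 1: relocate miniconda3 and put its bin directory on PATH
mv /root/miniconda3 /opt/miniconda3
export PATH=/opt/miniconda3/bin:$PATH
echo 'export PATH=/opt/miniconda3/bin:$PATH' >> /etc/profile.d/hello_paths.sh

# Step 3: build GNU parallel into /opt and put it on PATH
wget https://ftpmirror.gnu.org/parallel/parallel-latest.tar.bz2
tar xjf parallel-latest.tar.bz2
cd parallel-*/ && ./configure --prefix=/opt/parallel && make && make install
export PATH=/opt/parallel/bin:$PATH
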
robertzeibich commented 2 years ago

Thanks for the reply. I also got some IT support at my end now. Hopefully, we will get the tool running based on your suggestions.

I also have some other questions which I would like to have answered while we are working on getting the tool running. How to use the num_threads? Is there a min or max limit? I guess the workdir points to the output file directory. Is that correct?

anands-repo commented 2 years ago

num_threads is the number of CPU cores to be used in each run. Correct - workdir points to the output directory.

robertzeibich commented 2 years ago

The tool is running, but I currently get the following error message (slurm output):

ETA: 0s Left: 1 AVG: 0.63s local:1/10985/100%/0.6s ETA: 0s Left: 0 AVG: 0.63s local:0/10986/100%/0.6s INFO:root:Completed runs, checking log files ERROR:root:File /projects/xm41/robert/hello/GALF003/features_bams_GALF003bam__/features0.log doesn't have termination string

In the workdir, the following files were created: features_bams_GALF003bam hotspotsbamsGALF003bamchr16 hotspotsbamsGALF003bamchr3 hotspotsbamsGALF003bamchr1 hotspotsbamsGALF003bamchr17 hotspotsbamsGALF003bamchr4 hotspotsbamsGALF003bamchr10 hotspotsbamsGALF003bamchr18 hotspotsbamsGALF003bamchr5 hotspotsbamsGALF003bamchr11 hotspotsbamsGALF003bamchr19 hotspotsbamsGALF003bamchr6 hotspotsbamsGALF003bamchr12 hotspotsbamsGALF003bamchr2 hotspotsbamsGALF003bamchr7 hotspotsbamsGALF003bamchr13 hotspotsbamsGALF003bamchr20 hotspotsbamsGALF003bamchr8 hotspotsbamsGALF003bamchr14 hotspotsbamsGALF003bamchr21 hotspotsbamsGALF003bamchr9 hotspotsbamsGALF003bamchr15 hotspotsbamsGALF003__bamchr22

Do you have an idea what might be wrong and how I can fix this issue? Do you have an example somewhere that I could try out? I would just like to make sure that the tool runs as it is supposed to.

anands-repo commented 2 years ago

If you want to try an Illumina dataset for sanity checking, you can use one from Genome in a Bottle. I haven't used them directly, but I believe one of those should work. However, we are also checking whether the installation is correct - if there is an issue with the installation, neither dataset will produce output on the platform you are using.

It may be easier to send me the run command and the contents of /projects/xm41/robert/hello/GALF003/features_bams_GALF003bam__/features0.log

My suspicion is that boost::Python isn't the right version or hasn't been installed correctly. A standard boost installation or build from source doesn't come with boost::Python enabled. The build files need to be modified a little for that, and this wasn't well documented when I tried to install it.
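
For reference, a rough outline of how one might build boost 1.68 with boost::python enabled into /opt, using the standard bootstrap/b2 flow (the python path is an assumption, and the CMakeLists.txt changes mentioned earlier still have to be made by hand):

# Download and unpack boost_1_68_0 from boost.org first, then:
cd boost_1_68_0
./bootstrap.sh --prefix=/opt/boost --with-libraries=python --with-python=/opt/miniconda3/bin/python3
./b2 install

Whether this coexists cleanly with the boost installation already in the image is exactly the trial-and-error part mentioned above.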

robertzeibich commented 2 years ago

Which version of boost::Python do you recommend using/installing? Do you know how I need to modify the build files?

I guess my problem has something to do with the following directory: /tmp/vcftemp

1) WITHOUT /tmp/vcftemp: If I run the script without /tmp/vcftemp bound to the container, I receive the following error. Script:

REF="/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"
NUM_THREADS=8
workdir="/projects/xm41/robert/hello"
hello="/usr/local/hello/709b7c3/"
mkdir -p /tmp/vcftemp
module load singularity/3.9.2
#-B /tmp/vcftemp:/tmp/vcftemp
singularity run -B /fs02 -B /fs04 -B /fs03:/fs03 -B /projects:/projects -B /scratch:/scratch ${hello}/hello.sif \
python /opt/hello/python/call.py \
--ibam GALF003.p0.03.bam \
--ref $REF \
--network models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn \
--num_threads $NUM_THREADS \
--workdir $workdir/GALF003_1x_tmp --mapq_threshold 5

Error:

INFO:root:Combining all hotspots and sharding
2022-04-16 10:57:52,214 Directory /projects/xm41/robert/hello/GALF003_1x_tmp/hotspots_hello___GALF003__p0__03__bam_chr22_ exists
2022-04-16 10:57:52,214 Counting hotspots
2022-04-16 10:57:52,214 Sharding with >= 0 items per cluster
INFO:root:Completed sharding, creating caller commands for dumping training data
INFO:root:Created call command for chromosome chr22
INFO:root:Launching all caller commands
Computers / CPU cores / Max jobs to run
1:local / 8 / 1
ETA: 0s Left: 0 AVG: 0.00s  0
INFO:root:Completed runs, checking log files
INFO:root:All commands completed correctly, preparing vcf
sort -k1,1d -k2,2n
sort -k1,1d -k2,2n
Search string /tmp/vcftemp/*expert0.vcf
Found 0 files
Running command ['vcf-sort', '/projects/xm41/robert/hello/GALF003_1x_tmp/results.ngs.tmp.vcf']
Search string /tmp/vcftemp/*expert1.vcf
Found 0 files
Running command ['vcf-sort', '/projects/xm41/robert/hello/GALF003_1x_tmp/results.tgs.tmp.vcf']
Search string /tmp/vcftemp/*expert2.vcf
Found 0 files
Running command ['vcf-sort', '/projects/xm41/robert/hello/GALF003_1x_tmp/results.ngs_tgs.tmp.vcf']
Search string /tmp/vcftemp/*best.vcf
Found 0 files
Running command ['vcf-sort', '/projects/xm41/robert/hello/GALF003_1x_tmp/results.best.tmp.vcf']
Search string /tmp/vcftemp/*mean.vcf
Found 0 files
Running command ['vcf-sort', '/projects/xm41/robert/hello/GALF003_1x_tmp/results.mean.tmp.vcf']
Choice histogram = {0: 0, 1: 0, 2: 0}

2) WITH /tmp/vcftemp: If I run the script with /tmp/vcftemp bound to the container, I receive the following error. Script:

REF="/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"
NUM_THREADS=8
workdir="/projects/xm41/robert/hello"
hello="/usr/local/hello/709b7c3/"
mkdir -p /tmp/vcftemp
module load singularity/3.9.2
singularity run -B /fs02 -B /fs04 -B /fs03:/fs03 -B /projects:/projects -B /scratch:/scratch -B /tmp/vcftemp:/tmp/vcftemp ${hello}/hello.sif \
python /opt/hello/python/call.py \
--ibam GALF003.p0.03.bam \
--ref $REF \
--network models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn \
--num_threads $NUM_THREADS \
--workdir $workdir/GALF003_1x_tmp --mapq_threshold 5

Error:

ETA: 0s Left: 4 AVG: 0.03s  local:4/497/100%/0.0s
ETA: 0s Left: 1 AVG: 0.03s  local:1/500/100%/0.0s
ETA: 0s Left: 0 AVG: 0.03s  local:0/501/100%/0.0s
INFO:root:Combining all hotspots and sharding
2022-04-16 10:45:26,049 Directory /projects/xm41/robert/hello/GALF003_1x_tmp/hotspots_hello___GALF003__p0__03__bam_chr22_ exists
2022-04-16 10:45:26,050 Counting hotspots
2022-04-16 10:45:26,050 Sharding with >= 0 items per cluster
INFO:root:Completed sharding, creating caller commands for dumping training data
INFO:root:Created call command for chromosome chr22
INFO:root:Launching all caller commands
Computers / CPU cores / Max jobs to run
1:local / 8 / 1
ETA: 0s Left: 0 AVG: 0.00s  0
INFO:root:Completed runs, checking log files
INFO:root:All commands completed correctly, preparing vcf
Traceback (most recent call last):
  File "/opt/hello/python/prepareVcf.py", line 255, in <module>
    shutil.rmtree(args.tmpdir);
  File "/opt/miniconda3/lib/python3.7/shutil.py", line 498, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/opt/miniconda3/lib/python3.7/shutil.py", line 496, in rmtree
    os.rmdir(path)
OSError: [Errno 16] Device or resource busy: '/tmp/vcftemp'

Unfortunately, I cannot reproduce the problem I mentioned in my previous message. However, when I look in the corresponding folder, the file /projects/xm41/robert/hello/GALF003/features_bams_GALF003bam__/features0.log does not exist.

I also had a chat with the technical support for our high-performance cluster. The person I talked to wanted to know whether it would be possible to use /scratch instead of /tmp, both because of space limitations and because /tmp is not shared between nodes.

anands-repo commented 2 years ago

Sorry, the previous mail was sent prematurely.

  1. For all inputs not in your home directory, can we check whether the inputs exist? For example, run "du -sh filename" inside the singularity container.
  2. Can we direct all outputs to your home directory? I believe singularity auto-mounts your home directory at the same path inside a container, so there is no need to mount anything for this purpose - paths in your home directory should work directly inside the singularity container. (I could be wrong on this depending on the cluster setup.) I would like the singularity mount issues to be resolved, otherwise we could be debugging the wrong issue.
  3. I also see use of both relative and absolute paths. Let's make all paths absolute.

NOTE: Please clean up the output directory before each run.
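
Regarding the /scratch-vs-/tmp question: one option, which I have not tested on your cluster, is to bind a directory on shared scratch storage over /tmp inside the container, so that the /tmp/vcftemp writes land on /scratch. A rough sketch, reusing paths from this thread (the scratch directory and the /path/to placeholders are not real paths):

mkdir -p /scratch/xm41/robert/vcftemp
singularity run \
  -B /projects -B /scratch \
  -B /scratch/xm41/robert/vcftemp:/tmp \
  /usr/local/hello/709b7c3/hello.sif \
  python /opt/hello/python/call.py \
    --ibam /path/to/GALF003.p0.03.bam \
    --ref /scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta \
    --network /path/to/hello/models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn \
    --num_threads 8 \
    --workdir /projects/xm41/robert/hello/GALF003_1x_tmp \
    --mapq_threshold 5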

robertzeibich commented 2 years ago

The inputs exist inside the singularity container:

3.1G /scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
16K /opt/hello/python/call.py
4.5G /scratch/xm41/ct/bamsDown/step1_bams1x/GALF003_1x.bam

Access to /projects is also given.

I cannot use my home directory because I only have 10GB available. When I run call.py with absolute paths + workdir /projects/xm41/robert/hello/GALF003_1x_test4/ inside the singularity container, the following files are created:

The bed and vcf files do not have any entries, just a header with no variants listed. The console output is the same as in my previous post under "1) WITHOUT /tmp/vcftemp".

How I entered the singularity container:

module load singularity/3.9.2
singularity run -B /fs02 -B /fs04 -B /fs03:/fs03 -B /projects:/projects -B /scratch:/scratch /usr/local/hello/709b7c3/hello.sif 

Code used inside the singularity container:

/opt/hello/python/call.py \
--ibam /scratch/xm41/ct/bamsDown/step1_bams1x/GALF003_1x.bam \
--ref /scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta \
--network models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn \
--num_threads 36 \
--workdir /projects/xm41/robert/hello/GALF003_1x_test4 --mapq_threshold 5

I would also be available for a Zoom call to sort this out. I am in Melbourne. What is your time zone?

anands-repo commented 2 years ago

Are you using a 1x coverage BAM file? A whole-genome BAM with a 4.5GB size and a _1x suffix seems to indicate as much.

Anyway, let’s continue - unfortunately, I am not able to do a Zoom for several weeks as I am away on a family matter. What I can do is review some logs etc on my phone for these several weeks. My schedule is non-existent and I pick up any odd 5 minutes available at some point in time during the day.

There are .sh and .log files inside the features and hotspots directories. They are numbered, say 0.sh, 0.log, etc. Could you pick one instance of each (say, numbered 10) from one of the hotspots directories and the features directory, upload them to Dropbox/Box/Google Drive, and share them with me?

Ideally, I would want the full working directory for a small chromosome, say 22, but I would need access to a computer to look at a tar.gz. If you think it's doable, please share a tar.gz of the workdir for chromosome 22. I will look for computer access.
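
Something along these lines should produce the tarball (the workdir path is taken from the scripts earlier in the thread, and the archive name is arbitrary):

# Pack the chromosome-22 working directory for sharing
tar czvf hello_chr22_workdir.tar.gz -C /projects/xm41/robert/hello GALF003_1x_tmp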

robertzeibich commented 2 years ago

I just read your message, and I am currently looking for a BAM file because I am not allowed to share data from the server. I am following your idea with Genome in a Bottle:

https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/ChineseTrio/alignment.index.ChineseTrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38_NHGRI_04062016.HG007

I thought of downloading the first one:

ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG007_NA24695-hu38168_mother/NA24695_Mother_HiSeq100x/NHGRI_Illumina100X_Chinesetrio_novoalign_bams/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam

Do you think this could work? If you have another file in mind, please let me know.

anands-repo commented 2 years ago

Take a look at this:

ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/

They have 60x BAM files, which must be quite a lot smaller.
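
If you want to mimic your 1x use case with one of these, subsampling with samtools is one option (a sketch - the 0.017 keep-fraction is roughly 1/60, and the file names are placeholders):

# Subsample a ~60x BAM down to ~1x (seed 42, keep-fraction 0.017), then index it
samtools view -b -s 42.017 HG002.60x.bam > HG002.1x.bam
samtools index HG002.1x.bam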

robertzeibich commented 2 years ago

HELLO works on 30x WGS BGI data and on the same data downsampled to 1x coverage. The error was that I was not using the correct network (model) path. Takeaway: check all paths. Thanks for your help again. Do you know if I need to run some kind of quality control (filtering) on the VCF files, like VQSR, or is that already covered by the tool?

anands-repo commented 2 years ago

Great to hear that!

There is no additional filtering to be run on the VCF. But there is a small error in the VCF header. Some VCF processing commands work without fixing it, but some may not. In any case, the issue is very simple to fix, and how to fix it is given at the end of the README file, if you scroll down.

Thanks again for debugging it. It's nice to hear it worked out for your use case.

anands-repo commented 2 years ago

A final note: by default, HELLO calls only chromosomes 1-22. To change this behavior, one would need to pass the --chromosomes option.

Will close the issue in a couple of days, if you do not have other questions.

robertzeibich commented 2 years ago

Thanks for pointing out the VCF fix and for telling me about the chromosomes option. How do I need to use it? --chromosomes 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y?

Do I need to run the tool again or can I include other chromosomes later?

anands-repo commented 2 years ago

It depends on how the chromosomes are named in the reference and the BAM file. If the chromosomes are named "1,2,3", then what you wrote here is correct. If the chromosomes are named "chr1,chr2, ..." then you would use --chromosomes "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY"
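
As a convenience, the comma-separated list can also be generated in the shell rather than typed out by hand (a small sketch; assumes a bash-like shell and chr-prefixed names):

# Build the --chromosomes argument for chr-prefixed references
CHROMS=$(echo chr{1..22} chrX chrY | tr ' ' ',')
echo "$CHROMS"   # chr1,chr2,...,chr22,chrX,chrY - pass this to --chromosomes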

robertzeibich commented 2 years ago

That works perfectly. I was able to add chrX and chrY. I also have to admit that the tool runs much faster than CPU-based DeepVariant and GATK.

What are the differences between these VCF files: results.best.vcf, results.mean.vcf, results.ngs_tgs.vcf, results.ngs.vcf, results.tgs.vcf? How can I create a VCF file containing all variants?

anands-repo commented 2 years ago

This is the correct output VCF: results.mean.vcf

Please reference the last 3 paragraphs or so of the README to see how to use this file.

robertzeibich commented 2 years ago

I believe that in your correct_string, Number and String must be swapped. However, I now have these corrected VCF files for multiple samples. When using GATK for variant calling, I was able to combine the single gVCF files into one gVCF using GenomicsDBImport --> GenotypeGVCFs --> ... I noticed that your tool does not produce gVCF files. Do you have an idea how I could combine the VCF files produced by your tool into one VCF file, or generate gVCF files from the VCF files produced by your tool?

Info: When I use GenomicsDBImport, I receive the following error message: A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at chr2:289039 in VCF. I have also done some digging online, but before starting to figure out a workaround, I thought of asking you for a solution because you might have one.

anands-repo commented 2 years ago

I guess the problem you are describing is that you have multiple VCF files from multiple runs, and you would like to combine them into one VCF file. The issue with MNPs is probably that HELLO lists more than two alleles but selects only two of them via the SAMPLE1 column. In such a case, the unselected allele(s) may be removed from the file.

For example, if REF=A, ALT=C,G,T and SAMPLE1=0/1, then we can rewrite this as REF=A, ALT=C, SAMPLE1=0/1. I do not have a script which does this automatically, but it should not be very hard to write one, at least in theory.
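
One possible shortcut instead of a custom script is bcftools, which is not part of HELLO, so this is an assumption worth checking against the version available on your cluster: view -s restricts the file to the sample and --trim-alt-alleles drops ALT alleles that are not used in that sample's genotype.

# Drop ALT alleles not referenced by SAMPLE1's genotype (file names are placeholders)
bcftools view -s SAMPLE1 --trim-alt-alleles results.mean.vcf -Oz -o results.trimmed.vcf.gz
bcftools index -t results.trimmed.vcf.gz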

By the way, I finished a new version of HELLO which may interest you. It incorporates many changes based mainly on issues you faced, so thanks for your feedback. All the algorithms and models remain the same, but the following convenience features are implemented:

  1. New docker image that is singularity friendly (directly convert to singularity without having to hack the image)
  2. Better logging to convey exactly what is going on (no GNU parallel, use of progress bars, much more concise and readable messages)
  3. Better error and exception handling - avoiding phantom fails which do not reveal what exactly is the problem

I will have to update the repo tomorrow, as I am waiting for the docker image upload to finish overnight (I am in the Pacific Time zone, US). Also, it is currently tested only for variant calling (not training) and only for Illumina data. But I think that serves your purpose, if you intend to run it again.

robertzeibich commented 2 years ago

I just went through the files to put an example together:

SAMPLE1.gvcf

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1

chr10 57686 . T . . END=57686 GT:DP:GQ:MIN_DP:PL 0/0:43:87:43:0,87,1590 ...

SAMPLE2.gvcf

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE2

chr10 57686 . T G, 40.64 . BaseQRankSum=-1.958;DP=44;ExcessHet=0.0000;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-3.333;RAW_MQandDP=70798,44;ReadPosRankSum=-1.385 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|1:40,4,0:44:48:0|1:57659_A_T:48,0,1668,168,1680,1848:57659:17,23,3,1 ...

SAMPLE3.gvcf

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE3

chr10 57686 . T A, 81.64 . BaseQRankSum=0.326;DP=60;ExcessHet=0.0000;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-3.425;RAW_MQandDP=95382,60;ReadPosRankSum=-1.194 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|1:54,6,0:60:89:0|1:57659_A_T:89,0,2249,252,2267,2519:57659:18,36,1,5 ...

COMBINED.gvcf

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 SAMPLE3

chr10 57686 . T G,A 103.15 PASS AC=1,1;AF=9.615e-03,9.615e-03;AN=104;BaseQRankSum=0.326;DP=2312;ExcessHet=0.0424;FS=1.120;InbreedingCoeff=-0.0371;MLEAC=1,1;MLEAF=9.615e-03,9.615e-03;MQ=39.68;MQRankSum=-3.333e+00;QD=0.99;ReadPosRankSum=-1.385e+00;SOR=0.425;VQSLOD=-4.762e+00;culprit=QD GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:43,0,0:43:87:.:.:0,87,1590,87,1590,1590 0|1:40,4,0:44:48:0|1:57659_A_T:48,0,1668,168,1680,1848:57659 0|2:54,0,6:60:89:0|1:57659_A_T:89,252,2519,0,2267,2249:57659

When GATK is used, other information is provided too, but I am not entirely sure what the minimum requirements would be for GATK to combine different gVCF files. I agree that I could easily write a script to solve this issue, but I imagine you would benefit greatly from integrating this functionality into your tool yourself, because your tool would then be compatible with GATK, which is still the gold-standard solution. Just an idea. Tomorrow, I will look once more for an existing tool that might solve my VCF combining problem.

Glad to read that you are incorporating my feedback. I just used the Illumina variant calling option on BGI WGS data. I know the tool was not trained for this task, but that seems to be the best option for me for now.

anands-repo commented 2 years ago

Thanks for the feedback regarding usage with GATK. Will look into that for a future release, and hope you are able to find a simple solution to the VCF merging problem.

FYI (in case you need to run more stuff): Latest repo is basically the branch named repo (git pull && git checkout repo). It goes with the docker image oddjobs/hello_deps.
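
Picking up the update should look roughly like this (exact commands untested; the singularity build step assumes the new image converts directly from Docker Hub, which is what "singularity friendly" suggests):

# Update the clone to the new branch and fetch the new dependency image
cd hello && git pull && git checkout repo
docker pull oddjobs/hello_deps
# Or build a singularity image straight from Docker Hub
singularity build hello_deps.sif docker://oddjobs/hello_deps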

robertzeibich commented 2 years ago

There is a simple solution: vcf-merge (http://vcftools.sourceforge.net/perl_module.html#vcf-merge).
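
For the record, vcf-merge expects bgzip-compressed, tabix-indexed inputs, so a typical invocation looks roughly like this (file names are placeholders):

bgzip SAMPLE1.vcf && tabix -p vcf SAMPLE1.vcf.gz
bgzip SAMPLE2.vcf && tabix -p vcf SAMPLE2.vcf.gz
vcf-merge SAMPLE1.vcf.gz SAMPLE2.vcf.gz > merged.vcf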

There is another thing I noticed when using your tool: other variant callers use the sample name given in the BAM file, e.g. XXX003, whereas your tool just uses SAMPLE1 in the header. This can easily be fixed with a sed command at my end, but I just wanted to let you know.
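
Such a rename, restricted to the #CHROM header line so that data lines are untouched, could look roughly like this (XXX003 is the example name from above; bcftools reheader would be an alternative if available):

# Replace the placeholder sample name in the VCF header only
sed -i '/^#CHROM/s/SAMPLE1/XXX003/' results.mean.vcf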

Thanks for the FYI.

robertzeibich commented 2 years ago

I just stumbled upon something else. When looking at the VCF files produced by other variant callers, I see that they contain many more contig entries.

...
##contig=<ID=chr22,length=50818468>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
##contig=<ID=chrM,length=16569>
##contig=<ID=chr1_KI270706v1_random,length=175055>
##contig=<ID=chr1_KI270707v1_random,length=32032>
##contig=<ID=chr1_KI270708v1_random,length=127682>
##contig=<ID=chr1_KI270709v1_random,length=66860>
##contig=<ID=chr1_KI270710v1_random,length=40176>
##contig=<ID=chr1_KI270711v1_random,length=42210>
##contig=<ID=chr1_KI270712v1_random,length=176043>
##contig=<ID=chr1_KI270713v1_random,length=40745>
##contig=<ID=chr1_KI270714v1_random,length=41717>
##contig=<ID=chr2_KI270715v1_random,length=161471>
##contig=<ID=chr2_KI270716v1_random,length=153799>
##contig=<ID=chr3_GL000221v1_random,length=155397>
...

Do you know if these other chromosomes/contig entries (e.g., chr3_GL000221v1_random) can be created with your tool as well?

robertzeibich commented 2 years ago

You can ignore my last post for now. I will update what I have written once I have a better understanding of another tool I want to use.

anands-repo commented 2 years ago

You can pass any contigs through the --chromosomes option, if I am not mistaken. The list of contigs can be obtained from the reference fasta.
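
If the reference has a samtools .fai index alongside it (an assumption), the contig names are its first column and can be joined into the comma-separated form that --chromosomes expects:

# List all contig names, then join them with commas for --chromosomes
cut -f1 resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta.fai | paste -sd, -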

robertzeibich commented 2 years ago

I just used chr1 ... chrY and replaced the contigs in the VCF file. Afterwards, I still had to filter the VCF file, because I found strange entries in the REF column:

AAAGCACAAAN, AN, B, CN, GN, K, M, N, R, RGCGG, S, TGRGCG, TN, W, WG, Y, YS

To be honest, at first I thought your tool was not working correctly, but when I checked the positions, I found that these alleles/nucleotides and sequences are indeed present in the FASTA file. I am not sure if you want to filter these out in the future. I would.
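
Such a filter can also be applied downstream, for example by keeping only header lines and records whose REF consists of plain A/C/G/T bases (a sketch; whether dropping these sites is appropriate depends on the analysis):

# Keep header lines and records whose REF (column 4) contains only A, C, G, or T
awk -F'\t' '$0 ~ /^#/ || $4 ~ /^[ACGT]+$/' results.mean.vcf > results.mean.acgt_only.vcf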

anands-repo commented 2 years ago

I see. Thanks for letting me know. Will keep the suggestions in mind for future updates.

anands-repo commented 2 years ago

Right. I think the singularity commands and the mount structure need to be simplified.

ngtanthanh commented 1 year ago

.