Closed. robertzeibich closed this issue 2 years ago.
I have human WGS data sequenced by BGI. Is my data appropriate for HELLO? If yes, which pipeline (Illumina or PacBio) should I use?
For shorter reads, the Illumina models may be used, and for longer reads, the PacBio HiFi models. However, the models are trained on Illumina and PacBio HiFi error profiles and are not optimized for other sequencing platforms. I would say that if another tool has a model/algorithm trained on BGI reads, it will likely perform better than HELLO.
Thanks for your quick response. I want to run the Illumina variant calling option on our high performance cluster (HPC) in a singularity container for safety reasons.
However, currently, I am trying to understand how to get the docker container running. Steps already taken:
Can you tell me how to use the docker container?
When do I need to build the tool? I git cloned the repository and tried to build the tool (cmake . && make -j 12), but without success.
I am also a little confused by the note "To properly download models, git-lfs needs to be installed. Once installed, please do git lfs pull inside the git repo". Is git-lfs already installed in the docker container?
The docker image only contains the dependencies, not hello itself. Take a look at the following steps to build it:
# Launch container
docker run -it oddjobs/hello_image.x86_64 /bin/bash
# Inside container
apt-get -y update && apt-get -y install git && git clone https://github.com/anands-repo/hello.git
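# Install git-lfs manually (it is not pre-installed in the image) and use it to pull the model files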
mkdir git-lfs
cd git-lfs/
wget https://github.com/git-lfs/git-lfs/releases/download/v3.1.2/git-lfs-linux-amd64-v3.1.2.tar.gz # Or another version of git lfs
tar xvfz git-lfs-linux-amd64-v3.1.2.tar.gz
./install.sh
cd
git lfs install
cd hello/
git lfs pull
cmake . && make -j 12
Thanks for your quick response again. I am getting closer, but I am not there yet.
I executed the steps you recommended. Thereafter, I ran docker commit to make the changes permanent, and then tried to convert the docker image into a singularity container using singularity 3.9.2.
Steps:
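(The exact commands were not captured here; a sketch of what the commit-and-convert step described above typically looks like, with illustrative container and image names:)
# Illustrative names only - the real container ID and image tag will differ
docker commit <container_id> hello_built:latest
# Convert the committed Docker image into a Singularity image (Singularity 3.9.2)
singularity build hello.sif docker-daemon://hello_built:latest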
I noticed that in the docker container HELLO is stored under /root/hello. In the singularity container I do not have access to /root. Do you have an idea what I could do? For security reasons, I can only run HELLO on our HPC inside a singularity container.
This is a bit of a sticky problem. Does singularity only play well with certain paths like /opt? Currently some dependencies are in /usr etc., if I am not mistaken - I do not know whether those are also blocked by singularity.
I don't know whether the following will work, but I can think of three steps to try, assuming that /opt works fine - if not, replace it with a path that does work (steps 2 and 3 apply if /usr is not supported by singularity):
1. Move miniconda3 to /opt (it is currently in /root). After the move, the PATH environment variable may need to be changed to add the new path of miniconda3/bin. Refer: https://www.anaconda.com/blog/moving-conda-environments
2. In /opt, install (build from source) Boost with boost::python support. Note that Boost with boost::python support needs a few changes to the standard Boost compilation flow. I do not remember off the top of my head, but I believe I used Boost 1.68.
3. Change the CMakeLists.txt file in hello accordingly (see the commented lines at the start of the file) and rebuild.
Step 2 can lead to some difficulties - it may need some trial and error to get everything working together (the Boost installation, CMakeLists.txt, etc.), especially given that there already exists a Boost installation.
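For step 2, a rough sketch of a Boost build with boost::python enabled is shown below; the version, paths, and flags are assumptions, not the exact flow used here:
cd /opt
tar xf boost_1_68_0.tar.gz    # source archive downloaded from boost.org
cd boost_1_68_0
# Point the build at the miniconda interpreter (assumed to have been moved to /opt in step 1)
./bootstrap.sh --prefix=/opt/boost --with-python=/opt/miniconda3/bin/python3
./b2 --with-python install
# If the wrong Python is picked up, adjust the "using python" line in project-config.jam.
# Then update hello's CMakeLists.txt as described in step 3 and re-run: cmake . && make -j 12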
Thanks for the reply. I also got some IT support at my end now. Hopefully, we will get the tool running based on your suggestions.
I also have some other questions which I would like to have answered while we are working on getting the tool running. How do I use num_threads? Is there a minimum or maximum limit? I guess workdir points to the output directory. Is that correct?
num_threads is the number of CPU cores to be used in each run. Correct - workdir points to the output directory.
The tool is running, but I currently get the following error message (slurm output):
ETA: 0s Left: 1 AVG: 0.63s local:1/10985/100%/0.6s ETA: 0s Left: 0 AVG: 0.63s local:0/10986/100%/0.6s INFO:root:Completed runs, checking log files ERROR:root:File /projects/xm41/robert/hello/GALF003/features_bams_GALF003bam__/features0.log doesn't have termination string
In the workdir, the following files were created: features_bams_GALF003bam hotspotsbamsGALF003bamchr16 hotspotsbamsGALF003bamchr3 hotspotsbamsGALF003bamchr1 hotspotsbamsGALF003bamchr17 hotspotsbamsGALF003bamchr4 hotspotsbamsGALF003bamchr10 hotspotsbamsGALF003bamchr18 hotspotsbamsGALF003bamchr5 hotspotsbamsGALF003bamchr11 hotspotsbamsGALF003bamchr19 hotspotsbamsGALF003bamchr6 hotspotsbamsGALF003bamchr12 hotspotsbamsGALF003bamchr2 hotspotsbamsGALF003bamchr7 hotspotsbamsGALF003bamchr13 hotspotsbamsGALF003bamchr20 hotspotsbamsGALF003bamchr8 hotspotsbamsGALF003bamchr14 hotspotsbamsGALF003bamchr21 hotspotsbamsGALF003bamchr9 hotspotsbamsGALF003bamchr15 hotspotsbamsGALF003__bamchr22
Do you have an idea what might be wrong and how I can fix this issue? Do you have an example somewhere that I could try out? I would just like to make sure that the tool runs as it is supposed to.
If you want to try an Illumina dataset for sanity checking, you can use one from Genome in a Bottle. I haven't used them directly, but I believe one of those should work. However, we also need to check whether the installation is correct - if there is an issue with the installation, neither dataset will produce output on the platform you are using.
It may be easier to send me the run command and the contents of /projects/xm41/robert/hello/GALF003/features_bams_GALF003bam__/features0.log
My suspicion is that boost::python isn't the right version or hasn't been installed correctly. A standard Boost installation or build from source doesn't come with boost::python enabled. The build files need to be modified a little for that, and it wasn't well documented when I tried to install it.
Which version of boost::Python do you recommend using/installing? Do you know how I need to modify the build files?
I guess my problem has something to do with the following directory: /tmp/vcftemp
1) WITHOUT /tmp/vcftemp bound. If I run the script without /tmp/vcftemp bound to the container, I receive the following error. Script:
REF="/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"
NUM_THREADS=8
workdir="/projects/xm41/robert/hello"
hello="/usr/local/hello/709b7c3/"
mkdir -p /tmp/vcftemp
module load singularity/3.9.2
#-B /tmp/vcftemp:/tmp/vcftemp
singularity run -B /fs02 -B /fs04 -B /fs03:/fs03 -B /projects:/projects -B /scratch:/scratch ${hello}/hello.sif \
python /opt/hello/python/call.py \
--ibam GALF003.p0.03.bam \
--ref $REF \
--network models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn \
--num_threads $NUM_THREADS \
--workdir $workdir/GALF003_1x_tmp --mapq_threshold 5
Error:
INFO:root:Combining all hotspots and sharding
2022-04-16 10:57:52,214 Directory /projects/xm41/robert/hello/GALF003_1x_tmp/hotspots_hello___GALF003__p0__03__bam_chr22_ exists
2022-04-16 10:57:52,214 Counting hotspots
2022-04-16 10:57:52,214 Sharding with >= 0 items per cluster
INFO:root:Completed sharding, creating caller commands for dumping training data
INFO:root:Created call command for chromosome chr22
INFO:root:Launching all caller commands
Computers / CPU cores / Max jobs to run
1:local / 8 / 1
ETA: 0s Left: 0 AVG: 0.00s 0
INFO:root:Completed runs, checking log files
INFO:root:All commands completed correctly, preparing vcf
sort -k1,1d -k2,2n
sort -k1,1d -k2,2n
Search string /tmp/vcftemp/*expert0.vcf
Found 0 files
Running command ['vcf-sort', '/projects/xm41/robert/hello/GALF003_1x_tmp/results.ngs.tmp.vcf']
Search string /tmp/vcftemp/*expert1.vcf
Found 0 files
Running command ['vcf-sort', '/projects/xm41/robert/hello/GALF003_1x_tmp/results.tgs.tmp.vcf']
Search string /tmp/vcftemp/*expert2.vcf
Found 0 files
Running command ['vcf-sort', '/projects/xm41/robert/hello/GALF003_1x_tmp/results.ngs_tgs.tmp.vcf']
Search string /tmp/vcftemp/*best.vcf
Found 0 files
Running command ['vcf-sort', '/projects/xm41/robert/hello/GALF003_1x_tmp/results.best.tmp.vcf']
Search string /tmp/vcftemp/*mean.vcf
Found 0 files
Running command ['vcf-sort', '/projects/xm41/robert/hello/GALF003_1x_tmp/results.mean.tmp.vcf']
Choice histogram = {0: 0, 1: 0, 2: 0}
2) WITH /tmp/vcftemp bound. If I run the script with /tmp/vcftemp bound to the container, I receive the following error. Script:
REF="/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"
NUM_THREADS=8
workdir="/projects/xm41/robert/hello"
hello="/usr/local/hello/709b7c3/"
mkdir -p /tmp/vcftemp
module load singularity/3.9.2
singularity run -B /fs02 -B /fs04 -B /fs03:/fs03 -B /projects:/projects -B /scratch:/scratch -B /tmp/vcftemp:/tmp/vcftemp ${hello}/hello.sif \
python /opt/hello/python/call.py \
--ibam GALF003.p0.03.bam \
--ref $REF \
--network models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn \
--num_threads $NUM_THREADS \
--workdir $workdir/GALF003_1x_tmp --mapq_threshold 5
Error:
ETA: 0s Left: 4 AVG: 0.03s local:4/497/100%/0.0s
ETA: 0s Left: 1 AVG: 0.03s local:1/500/100%/0.0s
ETA: 0s Left: 0 AVG: 0.03s local:0/501/100%/0.0s
INFO:root:Combining all hotspots and sharding
2022-04-16 10:45:26,049 Directory /projects/xm41/robert/hello/GALF003_1x_tmp/hotspots_hello___GALF003__p0__03__bam_chr22_ exists
2022-04-16 10:45:26,050 Counting hotspots
2022-04-16 10:45:26,050 Sharding with >= 0 items per cluster
INFO:root:Completed sharding, creating caller commands for dumping training data
INFO:root:Created call command for chromosome chr22
INFO:root:Launching all caller commands
Computers / CPU cores / Max jobs to run
1:local / 8 / 1
ETA: 0s Left: 0 AVG: 0.00s 0
INFO:root:Completed runs, checking log files
INFO:root:All commands completed correctly, preparing vcf
Traceback (most recent call last):
File "/opt/hello/python/prepareVcf.py", line 255, in <module>
shutil.rmtree(args.tmpdir);
File "/opt/miniconda3/lib/python3.7/shutil.py", line 498, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/opt/miniconda3/lib/python3.7/shutil.py", line 496, in rmtree
os.rmdir(path)
OSError: [Errno 16] Device or resource busy: '/tmp/vcftemp'
Unfortunately, I cannot reproduce the problem I mentioned in my previous message. However, when I look in the folder /projects/xm41/robert/hello/GALF003/features_bams_GALF003bam__/, the file features0.log does not exist.
I also had a chat with our technical support for the high performance cluster. The person I talked to wanted to know whether it is possible to use /scratch instead of /tmp, because of space limitations and because /tmp is not shared between nodes.
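One speculative workaround (not suggested in the thread, and using an assumed scratch path): the rmtree failure above is typical when /tmp/vcftemp is itself a bind-mount point, so binding a scratch directory over /tmp - which would also address the space and node-sharing concerns - might avoid it:
# Assumed scratch location; any writable, node-local or shared path would do
mkdir -p /scratch/xm41/robert/tmp
singularity run -B /projects:/projects -B /scratch:/scratch \
    -B /scratch/xm41/robert/tmp:/tmp \
    /usr/local/hello/709b7c3/hello.sif \
    python /opt/hello/python/call.py \
    --ibam /scratch/xm41/ct/bamsDown/step1_bams1x/GALF003_1x.bam \
    --ref /scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta \
    --network models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn \
    --num_threads 8 \
    --workdir /projects/xm41/robert/hello/GALF003_1x_tmp --mapq_threshold 5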
Right. I think the singularity commands and the mount structure need to be simplified.
Sorry, the previous mail was sent prematurely.
For all inputs not in the home directory, can we check whether the inputs exist? For example, run "du -sh filename" inside the singularity container. Can we direct all outputs to your home directory? I believe singularity auto-mounts your home directory at the same path inside a container, so there is no need to mount anything for this purpose - paths in your home directory should work directly inside the singularity container (I could be wrong on this, depending on the cluster setup). I would like the singularity mount issues to be resolved, otherwise we could be debugging the wrong issue. I also see use of both relative and absolute paths. Let's make all paths absolute.
NOTE: Please clean up the output directory before each run.
The inputs exist inside the singularity container:
3.1G /scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
16K /opt/hello/python/call.py
4.5G /scratch/xm41/ct/bamsDown/step1_bams1x/GALF003_1x.bam
Access to /projects is also given.
I cannot use my home directory because I only have 10 GB available. When I run call.py with absolute paths and workdir /projects/xm41/robert/hello/GALF003_1x_test4/ inside the singularity container, output files are created, but the bed and vcf files have no entries - just a header with no variants listed. The console output is the same as in my previous post under "1) WITHOUT /tmp/vcftemp bound".
How I entered the singularity container:
module load singularity/3.9.2
singularity run -B /fs02 -B /fs04 -B /fs03:/fs03 -B /projects:/projects -B /scratch:/scratch /usr/local/hello/709b7c3/hello.sif
Code used inside the singularity container:
/opt/hello/python/call.py \
--ibam /scratch/xm41/ct/bamsDown/step1_bams1x/GALF003_1x.bam \
--ref /scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta \
--network models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn \
--num_threads 36 \
--workdir /projects/xm41/robert/hello/GALF003_1x_test4 --mapq_threshold 5
I would also be available for a Zoom call to sort this out. I am in Melbourne. What is your time zone?
Are you using a 1x coverage BAM file? A whole-genome BAM of 4.5 GB with the suffix _1x seems to indicate as much.
Anyway, let’s continue - unfortunately, I am not able to do a Zoom for several weeks as I am away on a family matter. What I can do is review some logs etc on my phone for these several weeks. My schedule is non-existent and I pick up any odd 5 minutes available at some point in time during the day.
There are .sh and .log files inside the features and hotspot directories. They are numbered (features0.sh, features0.log, features1.sh, and so on).
Ideally, I would want the full working directory for a small chromosome, say chromosome 22, but I would need access to a computer to look at a tar.gz. If you think it's doable, please share a tar.gz of the workdir for chromosome 22. I will look for computer access.
I just read your message, and I am currently looking for a BAM file because I am not allowed to share data from the server. I am following your Genome in a Bottle idea: https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/ChineseTrio/alignment.index.ChineseTrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38_NHGRI_04062016.HG007
I thought of downloading the first one: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG007_NA24695-hu38168_mother/NA24695_Mother_HiSeq100x/NHGRI_Illumina100X_Chinesetrio_novoalign_bams/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam
Do you think this could work? If you have another file in mind, please let me know.
Take a look at this:
ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/
They have 60x BAM files, which must be quite a lot smaller.
HELLO works on 30x WGS BGI data and on the same data downsampled to 1x coverage. The error was that I was not using the correct network path. Takeaway: check all paths. Thanks for your help again. Do you know whether I need to run some kind of quality control (filtering) on the VCF files, like VQSR, or is that already covered by the tool?
Great to hear that!
There is no additional filtering to be run on the VCF. But there is a small error in the VCF header: some VCF processing commands work without fixing it, but some may not. In any case, it's very simple to fix, and how to fix it is given at the end of the README file, if you scroll down.
Thanks again for debugging it. It's nice to hear it worked out in your use case.
A final note: by default HELLO calls only chromosomes 1-22. To change this behavior, one would need to pass the --chromosomes option.
Will close the issue in a couple of days, if you do not have other questions.
Thanks for pointing out the VCF fix and for telling me about the chromosomes option. How do I use the chromosomes option? --chromosomes 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y?
Do I need to run the tool again or can I include other chromosomes later?
It depends on how the chromosomes are named in the reference and the BAM file. If the chromosomes are named "1,2,3", then what you wrote here is correct. If the chromosomes are named "chr1,chr2, ..." then you would use --chromosomes "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY"
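For instance, combined with the invocation used earlier in this thread (the workdir name here is illustrative):
python /opt/hello/python/call.py \
    --ibam /scratch/xm41/ct/bamsDown/step1_bams1x/GALF003_1x.bam \
    --ref /scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta \
    --network models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn \
    --num_threads 36 \
    --workdir /projects/xm41/robert/hello/GALF003_1x_allchr \
    --mapq_threshold 5 \
    --chromosomes "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY"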
That works perfectly. I was able to add chrX and chrY. I also have to admit that the tool runs much faster than CPU-based DeepVariant and GATK.
What are the differences between these VCF files: results.best.vcf, results.mean.vcf, results.ngs_tgs.vcf, results.ngs.vcf, results.tgs.vcf? How can I create a VCF file containing all variants?
This is the correct output VCF: results.mean.vcf
Please reference the last 3 paragraphs or so of the README to see how to use this file.
I believe that in your correct_string, Number and String must be swapped. However, I now have corrected VCF files for multiple samples. When using GATK for variant calling, I was able to combine the individual gVCF files into one gVCF using GenomicsDBImport --> GenotypeGVCFs --> ... I noticed that your tool does not produce gVCF files. Do you have an idea how I could combine the VCF files produced by your tool into one VCF file, or generate gVCF files from them?
Info: when I use GenomicsDBImport, I receive the following error message: "A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at chr2:289039 in VCF." I have done some digging online, but before starting to figure out a workaround, I thought I would ask you, because you might already have a solution.
I guess the problem you are describing is that you have multiple VCF files from multiple runs and would like to combine them into one VCF file. The issue with MNPs is probably that HELLO lists more than two alleles but selects only two of them in the SAMPLE1 column. In such a case, the unlisted allele(s) may be removed from the file. For example, if REF=A, ALT=C,G,T and SAMPLE1=0/1, then we can rewrite this as REF=A, ALT=C, SAMPLE1=0/1. I do not have a script which does this automatically, but it should not be very hard to write one, at least in theory.
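One possible way to do this without a custom script (assuming bcftools is available; this is not part of HELLO) is to let bcftools drop ALT alleles that are not referenced by the genotype and renumber GT:
# results.mean.vcf is the HELLO output VCF discussed above; adjust the name as needed
bcftools view --trim-alt-alleles -o results.mean.trimmed.vcf results.mean.vcf
# Check that per-allele fields such as AD and PL look sensible after trimming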
By the way, I finished a new version of HELLO which may interest you. It incorporates many changes based mainly on the issues you faced, so thanks for your feedback. All the algorithms and models remain the same, but several convenience features have been implemented.
I will have to update the repo tomorrow as I am waiting for the docker image to upload overnight (I am in the Pacific time zone, US). Also, it is currently tested only for variant calling (not training) and only for Illumina data. But I think that serves your purpose, if you intend to run it again.
I just went through the files to put an example together:
SAMPLE1.gvcf
chr10 57686 . T
SAMPLE2.gvcf
chr10 57686 . T G,
SAMPLE3.gvcf
chr10 57686 . T A,
COMBINED.gvcf
chr10 57686 . T G,A 103.15 PASS AC=1,1;AF=9.615e-03,9.615e-03;AN=104;BaseQRankSum=0.326;DP=2312;ExcessHet=0.0424;FS=1.120;InbreedingCoeff=-0.0371;MLEAC=1,1;MLEAF=9.615e-03,9.615e-03;MQ=39.68;MQRankSum=-3.333e+00;QD=0.99;ReadPosRankSum=-1.385e+00;SOR=0.425;VQSLOD=-4.762e+00;culprit=QD GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:43,0,0:43:87:.:.:0,87,1590,87,1590,1590 0|1:40,4,0:44:48:0|1:57659_A_T:48,0,1668,168,1680,1848:57659 0|2:54,0,6:60:89:0|1:57659_A_T:89,252,2519,0,2267,2249:57659
When GATK is used, other information is provided too, but I am not entirely sure what the minimum requirements are for GATK to combine different gVCF files. I could easily write a script to solve this issue - I could not agree more with you on that. But I can imagine that you would greatly benefit from integrating this function into your tool yourself, because then your tool would be compatible with GATK, which is still the gold-standard solution. Just an idea. Tomorrow, I will check once more whether there is a tool out there that might solve my current VCF-combining problem.
Glad to read that you are incorporating my feedback. I just used the Illumina variant calling function on BGI WGS data. I know the tool was not trained for this data, but it seems to be the best option for me for now.
Thanks for the feedback regarding usage with GATK. Will look into that for a future release, and hope you are able to find a simple solution to the VCF merging problem.
FYI (in case you need to run more stuff): the latest repo is basically the branch named repo (git pull && git checkout repo). It goes with the docker image oddjobs/hello_deps.
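In concrete terms (assuming the clone from the build steps above and a standard docker pull):
cd hello
git pull
git checkout repo
docker pull oddjobs/hello_deps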
There is a simple solution: vcf-merge (http://vcftools.sourceforge.net/perl_module.html#vcf-merge).
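A typical vcf-merge invocation looks like the following (file names are illustrative; vcf-merge expects bgzip-compressed, tabix-indexed inputs):
bgzip SAMPLE1.vcf && tabix -p vcf SAMPLE1.vcf.gz
bgzip SAMPLE2.vcf && tabix -p vcf SAMPLE2.vcf.gz
vcf-merge SAMPLE1.vcf.gz SAMPLE2.vcf.gz | bgzip -c > merged.vcf.gz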
There is also another thing I noticed when using your tool: other variant callers use the sample name given in the BAM file, e.g. XXX003, whereas your tool just uses SAMPLE1 in the header. This can easily be fixed with a sed command at my end, but I just wanted to let you know.
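For reference, a sed one-liner of that kind (XXX003 is the illustrative sample name from above):
# Rename the SAMPLE1 column in the #CHROM header line to the real sample name
sed -i '/^#CHROM/ s/SAMPLE1/XXX003/' results.mean.vcf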
Thanks for the FYI.
I just stumbled upon something else. When looking at the VCF files produced by other variant callers, I see that they contain many more contig entries:
...
##contig=<ID=chr22,length=50818468>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
##contig=<ID=chrM,length=16569>
##contig=<ID=chr1_KI270706v1_random,length=175055>
##contig=<ID=chr1_KI270707v1_random,length=32032>
##contig=<ID=chr1_KI270708v1_random,length=127682>
##contig=<ID=chr1_KI270709v1_random,length=66860>
##contig=<ID=chr1_KI270710v1_random,length=40176>
##contig=<ID=chr1_KI270711v1_random,length=42210>
##contig=<ID=chr1_KI270712v1_random,length=176043>
##contig=<ID=chr1_KI270713v1_random,length=40745>
##contig=<ID=chr1_KI270714v1_random,length=41717>
##contig=<ID=chr2_KI270715v1_random,length=161471>
##contig=<ID=chr2_KI270716v1_random,length=153799>
##contig=<ID=chr3_GL000221v1_random,length=155397>
...
Do you know if these other chromosomes/contig entries (e.g., chr3_GL000221v1_random) can be created with your tool as well?
You can ignore my last post for now. I will update what I have written once I have a better understanding about another tool I want to use.
You can pass any contigs through the --chromosomes option, if I am not mistaken. The list of contigs can be obtained from the reference fasta.
I just used chr1 ... chrY and replaced the contigs in the VCF file. Thereafter, I still had to filter the VCF file because in the REF column, I found strange entries:
AAAGCACAAAN, AN, B, CN, GN, K, M, N, R, RGCGG, S, TGRGCG, TN, W, WG, Y, YS
To be honest, at the beginning I thought your tool was not working correctly, but when I checked the positions, I found that these alleles/nucleotides and sequences are indeed in the FASTA file. I am not sure if you want to filter these out in the future. I would.
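If useful, a simple filter of that kind could keep only header lines and records whose REF consists of A, C, G, T (file name illustrative):
awk -F'\t' '/^#/ || $4 ~ /^[ACGT]+$/' results.mean.vcf > results.mean.filtered.vcf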
I see. Thanks for letting me know. Will keep the suggestions in mind for future updates.