anands-repo / hello

DNN-based small variant caller
MIT License

Illumina variant calling failed #6

Open Mahmoudbassuoni opened 2 years ago

Mahmoudbassuoni commented 2 years ago

Hi @anands-repo, I am trying to run variant calling with the Illumina model, but I keep getting output files without a single call in them and I don't know what's wrong. I am using the same command that worked before, just with a different BAM file, and every time it ends with this message:

Computers / CPU cores / Max jobs to run
1:local / 16 / 1
ETA: 0s Left: 0 AVG: 0.00s  0
INFO:root:Completed runs, checking log files
INFO:root:All commands completed correctly, preparing vcf
Search string /tmp/vcftemp/*expert0.vcf
Found 0 files
Running command ['vcf-sort', '/home/bassyouni/Documents/VC/hello/results.ngs.tmp.vcf']
sort -k1,1d -k2,2n
Search string /tmp/vcftemp/*expert1.vcf
Found 0 files
Running command ['vcf-sort', '/home/bassyouni/Documents/VC/hello/results.tgs.tmp.vcf']
sort -k1,1d -k2,2n
Search string /tmp/vcftemp/*expert2.vcf
Found 0 files
Running command ['vcf-sort', '/home/bassyouni/Documents/VC/hello/results.ngs_tgs.tmp.vcf']
sort -k1,1d -k2,2n
Search string /tmp/vcftemp/*best.vcf
Found 0 files
Running command ['vcf-sort', '/home/bassyouni/Documents/VC/hello/results.best.tmp.vcf']
sort -k1,1d -k2,2n
Search string /tmp/vcftemp/*mean.vcf
Found 0 files
Running command ['vcf-sort', '/home/bassyouni/Documents/VC/hello/results.mean.tmp.vcf']
sort -k1,1d -k2,2n
Choice histogram = {0: 0, 1: 0, 2: 0}

anands-repo commented 2 years ago

I received a similar comment from another user. In that case the issue was that some of the paths were not accessible inside the container. Please check that all paths are accessible inside the container, including the BAM file(s), the reference FASTA, and the neural network model paths.
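
As a generic illustration (not a hello-specific command), bind-mounting the data directory and listing it from inside the container is a quick way to confirm that the paths resolve; the image name and data path below are placeholders:

# Placeholders: replace IMAGE and /path/to/data with the actual image and data directory
docker run -it --rm -v /path/to/data:/path/to/data IMAGE ls -lh /path/to/data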

Besides this, I suggest running on a single chromosome to make sure that case works.
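
One way to set up such a single-chromosome test, independent of hello's own options (not shown here), is to subset the BAM with samtools; the file names below are placeholders:

# The input BAM must be coordinate-sorted and indexed for region queries
samtools view -b input.bam chr20 > input.chr20.bam
samtools index input.chr20.bam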

Mahmoudbassuoni commented 2 years ago

Hi @anands-repo, I decided to go outside the container and ran it again on single chromosomes, but I still get the same result. I ran it before with a different dataset and it worked fine, so I am not sure what could be wrong.

anands-repo commented 2 years ago

I pushed a new version of the code under the branch repo. It is under development, but I have tested it on Illumina reads for individual chromosomes and it works. It goes with the docker image oddjobs/hello_deps. This branch and docker image have the following features, which may help with debugging issues of the type you are facing:

  1. New docker image that is singularity friendly (it can be converted to singularity directly, without having to hack the image; see the sketch at the end of this comment)
  2. Better logging to convey exactly what is going on (no GNU parallel, use of progress bars, much more concise and readable messages)
  3. Better error and exception handling, avoiding failures that do not reveal what the problem actually is

Algorithms and models are the same as before.

PS: the workdir will need to be cleaned up before each rerun.
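
For Singularity users, a minimal sketch of the conversion and of the cleanup mentioned above, assuming a Singularity/Apptainer version that can pull directly from Docker Hub (the workdir path is a placeholder):

# Convert the Docker image into a Singularity image file (produces hello_deps_latest.sif)
singularity pull docker://oddjobs/hello_deps

# Clean the work directory before each rerun
rm -rf workdir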

Mahmoudbassuoni commented 2 years ago

Thanks @anands-repo, I will give it a try and see what happens.

maryawood commented 1 year ago

Hello,

I'm running into a similar problem of having VCF results without any calls in them. The program looks to be proceeding normally when I run it inside the docker container:

2022-08-26 19:41:50,599 INFO:Launching hotspot detection for chromosome chr16
Hotspots, chromosome chr16: 100%|██████████| 501/501 [03:50<00:00,  2.17it/s]
2022-08-26 19:45:41,115 INFO:Combining all hotspots and sharding
2022-08-26 19:45:42,965 INFO:Completed sharding, creating caller commands for running NN
2022-08-26 19:45:42,966 INFO:Created call commands for chromosome chr16
2022-08-26 19:45:42,966 INFO:Finished hotspot generation for all chromosomes. Launching all caller commands.
Caller progress: 0it [00:00, ?it/s]
2022-08-26 19:45:42,967 INFO:Completed runs, checking log files
2022-08-26 19:45:42,967 INFO:Completed running all caller commands correctly, preparing vcf
VCF prep: 0it [00:00, ?it/s]
sort -k1,1d -k2,2n
2022-08-26 19:45:43,010 INFO:Completed runs. Results in local/results.output.vcf

But the VCF file does not contain any calls. I've checked that the BAM is readable inside the container, so I don't think that's the problem. I've also tried to install the software outside of the docker container to rule that out, but I'm running into many installation issues. Would it be possible to make the Dockerfile available so I could see the requirements needed for installation?

anands-repo commented 1 year ago

Hi,

Please find the information below

Dockerfile:

FROM ubuntu:bionic

# Initial installations
RUN apt-get -y update && apt-get -y install wget build-essential vim tmux vcftools bedtools

# Path where everything happens
WORKDIR /opt

# Install miniconda3
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
RUN bash Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -p /opt/miniconda3
ENV PATH=/opt/miniconda3/bin:$PATH

# Install boost with boost::python
COPY user-config.jam /root/boost_1_69_0/tools/build/src/user-config.jam
RUN wget https://boostorg.jfrog.io/artifactory/main/release/1.69.0/source/boost_1_69_0.tar.gz && \
        tar xvfz boost_1_69_0.tar.gz && cd boost_1_69_0 && ./bootstrap.sh --prefix=/opt/boost && \
        ./b2 install && rm -rf boost_1_69_0*

# Install conda packages
RUN conda install pytorch==1.2.0 torchvision==0.4.0 -c pytorch && \
        conda install -y -c bioconda pysam pybedtools && \
        pip install pyvcf scikit-bio && \
        conda install -y biopython cmake h5py intervaltree more-itertools scikit-learn matplotlib tqdm

# Additional environment settings (weird that it doesn't work by itself)
RUN ldconfig
ENV LD_LIBRARY_PATH=/opt/boost/lib:$LD_LIBRARY_PATH

# Cleanup
RUN rm -rf /tmp/*

The Dockerfile uses the following user-config.jam file:

using python
    : 3.7
    : /opt/miniconda3/bin/python
    : /opt/miniconda3/include/python3.7m
;
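
For anyone rebuilding the image locally, a minimal sketch, assuming the Dockerfile and user-config.jam sit side by side in the current directory (the tag name is arbitrary):

# The COPY step expects user-config.jam next to the Dockerfile
docker build -t hello_deps:local .
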
anands-repo commented 1 year ago

Regarding the issue: from the logs, it looks like the initial heuristic, which filters reads and shortlists putative loci for the neural network to run on, is not returning anything. The following script can be run standalone to see what is happening:

python \
    python/HotspotDetectorDVFiltered.py \
        --bam BAMFILE \
        --ref REFERENCE \
        --region chr16 \
        --mapq_threshold 5 \
        --output /tmp/hotspots.json > /tmp/hotspots_logs.log 2>&1
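
A couple of quick checks on the files named above, using standard shell tools only:

# A non-empty JSON file means the heuristic found candidate loci
wc -c /tmp/hotspots.json

# Inspect the end of the log for errors or warnings
tail -n 20 /tmp/hotspots_logs.log
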
maryawood commented 1 year ago

Thanks for your response! I'm still trying to get a local install working for further testing based on your Dockerfile, but I'm running into some issues with conda permissions related to the install location. In the meantime, I tried the hotspot command you shared from within the available docker container. Even when I run it with the --debug flag, all I get in the log file is this:

2022-08-29 22:12:32,644 Started script
2022-08-29 22:16:40,987 Completed running the script

Should I be seeing additional messages?

anands-repo commented 1 year ago

Yes, if the --debug flag is used, there will be a lot of text displayed (or, in this case, written to the log file). Did you notice any output in the /tmp/hotspots.json file, or is it still empty?

If it is empty, my best guess is that the reads are being filtered out at the quality control stage and not being used in any analysis. These are the read quality control filters:

    usable    = True;
    usable    = usable and not alignment.is_unmapped;  # Remove unmapped reads
    usable    = usable and not (alignment.is_secondary or alignment.is_supplementary);  # Remove secondary or supplementary alignments
    usable    = usable and not alignment.is_duplicate;  # Remove duplicate reads
    usable    = usable and (not alignment.is_paired or alignment.is_proper_pair);  # Remove improper pairs in the case of paired alignments
    usable    = usable and (alignment.mapping_quality > 0);  # Remove 0-mapping quality reads
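
To check how many reads in a problematic region actually survive these filters, here is a small standalone pysam sketch that mirrors the conditions above rather than calling hello's own code; the BAM path and region are placeholders:

import pysam

# Placeholders: point these at the real BAM (coordinate-sorted and indexed) and region
bam_path = "sample.bam"
region = "chr16"

total = 0
usable = 0

with pysam.AlignmentFile(bam_path, "rb") as bam:
    for alignment in bam.fetch(region):
        total += 1
        keep = not alignment.is_unmapped                                            # remove unmapped reads
        keep = keep and not (alignment.is_secondary or alignment.is_supplementary)  # remove secondary/supplementary alignments
        keep = keep and not alignment.is_duplicate                                  # remove duplicate reads
        keep = keep and (not alignment.is_paired or alignment.is_proper_pair)       # remove improper pairs
        keep = keep and (alignment.mapping_quality > 0)                             # remove 0-mapping-quality reads
        if keep:
            usable += 1

print(f"{usable} of {total} reads pass the filters in {region}")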