`--output_vcf` parameter doesn't work when including a relative path

Fer020707 commented 2 years ago

Hi, i am trying to run the program using:

nextflow run iarcbioinfo/needlestack -with-docker --bed bed/SOPHIA_targetregions_hg19.bed --input_bams bam/SOPHIA/ --ref ref/ucsc.hg19.fasta --output_vcf output/SOPHIA_needlestack.vcf

The bed file is sorted, the bam files are in the corresponding folder with their respective .bai, and in the reference folder are the .fasta, .fai, .dict, .amb, .ann, .bwt, . pac and .sa.

But I get an error in the execution process. I would appreciate your response, thank you.

Error message:

executor >  local (3)
[76/9e2dc1] process > bed                                         [100%] 1 of 1 ✔
[18/a87fe0] process > split_bed                                   [100%] 1 of 1 ✔
[91/4ac861] process > mpileup2vcf (chr1_17345370-chrX_14883636... [  0%] 0 of 1
[-        ] process > collect_vcf_result                          -
Error executing process > 'mpileup2vcf (chr1_17345370-chrX_14883636_regions)'

Caused by:
  Process `mpileup2vcf (chr1_17345370-chrX_14883636_regions)` terminated with an error exit status (1)

Command executed:

  for cur_bam in BAM/*.bam
  do
      if [ "BAM" == "FILE" ]; then
          # use bam file name as sample name
          bam_file_name=$(basename "${cur_bam%.*}")
          # remove whitespaces from name
          SM1="$(echo -e "${bam_file_name}" | tr -d '[[:space:]]')"
          SM2="$(echo -e "${bam_file_name}" | tr -d '[[:space:]]')"
      else
          # get bam file names
   bam_file_name=$(basename "${cur_bam%.*}")
   # remove whitespaces from name
   SM1="$(echo -e "${bam_file_name}" | tr -d '[[:space:]]')"
   # extract sample name from bam file read group info field
   SM2=$(samtools view -H $cur_bam | grep "^@RG" | tail -n1 | sed "s/.*SM:\([^  ]*\).*/\1/" | tr -d '[:space:]')
      fi

      printf "$SM1  $SM2\n" >> names.txt
  done

  if [ "FALSE" != "FALSE" ]; then
      abs_pairs_file=$(readlink -f FALSE)
  else
      abs_pairs_file="FALSE"
  fi
  set -o pipefail
  i=1

  ...

  { while read bed_line; do
      samtools mpileup --fasta-ref ucsc.hg19.fasta --region $bed_line --ignore-RG --min-BQ 13 --min-MQ 0 --max-idepth 1000000 --max-depth 50000 BAM/*.bam | sed 's/     /   *   */g'
      i=$((i+1))
  done < chr1_17345370-chrX_14883636_regions
  } | mpileup2readcounts 0 -5 false 3 0 | Rscript /home/fer/.nextflow/assets/iarcbioinfo/needlestack/bin/needlestack.r --pairs_file=${abs_pairs_file} --source_path=/home/fer/.nextflow/assets/iarcbioinfo/needlestack/bin/ --out_file=chr1_17345370-chrX_14883636_regions.vcf --fasta_ref=ucsc.hg19.fasta --input_bams=BAM/ --ref_genome=null --GQ_threshold=50 --min_coverage=30 --min_reads=3 --min_af=0 --SB_type=SOR --SB_threshold_SNV=100 --SB_threshold_indel=100 --output_all_SNVs=false --do_plots=ALL --do_alignments=false --plot_labels=true --add_contours=true --extra_rob=false --min_af_extra_rob=0.2 --min_prop_extra_rob=0.1 --max_prop_extra_rob=0.5 --afmin_power=-1 --sigma=0.1

Command exit status:
  1

...

Command error:
  [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
  [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
  ...
  [warning] samtools mpileup option `L` is functional, but deprecated. Please switch to using bcftools mpileup in future.
  [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
  [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
..
  [mpileup] fail to load index for BAM/ISEM114.recal.reads.bam
  Error : Pileup file is empty

Work dir:
  /home/fer/Documents/work/91/4ac861e494a55ec631f77ea31b0ce6

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

mfoll commented 2 years ago

Hi, These are errors produced by samtools when trying to extract read counts information. There might be an issue with your BAM file(s): [W::bam_hdr_read] EOF marker is absent. The input is probably truncated Maybe try to do a samtools quickcheck? (https://www.htslib.org/doc/samtools-quickcheck.html) Can you also check this particular .bai is present: [mpileup] fail to load index for BAM/ISEM114.recal.reads.bam

Fer020707 commented 2 years ago

Thank you very much for your reply. Download the bam files again, and run the program again. Apparently the previous problem was solved, but it shows me another error when collecting the results of the vcf

Error message:


executor >  local (4)
[f5/bc2a25] process > bed                                         [100%] 1 of 1 ✔
[45/b0edda] process > split_bed                                   [100%] 1 of 1 ✔
[7b/5fbeff] process > mpileup2vcf (chr1_17345370-chrX_14883636... [100%] 1 of 1 ✔
[84/dfbf6c] process > collect_vcf_result                          [  0%] 0 of 1
Error executing process > 'collect_vcf_result'

Caused by:
  Process `collect_vcf_result` terminated with an error exit status (1)

Command executed:

  mkdir VCF
        mv *.vcf VCF
        # Extract the header from the first VCF
        sed '/^#CHROM/q' VCF/chr1_17345370-chrX_14883636_regions.vcf > header.txt

        # Add contigs in the VCF header
        cat ucsc.hg19.fasta.fai | cut -f1,2 | sed -e 's/^/##contig=<ID=/' -e 's/[    ][  ]*/,length=/' -e 's/$/>/' > contigs.txt
...
Command executed:

  mkdir VCF
        mv *.vcf VCF
        # Extract the header from the first VCF
        sed '/^#CHROM/q' VCF/chr1_17345370-chrX_14883636_regions.vcf > header.txt

        # Add contigs in the VCF header
        cat ucsc.hg19.fasta.fai | cut -f1,2 | sed -e 's/^/##contig=<ID=/' -e 's/[    ][  ]*/,length=/' -e 's/$/>/' > contigs.txt
        sed -i '/##reference=.*/ r contigs.txt' header.txt

        # Add version numbers in the VCF header
        echo '##command=nextflow run iarcbioinfo/needlestack -with-docker --bed bed/SOPHIA_targetregions_hg19_ORDERED.bed --input_bams bam/SOPHIA/ --ref ref/ucsc.hg19.fasta --output_vcf output/SOPHIA_needlestack.vcf' > versions.txt
...
# Check if sort command allows sorting in natural order (chr1 chr2 chr10 instead of chr1 chr10 chr2)
        if [ `sort --help | grep -c 'version-sort' ` == 0 ]
        then
            sort_ops="-k1,1d"
        else
            sort_ops="-k1,1V"
        fi
        # Add all VCF contents and sort
        grep --no-filename -v '^#' VCF/*.vcf | LC_ALL=C sort -t '   ' $sort_ops -k2,2n >> header.txt
        mv header.txt output/SOPHIA_needlestack.vcf
...

``

mfoll commented 2 years ago

I think the actual error message has been cut, could copy the content of the .command.err file in the work/84/dfbf6c... folder? Also can you check the content of the VCF/chr1_17345370-chrX_14883636_regions.vcf file in the same folder? This last process is simply merging all VCF produced when splitting your bed into multiple regions, but in your case you have only one .

Fer020707 commented 2 years ago

When reviewing the .command.sh file, the last command could not be executed (mv header.txt output/all_variants.vcf) because at the time of executing that command it was in the path work/84/dfbf6c.. and therefore it could not find the specified output path, since it had used a relative path at the time of executing the program. By correcting that, the program worked perfectly. Thank you very much.

mfoll commented 2 years ago

Indeed the --output_vcf parameter is supposed to be only the name of the VCF file (not including its path) but indeed that's confusing. You can use it together with the --output_folder parameter. We made this choice because there are other outputs but we could easily also allow the --output_vcf parameter to include a relative path. Thanks for spotting this, I hope you will find needlestack useful!

Fer020707 commented 2 years ago

Thank you very much for everything, excellent program

IARCbioinfo / needlestack

`--output_vcf` parameter doesn't work when including a relative path #192