MikkelSchubert / paleomix

Pipelines and tools for the processing of ancient and modern HTS data.
https://paleomix.readthedocs.io/en/stable/
MIT License

"Could not retrieve index file" #36

Closed ab08028 closed 4 years ago

ab08028 commented 4 years ago

Hi there, paleomix team! Thanks as always for such a great pipeline 👍 I'm running into an odd error that I've never seen before, and I'm not sure if I need to worry about it. I've successfully run paleomix dozens of times on my old lab server, but I'm running it on a new server now and am getting this non-fatal error:

[E::idx_find_and_load] Could not retrieve index file for '/net/harris/vol1/home/beichman/bears/paleomix/paleomix/004_UARC_IT_APP2/004_UARC_IT_APP2/brown_bear/004_UARC_IT_APP2/Lib_S2.rmdup.collapsed.bam
[E::idx_find_and_load] Could not retrieve index file for '/net/harris/vol1/home/beichman/bears/paleomix/paleomix/004_UARC_IT_APP2/004_UARC_IT_APP2/brown_bear/004_UARC_IT_APP2/Lib_S2.rmdup.normal.bam

The pipeline does not fail that node and appears to complete successfully (final BAM files are output). However, I wanted to check with you that nothing insidious is going on due to the missing BAM index at this stage. Looking back at past runs of paleomix, I believe .bai files were output at that stage when I ran it previously.

Any ideas of what might be going on and if I need to worry about it? Thanks so much for your advice! :)

Software versions:

- paleomix 1.2.14
- samtools 1.9
- bwa 0.7.17
- R 3.6.1
- picard 2.21.7

Makefile pasted below:


# -*- mode: Yaml; -*-
# Timestamp: 2018-07-02T10:11:43.849578
#
# Default options.
# Can also be specific for a set of samples, libraries, and lanes,
# by including the "Options" hierarchy at the same level as those
# samples, libraries, or lanes below. This does not include
# "Features", which may only be specific globally.
Options:
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33
  # Split a lane into multiple entries, one for each (pair of) file(s)
  # found using the search-string specified for a given lane. Each
  # lane is named by adding a number to the end of the given barcode.
  SplitLanesByFilenames: yes
  # Compression format for FASTQ reads; 'gz' for GZip, 'bz2' for BZip2
  CompressionFormat: bz2

  # Settings for trimming of reads, see AdapterRemoval man-page
  AdapterRemoval:
     # Adapter sequences, set and uncomment to override defaults
#     --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
#     --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
     # Some BAM pipeline defaults differ from AR defaults;
     # To override, change these value(s):
     --mm: 3
     --minlength: 25
     # Extra features enabled by default; change 'yes' to 'no' to disable
     --collapse: yes
     --trimns: yes
     --trimqualities: yes

  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

    # Settings for mappings performed using BWA
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: mem
      # Filter aligned reads with a mapping quality (Phred) below this value
      # 20180702: AB changed MinQuality from 0 --> 30 (recommended by paleomix)
      MinQuality: 30
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
      # Post-mortem damage localizes to the seed region, which BWA expects to
      # have few errors (sets "-l"). See http://pmid.us/22574660
      # 20180702: AB changed UseSeed to 'no' for aDNA
      UseSeed: yes
      # Additional command-line options may be specified for the "aln"
      # call(s), as described below for Bowtie2 below.

    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
#      --trim5: 5
#      --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
#      --rg:
#        - CN:SequencingCenterNameHere
#        - DS:DescriptionOfReadGroup

  # Mark / filter PCR duplicates. If set to 'filter', PCR duplicates are
  # removed from the output files; if set to 'mark', PCR duplicates are
  # flagged with bit 0x400, and not removed from the output files; if set to
  # 'no', the reads are assumed to not have been amplified. Collapsed reads
  # are filtered using the command 'paleomix rmdup_duplicates', while "normal"
  # reads are filtered using Picard MarkDuplicates.
  PCRDuplicates: filter

  # Command-line options for mapDamage; note that the long-form
  # options are expected; --length, not -l, etc. Uncomment the
  # "mapDamage" line adding command-line options below.
  mapDamage:
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

  # Set to 'yes' exclude a type of trimmed reads from alignment / analysis;
  # possible read-types reflect the output of AdapterRemoval
  ExcludeReads:
    # Exclude single-end reads (yes / no)?
    Single: no
    # Exclude non-collapsed paired-end reads (yes / no)?
    Paired: no
    # Exclude paired-end reads for which the mate was discarded (yes / no)?
    Singleton: no
    # Exclude overlapping paired-ended reads collapsed into a single sequence
    # by AdapterRemoval (yes / no)?
    Collapsed: no
    # Like 'Collapsed', but only for collapsed reads truncated due to the
    # presence of ambiguous or low quality bases at read termini (yes / no).
    CollapsedTruncated: no

  # Optional steps to perform during processing.
  Features:
    # Generate BAM without realignment around indels (yes / no)
    RawBAM: yes
    # Generate indel-realigned BAM using the GATK Indel realigner (yes / no)
    RealignedBAM: no
    # To disable mapDamage, write 'no'; to generate basic mapDamage plots,
    # write 'plot'; to build post-mortem damage models, write 'model',
    # and to produce rescaled BAMs, write 'rescale'. The 'model' option
    # includes the 'plot' output, and the 'rescale' option includes both
    # 'plot' and 'model' results. All analyses are carried out per library.
    mapDamage: no
    # Generate coverage information for the raw BAM (wo/ indel realignment).
    # If one or more 'RegionsOfInterest' have been specified for a prefix,
    # additional coverage files are generated for each alignment (yes / no)
    Coverage: yes
    # Generate histogram of number of sites with a given read-depth, from 0
    # to 200. If one or more 'RegionsOfInterest' have been specified for a
    # prefix, additional histograms are generated for each alignment (yes / no)
    Depths: yes
    # Generate summary table for each target (yes / no)
    Summary: yes
    # Generate histogram of PCR duplicates, for use with PreSeq (yes / no)
    DuplicateHist: no

# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
  # Replace 'NAME_OF_PREFIX' with name of the prefix; this name
  # is used in summary statistics and as part of output filenames.
  # southern sea otter:
  polar_bear:
    # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the
    # references against which reads are to be mapped. Using the same name
    # as filename is strongly recommended (e.g. /path/to/Human_g1k_v37.fasta
    # should be named 'Human_g1k_v37').
    Path: /net/harris/vol1/home/beichman/reference_genomes/polar_bear/polar_bear.fasta
  # northern sea otter
  brown_bear:
    # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the
    # references against which reads are to be mapped. Using the same name
    # as filename is strongly recommended (e.g. /path/to/Human_g1k_v37.fasta
    # should be named 'Human_g1k_v37').
    Path: /net/harris/vol1/home/beichman/reference_genomes/brown_bear/brown_bear.fasta

    # (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a
    # .bed file listing extra regions for which coverage / depth statistics
    # should be calculated; if no names are specified for the BED records,
    # results are named after the chromosome / contig. Change 'NAME' to the
    # name to be used in summary statistics and output filenames.
#    RegionsOfInterest:
#      NAME: PATH_TO_BEDFILE

# Mapping targets are specified using the following structure. Uncomment and
 #replace 'NAME_OF_TARGET' with the desired prefix for filenames.
004_UARC_IT_APP2:
  #Uncomment and replace 'NAME_OF_SAMPLE' with the name of this sample.
  004_UARC_IT_APP2:
    #Uncomment and replace 'NAME_OF_LIBRARY' with the name of this sample.
    Lib_S14:
      #Uncomment and replace 'NAME_OF_LANE' with the name of this lane,
      #and replace 'PATH_WITH_WILDCARDS' with the path to the FASTQ files
      #to be trimmed and mapped for this lane (may include wildcards).
      SRR5878348: /net/harris/vol1/home/beichman/bears/fastqs.fromENA.nobackup/SRR5878348_{Pair}.fastq.gz
    Lib_S2:
      #Uncomment and replace 'NAME_OF_LANE' with the name of this lane,
      #and replace 'PATH_WITH_WILDCARDS' with the path to the FASTQ files
      #to be trimmed and mapped for this lane (may include wildcards).
      SRR5878360: /net/harris/vol1/home/beichman/bears/fastqs.fromENA.nobackup/SRR5878360_{Pair}.fastq.gz

MikkelSchubert commented 4 years ago

Hi,

As far as I can tell, the "could not retrieve index file" error message is simply a side-effect of the pipeline not indexing BAMs unless that index is needed for the final product or for an intermediate step. I am not sure why it has been turned into a visible error, but it should be harmless.

If you do run into problems, then do not hesitate to open a new issue.

Best regards, Mikkel
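
For readers who would still like index files next to these intermediate BAMs, for example to open them in a viewer, the following is a minimal sketch and not part of paleomix. It assumes an index sits beside its BAM as either `<name>.bam.bai` (the name `samtools index` writes by default) or `<name>.bai` (a naming some other tools use). The helper name `bams_missing_index` is hypothetical. The script only prints the `samtools index` commands and does not run them.

```python
from pathlib import Path
import shlex
import sys


def bams_missing_index(root):
    """Return sorted BAM paths under `root` that have no index file beside them."""
    missing = []
    for bam in sorted(Path(root).rglob("*.bam")):
        # Assumed index names: "<name>.bam.bai" or "<name>.bai".
        candidates = (Path(str(bam) + ".bai"), bam.with_suffix(".bai"))
        if not any(candidate.exists() for candidate in candidates):
            missing.append(bam)
    return missing


if __name__ == "__main__":
    # Directory to scan; defaults to the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for bam in bams_missing_index(root):
        # Print the command instead of running it, so it can be reviewed first.
        print("samtools index", shlex.quote(str(bam)))
```

Saved under a name of your choosing and pointed at the pipeline's output directory, it lists one command per unindexed BAM. Whether indexing these intermediate files is worthwhile depends on what you plan to do with them; as noted above, the pipeline itself does not need the indexes.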