MikkelSchubert / paleomix

Pipelines and tools for the processing of ancient and modern HTS data.
https://paleomix.readthedocs.io/en/stable/
MIT License

Duplicated reads error #55

Closed · laneatmore closed this issue 4 months ago

laneatmore commented 5 months ago

Hi, I have been using paleomix for a long time (thanks for this program!) and am now trying to use it on some data I downloaded from SRA. Every time I run the pipeline, I get an error that there are duplicated reads in my BAM files, and am prompted to run paleomix dupcheck:

14:13:35 INFO [4/10] Started detecting duplicate input in 2 files in './M-Toku_1/Oket_v2/M-Toku_1'
14:13:36 ERROR NodeError while detecting duplicate input in 2 files in './M-Toku_1/Oket_v2/M-Toku_1':
14:13:36 INFO Saving error logs to '/scratch/st-cspeller-1/paleomix/bam_pipeline.20240222_015623_01.log'
14:13:36 ERROR     The same read was found multiple times at position 13983656 on 'NC_068421.1'!
14:13:36 ERROR         Name:      'SRR11048435.42748732'
14:13:36 ERROR         Sequence:  'TACAGAGTAGAGACAGATATAGACTACAGAGTAGAGACAGACATAGACTACAGGTTAGAGACAGATATAGACTACAGATTAGTGACAGATAAAGACTACAGAGTAGAGACAGACATAGACTACAGAGTAGAGACAGATATAGACTACAGAT'
14:13:36 ERROR         Qualities: '???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????'
14:13:36 ERROR     
14:13:36 ERROR     Read was found in these BAM files:
14:13:36 ERROR        1 mate 1 read, 1 mate 2 read in './M-Toku_1/Oket_v2/M-Toku_1/M-Toku_1_1.rmdup.normal.bam'
14:13:36 ERROR     
14:13:36 ERROR     This indicates that the same data has been included multiple times in the project. This can be because multiple copies of the same files were used, or because one or more files contain multiple copies of the same reads. The command 'paleomix dupcheck' may be used to review the potentially duplicated data in these BAM files.
14:13:36 ERROR     
14:13:36 ERROR     If this error was a false positive, then you may execute the following command(s) to mark this test as having succeeded:
14:13:36 ERROR     $ touch './M-Toku_1/Oket_v2/M-Toku_1/M-Toku_1_1.duplications_checked'
14:13:36 ERROR NodeError while detecting duplicate input in 2 files in './M-Toku_1/Oket_v2/M-Toku_1':
14:13:36 ERROR     The same read was found multiple times at position 13983656 on 'NC_068421.1'!
14:13:36 ERROR         Name:      'SRR11048435.42748732'
14:13:36 ERROR         Sequence:  'TACAGAGTAGAGACAGATATAGACTACAGAGTAGAGACAGACATAGACTACAGGTTAGAGACAGATATAGACTACAGATTAGTGACAGATAAAGACTACAGAGTAGAGACAGACATAGACTACAGAGTAGAGACAGATATAGACTACAGAT'
14:13:36 ERROR         Qualities: '???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????'
14:13:36 ERROR     
14:13:36 ERROR     Read was found in these BAM files:
14:13:36 ERROR        1 mate 1 read, 1 mate 2 read in './M-Toku_1/Oket_v2/M-Toku_1/M-Toku_1_1.rmdup.normal.bam'
14:13:36 ERROR     
14:13:36 ERROR     This indicates that the same data has been included multiple times in the project. This can be because multiple copies of the same files were used, or because one or more files contain multiple copies of the same reads. The command 'paleomix dupcheck' may be used to review the potentially duplicated data in these BAM files.
14:13:36 ERROR     
14:13:36 ERROR     If this error was a false positive, then you may execute the following command(s) to mark this test as having succeeded:
14:13:36 ERROR     $ touch './M-Toku_1/Oket_v2.duplications_checked'
14:18:29 INFO [1/3] Finished calculating coverage for ./M-Toku_1/Oket_v2/M-Toku_1/M-Toku_1_1.rmdup.normal.bam

When I try to run dupcheck, paleomix says it can't find the command. I've tried checking for duplicated reads in the FASTQ files with FastQC and seqkit, and deduplicating with czid-dedup, but it hasn't made a difference. I've also tried setting PCRDuplicates to 'mark' instead of 'filter', but I still get the same results.

Following this page, I would expect there to be a problem with library merging, but I only have one paired set of FASTQ files, so I think this is quite unlikely.

I've independently run Picard MarkDuplicates on the M-Toku_1_1.rmdup.normal.bam output from paleomix, and the output was as follows:

Feb 23, 2024 10:29:46 AM com.intel.gkl.NativeLibraryLoader load
INFO: Loading libgkl_compression.so from jar:file:/arc/home/latmore/install/jar_root/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Feb 23 10:29:46 PST 2024] MarkDuplicates INPUT=[M-Toku_1_1.rmdup.normal.bam] OUTPUT=M-Toku_1_1.rmdup.normal.markdup.bam METRICS_FILE=M-Toku_1_1.rmdup.normal.markdup.metrics.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false FLOW_MODE=false FLOW_QUALITY_SUM_STRATEGY=false USE_END_IN_UNPAIRED_READS=false USE_UNPAIRED_CLIPPED_END=false UNPAIRED_END_UNCERTAINTY=0 FLOW_SKIP_FIRST_N_FLOWS=0 FLOW_Q_IS_KNOWN_END=false FLOW_EFFECTIVE_QUALITY_THRESHOLD=15 ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Feb 23 10:29:53 PST 2024] Executing as latmore@se166 on Linux 3.10.0-1160.108.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 17.0.2+8-86; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: 3.1.1-10-g48545af74-SNAPSHOT
INFO    2024-02-23 10:29:53     MarkDuplicates  Start of doWork freeMemory: 120237616; totalMemory: 129892352; maxMemory: 2075918336
INFO    2024-02-23 10:29:53     MarkDuplicates  Reading input file and constructing read end information.
INFO    2024-02-23 10:29:53     MarkDuplicates  Will retain up to 7521443 data points before spilling to disk.
WARNING 2024-02-23 10:29:54     AbstractOpticalDuplicateFinderCommandLineProgram        A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR11048435.46621791. Cause: String 'SRR11048435.46621791' did not start with a parsable number.
INFO    2024-02-23 10:30:00     MarkDuplicates  Read     1,000,000 records.  Elapsed time: 00:00:06s.  Time for last 1,000,000:    6s.  Last read position: NC_068421.1:57,372,380
INFO    2024-02-23 10:30:00     MarkDuplicates  Tracking 38173 as yet unmatched pairs. 22953 records in RAM.
INFO    2024-02-23 10:30:05     MarkDuplicates  Read     2,000,000 records.  Elapsed time: 00:00:11s.  Time for last 1,000,000:    4s.  Last read position: NC_068422.1:17,209,794
INFO    2024-02-23 10:30:05     MarkDuplicates  Tracking 79255 as yet unmatched pairs. 7440 records in RAM.
INFO    2024-02-23 10:30:09     MarkDuplicates  Read     3,000,000 records.  Elapsed time: 00:00:15s.  Time for last 1,000,000:    4s.  Last read position: NC_068422.1:63,511,990
INFO    2024-02-23 10:30:09     MarkDuplicates  Tracking 98849 as yet unmatched pairs. 14294 records in RAM.
INFO    2024-02-23 10:30:14     MarkDuplicates  Read     4,000,000 records.  Elapsed time: 00:00:20s.  Time for last 1,000,000:    4s.  Last read position: NC_068424.1:6,914,389
INFO    2024-02-23 10:30:14     MarkDuplicates  Tracking 140106 as yet unmatched pairs. 7912 records in RAM.
INFO    2024-02-23 10:30:18     MarkDuplicates  Read     5,000,000 records.  Elapsed time: 00:00:24s.  Time for last 1,000,000:    4s.  Last read position: NC_068424.1:58,049,535
INFO    2024-02-23 10:30:18     MarkDuplicates  Tracking 168664 as yet unmatched pairs. 22621 records in RAM.
INFO    2024-02-23 10:30:22     MarkDuplicates  Read     6,000,000 records.  Elapsed time: 00:00:28s.  Time for last 1,000,000:    4s.  Last read position: NC_068425.1:29,222,665
INFO    2024-02-23 10:30:22     MarkDuplicates  Tracking 207276 as yet unmatched pairs. 8687 records in RAM.
INFO    2024-02-23 10:30:28     MarkDuplicates  Read     7,000,000 records.  Elapsed time: 00:00:34s.  Time for last 1,000,000:    5s.  Last read position: NC_068427.1:2,757,165
INFO    2024-02-23 10:30:28     MarkDuplicates  Tracking 252991 as yet unmatched pairs. 2659 records in RAM.
INFO    2024-02-23 10:30:41     MarkDuplicates  Read     8,000,000 records.  Elapsed time: 00:00:47s.  Time for last 1,000,000:   12s.  Last read position: NC_068428.1:319,762
INFO    2024-02-23 10:30:41     MarkDuplicates  Tracking 286384 as yet unmatched pairs. 2469 records in RAM.
INFO    2024-02-23 10:30:45     MarkDuplicates  Read     9,000,000 records.  Elapsed time: 00:00:51s.  Time for last 1,000,000:    3s.  Last read position: NC_068429.1:11,392,553
INFO    2024-02-23 10:30:45     MarkDuplicates  Tracking 329684 as yet unmatched pairs. 10352 records in RAM.
INFO    2024-02-23 10:30:49     MarkDuplicates  Read    10,000,000 records.  Elapsed time: 00:00:55s.  Time for last 1,000,000:    4s.  Last read position: NC_068430.1:21,123,994
INFO    2024-02-23 10:30:49     MarkDuplicates  Tracking 365230 as yet unmatched pairs. 16868 records in RAM.
INFO    2024-02-23 10:30:58     MarkDuplicates  Read    11,000,000 records.  Elapsed time: 00:01:04s.  Time for last 1,000,000:    8s.  Last read position: NC_068430.1:68,068,383
INFO    2024-02-23 10:30:58     MarkDuplicates  Tracking 385315 as yet unmatched pairs. 24956 records in RAM.
INFO    2024-02-23 10:31:09     MarkDuplicates  Read    12,000,000 records.  Elapsed time: 00:01:15s.  Time for last 1,000,000:   11s.  Last read position: NC_068431.1:41,805,776
INFO    2024-02-23 10:31:09     MarkDuplicates  Tracking 428028 as yet unmatched pairs. 17617 records in RAM.
INFO    2024-02-23 10:31:22     MarkDuplicates  Read    13,000,000 records.  Elapsed time: 00:01:28s.  Time for last 1,000,000:   12s.  Last read position: NC_068432.1:32,984,505
INFO    2024-02-23 10:31:22     MarkDuplicates  Tracking 445664 as yet unmatched pairs. 8236 records in RAM.
INFO    2024-02-23 10:31:42     MarkDuplicates  Read    14,000,000 records.  Elapsed time: 00:01:48s.  Time for last 1,000,000:   20s.  Last read position: NC_068433.1:36,893,095
INFO    2024-02-23 10:31:42     MarkDuplicates  Tracking 488607 as yet unmatched pairs. 18174 records in RAM.
INFO    2024-02-23 10:34:22     MarkDuplicates  Read    15,000,000 records.  Elapsed time: 00:04:28s.  Time for last 1,000,000:  159s.  Last read position: NC_068434.1:40,294,952
INFO    2024-02-23 10:34:22     MarkDuplicates  Tracking 521022 as yet unmatched pairs. 21624 records in RAM.
[Fri Feb 23 10:37:38 PST 2024] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 7.87 minutes.
Runtime.totalMemory()=2075918336

I'm trying to sort out whether there is a paleomix issue, whether there is something particular about this data (e.g., possible polyploidy), or whether the files are somehow corrupted. Is dupcheck still a viable command? Or are there other reasons I might be seeing this error? How does paleomix deal with polyploidy?

Thanks!

My makefile is copied here:

# -*- mode: Yaml; -*-
# Timestamp: 2016-06-06T11:46:01.793329
#
# Makefile for analysing walrus samples
#
# Default options.
# Can also be specific for a set of samples, libraries, and lanes,
# by including the "Options" hierarchy at the same level as those
# samples, libraries, or lanes below. This does not include
# "Features", which may only be specified globally.
Options:
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33
  # Split a lane into multiple entries, one for each (pair of) file(s)
  # found using the search-string specified for a given lane. Each
  # lane is named by adding a number to the end of the given barcode.
  SplitLanesByFilenames: yes
  # Compression format for FASTQ reads; 'gz' for GZip, 'bz2' for BZip2
  CompressionFormat: gz

  # Settings for trimming of reads, see AdapterRemoval man-page
  AdapterRemoval:
     # Adapter sequences, set and uncomment to override defaults
     --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
     --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTNNNNNNNNGTGTAGATCTCGGTGGTCGCCGTATCATT
     # Some BAM pipeline defaults differ from AR defaults;
     # To override, change these value(s):
     --mm: 3
     --minlength: 25
     # Extra features enabled by default; change 'yes' to 'no' to disable
     --collapse: yes
     --trimns: yes
     --trimqualities: yes

  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

    # Settings for mappings performed using BWA
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: mem
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 25
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # May be disabled ("no") for aDNA alignments, as post-mortem damage
      # localizes to the seed region, which BWA expects to have few
      # errors (sets "-l"). See http://pmid.us/22574660
      UseSeed: no
      # Additional command-line options may be specified for the "aln"
      # call(s), as described for Bowtie2 below.

    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
#      --trim5: 5
#      --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
#      --rg:
#        - CN:SequencingCenterNameHere
#        - DS:DescriptionOfReadGroup

  # Mark / filter PCR duplicates. If set to 'filter', PCR duplicates are
  # removed from the output files; if set to 'mark', PCR duplicates are
  # flagged with bit 0x400, and not removed from the output files; if set to
  # 'no', the reads are assumed to not have been amplified. Collapsed reads
  # are filtered using the command 'paleomix rmdup_collapsed', while "normal"
  # reads are filtered using Picard MarkDuplicates.
  PCRDuplicates: filter

  # Carry out quality base re-scaling of libraries using mapDamage
  # This will be done using the options set for mapDamage below
  RescaleQualities: no
  # Command-line options for mapDamage; note that the long-form
  # options are expected; --length, not -l, etc. Uncomment the
  # "mapDamage" line adding command-line options below.
  mapDamage:
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 1000000

  # Set to 'yes' to exclude a type of trimmed reads from alignment / analysis;
  # possible read-types reflect the output of AdapterRemoval
  ExcludeReads:
    Single: no              # Single-ended reads / Orphaned paired-ended reads
    Paired: no              # Paired ended reads
    Singleton: no           # Paired reads for which the mate was discarded
    Collapsed: no           # Overlapping paired-ended reads collapsed into a
                            # single sequence by AdapterRemoval
    CollapsedTruncated: no  # Like 'Collapsed', except that the reads were
                            # truncated due to the presence of ambiguous
                            # or low-quality bases at read termini.

  # Optional steps to perform during processing
  Features:
    PCRDuplicates: filter

    mapDamage: plot      # Generate mapDamage plot for each (unrealigned) library
                        #   Location: {Destination}/{Target}.{Genome}.mapDamage/{Library}/
    Coverage: yes       # Generate coverage information for the raw BAM (wo/ indel realignment)
                        #   Location: {Destination}/{Target}.{Genome}.coverage
    Depths: yes         # Generate histogram of number of sites with a given read-depth
                        #   Location: {Destination}/{Target}.{Genome}.depths
    Summary: yes        # Generate summary table for each target
                        #   Location: {Destination}/{Target}.summary
    DuplicateHist: no   # Generate histogram of PCR duplicates, for use with PreSeq
                        #   Location: {Destination}/{Target}.{Genome}.duphist/{Library}/

# Map of prefixes (ref genomes) by name
Prefixes:
  Oket_v2:
    Path: /scratch/st-cspeller-1/paleomix/ref_genome/GCF_023373465.1_Oket_V2_genomic.fasta

M-Toku_1:
  M-Toku_1:
    M-Toku_1_1:
      M-Toku_1: /arc/project/st-cspeller-1/sra-out/fastq/SRR11048435_{Pair}.fastq.gz
MikkelSchubert commented 5 months ago

Hi,

To give a bit of background, this check is in place to catch cases where you accidentally include multiple copies of the same FASTQ files in your makefile, or where people have actually uploaded multiple copies of the same data to repositories. So not PCR duplicates, but 1:1 copies of FASTQ files. Picard will sometimes, but not always, detect this problem, depending on where the duplicate data is located and on the read type, which is why I added a separate check.
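
In makefile terms, the first case would look something like this (a made-up example; the sample, library, and lane names and the path are placeholders):

Sample_X:
  Sample_X:
    Library_1:
      Lane_1: /path/to/reads_{Pair}.fastq.gz
      Lane_2: /path/to/reads_{Pair}.fastq.gz   # same files included twice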

Both of these are situations I've run into, and each is typically a sign of somebody having misorganized data. In the first case, you'll want to review which files you are using and, at a minimum, make sure that you are not including the same data multiple times. In the second case, you typically need to contact the original creators of the data and ask them to re-upload it.
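
A quick way to rule out the first case is to checksum the input files and look for byte-identical copies; something like this (illustrative only, with placeholder paths):

$ # -w32 compares only the checksum column; -D prints every line sharing a checksum
$ md5sum /path/to/fastq/*.fastq.gz | sort | uniq -D -w32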

However, the check does have false positives, and you may have run into one such instance. The paleomix dupcheck command is supposed to help you figure that out by giving you a more detailed report, but I appear to have accidentally made it inaccessible. Thanks for letting me know; I'll fix that! In the meantime, you can run the dupcheck command directly via python3 -m paleomix.tools.dupcheck. If you have installed paleomix in a virtual or conda environment, then you need to activate it first.
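
For example (assuming a conda environment named 'paleomix', and that the tool accepts the BAM file(s) to check as arguments):

$ conda activate paleomix
$ python3 -m paleomix.tools.dupcheck ./M-Toku_1/Oket_v2/M-Toku_1/M-Toku_1_1.rmdup.normal.bam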

That will print every identical set of reads and give you a total count as well. If you only see a small number of identical reads, then they are probably just false positives. In that case, you can simply run the touch commands shown in the error messages to skip the checks. But if a significant portion of the data is duplicated, then it could indicate a problem with the dataset, as described above.
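
For reference, in your case the touch commands from the error output above would be:

$ touch './M-Toku_1/Oket_v2/M-Toku_1/M-Toku_1_1.duplications_checked'
$ touch './M-Toku_1/Oket_v2.duplications_checked'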

Best, Mikkel

laneatmore commented 5 months ago

Hi Mikkel,

Thanks so much for the detailed reply. With dupcheck I can see that there appear to be 50-100 duplicates per individual. Not a huge proportion of the data, but it is consistent across all samples. I've re-downloaded all the data and checked my makefiles, and it seems like it's a problem with the original upload. I'll try to get that sorted out.

For the time being, the "touch" command to get around the error does seem to be working, thanks!

Best, Lane

MikkelSchubert commented 5 months ago

Hi Lane,

If it is just a few hundred reads per sample then I wouldn't worry about it.

I don't know how or why it happens, but I have previously observed completely identical mates occurring at a (very) low rate in some datasets, as seems to be the case here judging by the first error you shared. My guess is that this is a problem that occurs during sequencing or base-calling.

If so, then I'm not sure you could even do anything about it, except perhaps filter the duplicate sequences. But given the low numbers, it is probably not worthwhile to do so.
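
If you did want to filter them, one option is to drop read pairs whose mates have identical sequences before mapping. A minimal shell sketch (untested on your data; it assumes the two FASTQ files are in sync, and it only catches mates that are identical as written in the FASTQ, not reverse complements):

$ # linearize both mate files, keep pairs whose sequences differ, rebuild mate 1
$ paste <(zcat SRR11048435_1.fastq.gz | paste - - - -) \
        <(zcat SRR11048435_2.fastq.gz | paste - - - -) \
    | awk -F'\t' '$2 != $6' | cut -f1-4 | tr '\t' '\n' | gzip > SRR11048435_1.filtered.fastq.gz
$ # same again for mate 2 (fields 5-8)
$ paste <(zcat SRR11048435_1.fastq.gz | paste - - - -) \
        <(zcat SRR11048435_2.fastq.gz | paste - - - -) \
    | awk -F'\t' '$2 != $6' | cut -f5-8 | tr '\t' '\n' | gzip > SRR11048435_2.filtered.fastq.gz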

Best, Mikkel