everhartlab / read-processing

A makefile and scripts for processing *S. sclerotiorum* illumina data
MIT License
0 stars 1 forks source link

Picard Error: First of pair flag should not be set for unpaired read #11

Closed zkamvar closed 7 years ago

zkamvar commented 7 years ago

I set a run last night and got the following errors marking duplicates in BAM files:

> grep -inr Exception runs/MARK-DUPS/*err
runs/MARK-DUPS/SS.11.10.err:20:Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 5183748, Read name K00133:265:HF75NBBXX:2:1101:1123:31716, First of pair flag should not be set for unpaired read.
runs/MARK-DUPS/SS.11.16.err:16:Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 3122379, Read name K00133:265:HF75NBBXX:2:1101:1113:12251, First of pair flag should not be set for unpaired read.
runs/MARK-DUPS/SS.11.35.err:12:Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 1875886, Read name K00133:265:HF75NBBXX:2:1101:1093:11091, First of pair flag should not be set for unpaired read.

This is clearly from the inclusion of unpaired reads.

zkamvar commented 7 years ago

Linkies:

zkamvar commented 7 years ago

This discussion may have some relevance:

https://gatkforums.broadinstitute.org/gatk/discussion/7814/error-mate-unmapped-flag-should-not-be-set-for-unpaired-reads-after-revertsam

zkamvar commented 7 years ago

I think the problem might be originating from here:

https://github.com/zkamvar/read-processing/blob/7b94651b669157c1d449f657690c0960251f4178/scripts/make-alignment.sh#L62-L71

If I look at the header for the bam file for the merged reads, I get the label SS.11.10_P:

kamvarz@c2419 09:36:23/work/everhartlab/kamvarz/ssc-processing (master)> samtools view -H BAMS/SS.11.10_fixed.bam | grep '@RG'
@RG ID:HF75NBBXX.2  SM:SS.11.10_P   PL:ILLUMINA LB:RUN.265

I'm going to try and use the same basename to see if that works

zkamvar commented 7 years ago

Dammit, i should have said "address #11"

zkamvar commented 7 years ago

Okay, so fixing that did not address the situation:

> samtools view BAMS/SS.11.16_fixed.bam | grep 'K00133:265:HF75NBBXX:2:1101:1113:12251'
K00133:265:HF75NBBXX:2:1101:1113:12251  80  CP017822.1_Sclerotinia_sclerotiorum_chromosome_9_complete_sequence  690388  42  70M *   0   0ATCACTACTAATCCATGAAGAGGATTATTACTCAGATGAAGAAGATGTGGGATTTATAGATGTCCCTGCG FJA7JJJJFA<JJJFJFFAJJJJFJJJJFFJJJFJFJFJFJFAJFJJJJJJFJJJJJJJFJJJFJFFFAA  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:70 YS:i:-3 YT:Z:DP RG:Z:HF75NBBXX.2
K00133:265:HF75NBBXX:2:1101:1113:12251  128 CP017829.1_Sclerotinia_sclerotiorum_chromosome_16_complete_sequence 761051  42  51M *   0   0TGTAGTTGCCATTTATTAACATACTTCTAATTAGATCCTTCTTGAATCGCC    AAAFF<FJFJJFJJJ<FJJJJ7FFFFFFFJJJJJJAFJ<7FAAFJJAFA-F AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:49T1   YS:i:0  YT:Z:DP RG:Z:HF75NBBXX.2

Here's the weird part, the two reads map to different chromosomes and, for the record, they are each others' pairs:

> zcat TRIM/SS.11.16_1P.fq.gz | grep -n K00133:265:HF75NBBXX:2:1101:1113:12251
73101:@K00133:265:HF75NBBXX:2:1101:1113:12251 1:N:0

> zcat TRIM/SS.11.16_2P.fq.gz | grep -n K00133:265:HF75NBBXX:2:1101:1113:12251
73101:@K00133:265:HF75NBBXX:2:1101:1113:12251 2:N:0

And they don't exist in the unpaired files, thus... these reads... are... not... unpaired?! 🤔 😡

zkamvar commented 7 years ago

so the link above has this quote:

That being said, the mixed read type BAM should have different read groups per data type. You can double-check this in the file header by grepping for the @RG lines

I'm trying to run samtools merge without the -c flag to see if this clears things up

zkamvar commented 7 years ago

[ narrator ] It did not clear things up

> picard ValidateSamFile I=BAMS/SS.11.16_fixed.bam MODE=SUMMARY
[Fri Oct 13 12:07:33 CDT 2017] picard.sam.ValidateSamFile INPUT=BAMS/SS.11.16_fixed.bam MODE=SUMMARY    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 12:07:33 CDT 2017] Executing as kamvarz@c3113.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT

## HISTOGRAM    java.lang.String
Error Type  Count
ERROR:INVALID_FLAG_FIRST_OF_PAIR    1
ERROR:INVALID_FLAG_SECOND_OF_PAIR   1

One of the problems here is that the reads are out of order:

> picard ValidateSamFile I=BAMS/SS.11.16_merged.bam MODE=SUMMARY
[Fri Oct 13 12:15:30 CDT 2017] picard.sam.ValidateSamFile INPUT=BAMS/SS.11.16_merged.bam MODE=SUMMARY    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 12:15:30 CDT 2017] Executing as kamvarz@c3113.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT

## HISTOGRAM    java.lang.String
Error Type  Count
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 9778
ERROR:RECORD_OUT_OF_ORDER   210645

I'm attempting to reorder these using picard's SortSam.

zkamvar commented 7 years ago

After sorting, I validated the file and got no reads that were out of order 🎉

Note: If I end up using this in my pipeline, I may sort this in query name order as opposed to coordinate because this is what FixMateInformation expects apparently ¯\_(ツ)_/¯

> picard SortSam I=BAMS/SS.11.16_merged.bam O=BAMS/SS.11.16_msort.bam SORT_ORDER=coordinate
[Fri Oct 13 12:19:54 CDT 2017] picard.sam.SortSam INPUT=BAMS/SS.11.16_merged.bam OUTPUT=BAMS/SS.11.16_msort.bam SORT_ORDER=coordinate    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 12:19:54 CDT 2017] Executing as kamvarz@c3113.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
INFO    2017-10-13 12:34:06 SortSam Finished reading inputs, merging and writing to output now.
[Fri Oct 13 12:35:44 CDT 2017] picard.sam.SortSam done. Elapsed time: 15.83 minutes.
Runtime.totalMemory()=780140544

> picard ValidateSamFile I=BAMS/SS.11.16_msort.bam MODE=SUMMARY
[Fri Oct 13 12:37:03 CDT 2017] picard.sam.ValidateSamFile INPUT=BAMS/SS.11.16_msort.bam MODE=SUMMARY    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 12:37:03 CDT 2017] Executing as kamvarz@c3113.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT

## HISTOGRAM    java.lang.String
Error Type  Count
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 9778

Of course, I still get the MISMATCH_FLAG_MATE_NEG_STRAND error, but at least this version goes into FixMateInformation now.

Stay tuned!

zkamvar commented 7 years ago

FUCKFUCKFUCKFUCKFUCKFUCK

> picard FixMateInformation I=BAMS/SS.11.16_msort.bam O=BAMS/SS.11.16_fixed_picard.bam
[Fri Oct 13 12:39:26 CDT 2017] picard.sam.FixMateInformation INPUT=[BAMS/SS.11.16_msort.bam] OUTPUT=BAMS/SS.11.16_fixed_picard.bam    ASSUME_SORTED=false ADD_MATE_CIGAR=true IGNORE_MISSING_MATES=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 12:39:26 CDT 2017] Executing as kamvarz@c3113.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
INFO    2017-10-13 12:39:26 FixMateInformation  Sorting input into queryname order.
INFO    2017-10-13 12:55:26 FixMateInformation  Sorting by queryname complete.
INFO    2017-10-13 12:55:26 FixMateInformation  Output will be sorted by coordinate
INFO    2017-10-13 12:55:26 FixMateInformation  Traversing query name sorted records and fixing up mate pair information.
INFO    2017-10-13 12:57:16 FixMateInformation  Processed     1,000,000 records.  Elapsed time: 00:01:50s.  Time for last 1,000,000:  110s.  Last read position: CP017828.1_Sclerotinia_sclerotiorum_chromosome_15_complete_sequence:907,059
INFO    2017-10-13 12:59:18 FixMateInformation  Processed     2,000,000 records.  Elapsed time: 00:03:52s.  Time for last 1,000,000:  121s.  Last read position: CP017826.1_Sclerotinia_sclerotiorum_chromosome_13_complete_sequence:1,646,323
srun: Force Terminated job 9303601
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
INFO    2017-10-13 13:01:21 FixMateInformation  Processed     3,000,000 records.  Elapsed time: 00:05:55s.  Time for last 1,000,000:  123s.  Last read position: CP017829.1_Sclerotinia_sclerotiorum_chromosome_16_complete_sequence:879,072
srun: error: Timed out waiting for job step to complete
zkamvar commented 7 years ago

Okay, after lunch and retrying this with SLURM_Array, I was able to succeed in removing the errors:

> picard ValidateSamFile I=BAMS/SS.11.16_fixed_picard.bam MODE=SUMMARY
[Fri Oct 13 14:35:15 CDT 2017] picard.sam.ValidateSamFile INPUT=BAMS/SS.11.16_fixed_picard.bam MODE=SUMMARY    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 14:35:15 CDT 2017] Executing as kamvarz@c3105.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
No errors found
[Fri Oct 13 14:36:16 CDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 1.02 minutes.
Runtime.totalMemory()=714604544
zkamvar commented 7 years ago

This issue was fixed in ecec76b and 55bb8f118b9edde830d18705d13e61026ff28671