Closed zkamvar closed 7 years ago
Linkies:
This discussion may have some relevance:
I think the problem might be originating from here:
If I look at the header for the bam file for the merged reads, I get the label SS.11.10_P
:
kamvarz@c2419 09:36:23/work/everhartlab/kamvarz/ssc-processing (master)> samtools view -H BAMS/SS.11.10_fixed.bam | grep '@RG'
@RG ID:HF75NBBXX.2 SM:SS.11.10_P PL:ILLUMINA LB:RUN.265
I'm going to try and use the same basename to see if that works
Dammit, i should have said "address #11"
Okay, so fixing that did not address the situation:
> samtools view BAMS/SS.11.16_fixed.bam | grep 'K00133:265:HF75NBBXX:2:1101:1113:12251'
K00133:265:HF75NBBXX:2:1101:1113:12251 80 CP017822.1_Sclerotinia_sclerotiorum_chromosome_9_complete_sequence 690388 42 70M * 0 0ATCACTACTAATCCATGAAGAGGATTATTACTCAGATGAAGAAGATGTGGGATTTATAGATGTCCCTGCG FJA7JJJJFA<JJJFJFFAJJJJFJJJJFFJJJFJFJFJFJFAJFJJJJJJFJJJJJJJFJJJFJFFFAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:70 YS:i:-3 YT:Z:DP RG:Z:HF75NBBXX.2
K00133:265:HF75NBBXX:2:1101:1113:12251 128 CP017829.1_Sclerotinia_sclerotiorum_chromosome_16_complete_sequence 761051 42 51M * 0 0TGTAGTTGCCATTTATTAACATACTTCTAATTAGATCCTTCTTGAATCGCC AAAFF<FJFJJFJJJ<FJJJJ7FFFFFFFJJJJJJAFJ<7FAAFJJAFA-F AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:49T1 YS:i:0 YT:Z:DP RG:Z:HF75NBBXX.2
Here's the weird part, the two reads map to different chromosomes and, for the record, they are each others' pairs:
> zcat TRIM/SS.11.16_1P.fq.gz | grep -n K00133:265:HF75NBBXX:2:1101:1113:12251
73101:@K00133:265:HF75NBBXX:2:1101:1113:12251 1:N:0
> zcat TRIM/SS.11.16_2P.fq.gz | grep -n K00133:265:HF75NBBXX:2:1101:1113:12251
73101:@K00133:265:HF75NBBXX:2:1101:1113:12251 2:N:0
And they don't exist in the unpaired files, thus... these reads... are... not... unpaired?! 🤔 😡
so the link above has this quote:
That being said, the mixed read type BAM should have different read groups per data type. You can double-check this in the file header by grepping for the @RG lines
I'm trying to run samtools merge
without the -c
flag to see if this clears things up
[ narrator ] It did not clear things up
> picard ValidateSamFile I=BAMS/SS.11.16_fixed.bam MODE=SUMMARY
[Fri Oct 13 12:07:33 CDT 2017] picard.sam.ValidateSamFile INPUT=BAMS/SS.11.16_fixed.bam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 12:07:33 CDT 2017] Executing as kamvarz@c3113.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
## HISTOGRAM java.lang.String
Error Type Count
ERROR:INVALID_FLAG_FIRST_OF_PAIR 1
ERROR:INVALID_FLAG_SECOND_OF_PAIR 1
One of the problems here is that the reads are out of order:
> picard ValidateSamFile I=BAMS/SS.11.16_merged.bam MODE=SUMMARY
[Fri Oct 13 12:15:30 CDT 2017] picard.sam.ValidateSamFile INPUT=BAMS/SS.11.16_merged.bam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 12:15:30 CDT 2017] Executing as kamvarz@c3113.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
## HISTOGRAM java.lang.String
Error Type Count
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 9778
ERROR:RECORD_OUT_OF_ORDER 210645
I'm attempting to reorder these using picard's SortSam
.
After sorting, I validated the file and got no reads that were out of order 🎉
Note: If I end up using this in my pipeline, I may sort this in query name order as opposed to coordinate because this is what
FixMateInformation
expects apparently ¯\_(ツ)_/¯
> picard SortSam I=BAMS/SS.11.16_merged.bam O=BAMS/SS.11.16_msort.bam SORT_ORDER=coordinate
[Fri Oct 13 12:19:54 CDT 2017] picard.sam.SortSam INPUT=BAMS/SS.11.16_merged.bam OUTPUT=BAMS/SS.11.16_msort.bam SORT_ORDER=coordinate VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 12:19:54 CDT 2017] Executing as kamvarz@c3113.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
INFO 2017-10-13 12:34:06 SortSam Finished reading inputs, merging and writing to output now.
[Fri Oct 13 12:35:44 CDT 2017] picard.sam.SortSam done. Elapsed time: 15.83 minutes.
Runtime.totalMemory()=780140544
> picard ValidateSamFile I=BAMS/SS.11.16_msort.bam MODE=SUMMARY
[Fri Oct 13 12:37:03 CDT 2017] picard.sam.ValidateSamFile INPUT=BAMS/SS.11.16_msort.bam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 12:37:03 CDT 2017] Executing as kamvarz@c3113.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
## HISTOGRAM java.lang.String
Error Type Count
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 9778
Of course, I still get the MISMATCH_FLAG_MATE_NEG_STRAND
error, but at least this version goes into FixMateInformation
now.
Stay tuned!
FUCKFUCKFUCKFUCKFUCKFUCK
> picard FixMateInformation I=BAMS/SS.11.16_msort.bam O=BAMS/SS.11.16_fixed_picard.bam
[Fri Oct 13 12:39:26 CDT 2017] picard.sam.FixMateInformation INPUT=[BAMS/SS.11.16_msort.bam] OUTPUT=BAMS/SS.11.16_fixed_picard.bam ASSUME_SORTED=false ADD_MATE_CIGAR=true IGNORE_MISSING_MATES=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 12:39:26 CDT 2017] Executing as kamvarz@c3113.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
INFO 2017-10-13 12:39:26 FixMateInformation Sorting input into queryname order.
INFO 2017-10-13 12:55:26 FixMateInformation Sorting by queryname complete.
INFO 2017-10-13 12:55:26 FixMateInformation Output will be sorted by coordinate
INFO 2017-10-13 12:55:26 FixMateInformation Traversing query name sorted records and fixing up mate pair information.
INFO 2017-10-13 12:57:16 FixMateInformation Processed 1,000,000 records. Elapsed time: 00:01:50s. Time for last 1,000,000: 110s. Last read position: CP017828.1_Sclerotinia_sclerotiorum_chromosome_15_complete_sequence:907,059
INFO 2017-10-13 12:59:18 FixMateInformation Processed 2,000,000 records. Elapsed time: 00:03:52s. Time for last 1,000,000: 121s. Last read position: CP017826.1_Sclerotinia_sclerotiorum_chromosome_13_complete_sequence:1,646,323
srun: Force Terminated job 9303601
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
INFO 2017-10-13 13:01:21 FixMateInformation Processed 3,000,000 records. Elapsed time: 00:05:55s. Time for last 1,000,000: 123s. Last read position: CP017829.1_Sclerotinia_sclerotiorum_chromosome_16_complete_sequence:879,072
srun: error: Timed out waiting for job step to complete
Okay, after lunch and retrying this with SLURM_Array, I was able to succeed in removing the errors:
> picard ValidateSamFile I=BAMS/SS.11.16_fixed_picard.bam MODE=SUMMARY
[Fri Oct 13 14:35:15 CDT 2017] picard.sam.ValidateSamFile INPUT=BAMS/SS.11.16_fixed_picard.bam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Oct 13 14:35:15 CDT 2017] Executing as kamvarz@c3105.tusker.hcc.unl.edu on Linux 2.6.32-696.10.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
No errors found
[Fri Oct 13 14:36:16 CDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 1.02 minutes.
Runtime.totalMemory()=714604544
This issue was fixed in ecec76b and 55bb8f118b9edde830d18705d13e61026ff28671
I set a run last night and got the following errors marking duplicates in BAM files:
This is clearly from the inclusion of unpaired reads.