NCI-RBL / iCLIP

RNA Biology Pipeline to Characterize protein-RNA Interactions
https://rbl-nci.github.io/iCLIP/
MIT License

KO sample errors #81

Closed by slsevilla 3 years ago

slsevilla commented 3 years ago

KO samples are failing at samtools_cleanup.

Data dir: /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/

The complete rule is:

samtools view /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/01_alignment/KO_fCLIP.unmasked.split.12.sam | awk '{ if (($4 == 1 && $6!~/^[0-9]I/ && $1~/:/ )||($4 > 1 && $1~/:/ )) { print } }' > /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/02_cleanup/KO_fCLIP.unmasked.split.12.tmp.sam;
samtools view -H /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/01_alignment/KO_fCLIP.unmasked.split.12.sam | cat - /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/02_cleanup/KO_fCLIP.unmasked.split.12.tmp.sam > /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/02_cleanup/KO_fCLIP.unmasked.split.12.final.sam; 
samtools view -f 4 /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/01_alignment/KO_fCLIP.unmasked.split.12.sam > /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/02_cleanup/KO_fCLIP.unmasked.split.12.final.unmapped.sam;
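To make the behavior of the awk filter in the first command concrete, here is a minimal sketch that runs the same expression on three made-up, tab-delimited records. The function names, read names, positions and CIGAR strings are hypothetical, and samtools is not needed. The filter keeps records whose read name contains ":" and drops those placed at position 1 whose CIGAR starts with a single-digit insertion.

```shell
# Same awk expression as in the rule, wrapped in a function (hypothetical name).
filter_sam_body() {
  awk '{ if (($4 == 1 && $6!~/^[0-9]I/ && $1~/:/ )||($4 > 1 && $1~/:/ )) { print } }'
}

# Three made-up records with only the first six SAM columns:
# QNAME FLAG RNAME POS MAPQ CIGAR
demo_records() {
  printf 'r:1\t0\tchr1\t1\t0\t4I33M\n'    # POS 1 with a leading insertion: dropped
  printf 'r:2\t0\tchr1\t1\t0\t37M\n'      # POS 1 without a leading insertion: kept
  printf 'r:3\t0\tchr1\t500\t0\t4I33M\n'  # POS greater than 1: kept
}

# Print the names of the records that survive the filter.
demo_records | filter_sam_body | cut -f1
```

Records with POS 0 (unmapped) match neither branch of the condition, so they are dropped by this step as well.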

The error message is associated with the first step:

[W::sam_read1_sam] Parse error at line 38557574
samtools view: error reading file "/data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/01_alignment/KO_fCLIP.unmasked.split.12.sam"

When I look at this line, I'm not seeing anything that should cause this error:

(base) [sevillas2@cn0900 6-22-21-HaCaT_fCLIP]$ awk 'NR==38557574' /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/01_alignment/KO_fCLIP.unmasked.split.12.sam
VH00271:4:AAAF3KJHV:2:1311:36769:40017:rbc:ACCACGCAG 256 chr21 89882514I33M49H * 0 0 GAC

I pulled a few lines before and after, and still nothing stands out:

(base) [sevillas2@cn0900 6-22-21-HaCaT_fCLIP]$ awk 'NR>38557570 && NR<38557580' /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/01_alignment/KO_fCLIP.unmasked.split.12.sam
VH00271:4:AAAF3KJHV:2:1311:36769:40017:rbc:ACCACGCAG 0 KI270733.1 171091 0 4I33M49H * 0 0 GACTGTGAAACTGCGAATGGCTCATTAAATCAGTTAT CCCCCCCCCCCCCCCCCCCCCCBC@?=<?<ACBBDBB PG:Z:novoalign AS:i:52 UQ:i:52 NM:i:4 MD:Z:33 CC:Z:ML143377.1 CP:i:484080 ZS:Z:R NH:i:12 HI:i:1 IH:i:12
VH00271:4:AAAF3KJHV:2:1311:36769:40017:rbc:ACCACGCAG 256 ML143377.1 484080 0 4I33M49H * 0 0 GACTGTGAAACTGCGAATGGCTCATTAAATCAGTTAT CCCCCCCCCCCCCCCCCCCCCCBC@?=<?<ACBBDBB PG:Z:novoalign AS:i:52 UQ:i:52 NM:i:4 MD:Z:33 CC:Z:chr16 CP:i:34160109 ZS:Z:R NH:i:12 HI:i:2 IH:i:12
VH00271:4:AAAF3KJHV:2:1311:36769:40017:rbc:ACCACGCAG 256 chr16 34160104I33M49H * 0 0 GACTGTGAAACTGCGAATGGCTCATTAAATCAGTTAT CCCCCCCCCCCCCCCCCCCCCCBC@?=<?<ACBBDBB PG:Z:novoalign AS:i:52 UQ:i:52 NM:i:4 MD:Z:33 CC:Z:chr21 CP:i:8988251 ZS:Z:R NH:i:12 HI:i:3 IH:i:12
VH00271:4:AAAF3KJHV:2:1311:36769:40017:rbc:ACCACGCAG 256 chr21 89882514I33M49H * 0 0 GAC

I ran this step with another input file (/data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/01_alignment/KO_fCLIP.unmasked.split.11.sam) and am still getting the same error:

(base) [sevillas2@cn0900 6-22-21-HaCaT_fCLIP]$ samtools view /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/01_alignment/KO_fCLIP.unmasked.split.11.sam | awk '{ if (($4 == 1 && $6!~/^[0-9]I/ && $1~/:/ )||($4 > 1 && $1~/:/ )) { print } }' > /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/02_cleanup/KO_fCLIP.unmasked.split.11.tmp.sam
[W::sam_read1_sam] Parse error at line 42718713
samtools view: error reading file "/data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/01_alignment/KO_fCLIP.unmasked.split.11.sam"

Again, pulling lines before and after:

(base) [sevillas2@cn0900 6-22-21-HaCaT_fCLIP]$ awk 'NR>42718710 && NR<42718720' /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/04_sam/01_alignment/KO_fCLIP.unmasked.split.11.sam
VH00271:4:AAAF3KJHV:1:2501:23476:16240:rbc:GAGGACCAT 272 chr1 913872886M * 0 0 ATAGGAAGAGCCGAAATCGAAGGATCAAAAAGCAACGTCGCTATGAACGCTTGGCTGCCACAAGCCAGTTATCCCTGTGGTAACTT ;CCC?BB?B:CCC:**CC.BA2=ACB=@@@@AA338B;3?5CDC@=>@<BBBDE?ABCD>C:BCCCCCCCCCCCCCCCCCCCCCCC PG:Z:novoalign AS:i:91 UQ:i:91 NM:i:3 MD:Z:14C18G21C30 CC:Z:ML143377.1 CP:i:448415 ZS:Z:R NH:i:10 HI:i:8 IH:i:10
VH00271:4:AAAF3KJHV:1:2501:23476:16240:rbc:GAGGACCAT 256 ML143377.1 448415 0 86M * 0 0 AAGTTACCACAGGGATAACTGGCTTGTGGCAGCCAAGCGTTCATAGCGACGTTGCTTTTTGATCCTTCGATTTCGGCTCTTCCTAT CCCCCCCCCCCCCCCCCCCCCCCB:C>DCBA?EDBBB<@>=@CDC5?3;B833AA@@@@=BCA=2AB.CC**:CCC:B?BB?CCC; PG:Z:novoalign AS:i:91 UQ:i:91 NM:i:3 MD:Z:30G21C18G14 CC:Z:KI270733.1 CP:i:134590 ZS:Z:R NH:i:10 HI:i:9 IH:i:10
VH00271:4:AAAF3KJHV:1:2501:23476:16240:rbc:GAGGACCAT 256 KI270733.1 134590 0 86M * 0 0 AAGTTACCACAGGGATAACTGGCTTGTGGCAGCCAAGCGTTCATAGCGACGTTGCTTTTT
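In both excerpts the last record stops partway through the read sequence, with no quality string or tags, which would be consistent with an alignment job that had not finished writing its output. One quick way to look for this kind of truncation is to flag alignment records that have fewer than the 11 mandatory SAM columns. This is a sketch only: the function name and the two demo records are made up, and it reads plain SAM text on stdin.

```shell
# Print "line: N fields" for every non-header record with fewer than the
# 11 mandatory tab-delimited SAM columns (hypothetical helper name).
find_short_sam_records() {
  awk -F'\t' '!/^@/ && NF < 11 { print NR ": " NF " fields" }'
}

# Made-up input: one complete record, then one cut off inside the sequence.
printf 'r:1\t0\tchr1\t100\t0\t3M\t*\t0\t0\tGAC\tCCC\nr:2\t256\tchr1\t200\t0\t3M\t*\t0\t0\tGA\n' |
  find_short_sam_records
```

On a real file this could be run as `find_short_sam_records < KO_fCLIP.unmasked.split.12.sam`. It will not catch a record that is cut off after the quality column, so an empty result does not prove the file is complete.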

slsevilla commented 3 years ago

There is also an error with featureCounts.

Example input:

featureCounts -F SAF -a /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/11_SAF/WT_fCLIP_all.SAF -O -J --fraction --minOverlap 1 -s 1 -T 8 -o /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/12_counts/allreadpeaks/WT_fCLIP_uniqueCounts.txt /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/09_dedup/02_sorted/WT_fCLIP.dedup.si.bam;

featureCounts -F SAF -a /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/11_SAF/WT_fCLIP_all.SAF -M -O -J --fraction --minOverlap 1 -s 1 -T 8 -o /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/12_counts/allreadpeaks/WT_fCLIP_allFracMMCounts.txt /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/09_dedup/02_sorted/WT_fCLIP.dedup.si.bam

ERROR: no features were loaded in format SAF. The annotation format can be specified by the '-F' option.

The full error log is here: /data/RBL_NCI/Wolin/6-22-21-HaCaT_fCLIP/log/20210722_1304/31a_feature_counts_allreads.out
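A "no features were loaded" error suggests the SAF file is empty or not in the expected layout. As a rough pre-check, the sketch below counts rows that look like valid SAF features. It assumes the usual SAF layout of five tab-delimited columns (GeneID, Chr, Start, End, Strand) with a header on the first line, and a strand of "+" or "-". The function name and the demo rows are hypothetical.

```shell
# Count rows after the header that have at least five tab-delimited columns,
# integer Start and End, and a "+" or "-" strand (hypothetical helper name).
count_saf_features() {
  awk -F'\t' 'NR > 1 && NF >= 5 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[+-]$/ { n++ }
              END { print n + 0 }'
}

# Made-up input: a header, one valid row, and one space-delimited (invalid) row.
printf 'GeneID\tChr\tStart\tEnd\tStrand\npeak1\tchr1\t100\t200\t+\npeak2 chr1 300 400 -\n' |
  count_saf_features
```

A count of 0 on a real file, for example `count_saf_features < WT_fCLIP_all.SAF`, would point at the SAF generation step rather than at featureCounts itself.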

slsevilla commented 3 years ago

The samtools_cleanup error was likely due to alignment not completing. The samples have been restarted.

The SAF files did not compile, which may be causing the featureCounts error. The SAF files were recreated in an attempt to clear it.

slsevilla commented 3 years ago

The sample files had not completed alignment, which was causing the error. Changing the multimapping flag to a maximum of 10 allowed mapping to complete in 5.5 days. The samples then moved through cleanup without error.