FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
53 stars 20 forks source link

0 Reads Actually Processed (v0.6.0) #83

Closed NikoBlank closed 4 days ago

NikoBlank commented 5 days ago

Hi there,

To put the problem up front, I'm getting empty outputs at the end of running SNPsplit. Excerpts from the log:

Allele-tagging report
=====================
Processed 64080820 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
Reads were hard-clipped (CIGAR: H) and skipped: 0
0 reads were unassignable (0.00%)
0 reads were specific for genome 1 (0.00%)
0 reads were specific for genome 2 (0.00%)
0 reads did not contain one of the expected bases at known SNP positions (0.00%)
0 contained conflicting allele-specific SNPs (0.00%)

SNP coverage report
===================
SNP annotation file:    /absolute/path/to/snpSplitMm39/all_SNPs_CAST_EiJ_GRCm39.txt.gz
SNPs stored in total:   20459869
N-containing reads:     0
non-N:                  0
total:                  64080820
Reads had a deletion of the N-masked position (and were thus called Conflicting):       0 (0.00%)
Of which had multiple deletions of N-masked positions within the same read:     0 (0.00%)

Of valid N containing reads,
N was present in the list of known SNPs:        0 (N/A%)
N was not present in the list of SNPs:          0 (N/A%)

To walk you through my pipeline, I:

  1. Manually filtered the Mouse Genomes Project v8 SNPs VCF file down to high confidence CAST/EiJ SNPs (for use in a separate pipeline)
  2. Edited the chromosome names to be chr# format using bcftools to match the UCSC mm39 genome & GENCODE annotations I use
  3. Performed SNPsplit_genome_preparation using the filtered & edited VCF file
  4. Used STAR v2.7.11b to build an index of the N-masked genome
  5. Used STAR to align my paired-end FASTQ files (trimmed with Trim Galore! v0.6.5) to the N-masked genome
  6. Performed SNPsplit with the --paired flag

I have also confirmed separately that:

  1. The genome was properly N-masked (20459869 Ns were newly introduced into the N-masked genome for strain CAST_EiJ in total from the SNPsplit_genome_preparation output)
  2. There are no unaligned reads in the original data (samtools view -c -f 4 <BAM file> outputs 0)
  3. The SNP file produced by SNPsplit_genome_preparation has the correct chromosome names (gunzip -c all_SNPs_CAST_EiJ_GRCm39.txt.gz | head outputs a first line of 7305311 chr6 73287813 1 A/G with ensuing lines also having chr#-formatted chromosome notation in column 2)

As far as I can tell, this rules out the execution errors present in #49 and #74. I don't know what else could be causing this issue of reads being "processed" but not actually processed. Any ideas?

FelixKrueger commented 4 days ago

From the report you linked it is clear that of the 64 million reads processed not a single one overlapped with a genomic N: N-containing reads: 0. In this case I don't think that there necessarily is a mismatch between the chromosome name and the annotation (which oftentimes is a usual suspect), but:

  1. either the N-masking has not worked as expected
  2. STAR does no longer support mapping to ambiguity nucleotides in the genome (similar to BWA which replaces Ns with a random nucleotide for mapping)
  3. you accidentally did not index or align to the N-masked genome

To test point 1, can you either re-run the genome preparation step and and look at the more fine grained messages on screen, which should tell you if new SNPs had been introduced correctly per chromosome. For this step, the names of the chromosomes will of course be very important, here is how I test for the names in the FastA file (which need to match up with the VCF file):

grep \> GRCm39_CAST_EiJ_N-masked.fa 
>1
>10
>11
>12
>13
>14
>15
>16
>17
>18
>19
>2
>3
>4
>5
>6
>7
>8
>9
...

You should also be able to grep through the new N-masked genome to convince yourself that the Ns are now there:

grep -v NNNN GRCm39_CAST_EiJ_N-masked.fa | grep N | head -4
AAATGAAACCCAAAGTGGAGGTCCACTAGTGAGTATAATGATGGCAGGTATTTTTGCTGTGTGTGAGAACAGTGTATAAATAAAGATACATCATCGCTNT
CTGGCCATCACTGGATAGAGAGGCCCATTGGACTTGCAAAATTTATATGCCCCAGTACAGGGGAACGCCAGGGCAAAAANTTGTGAGTGGGTGGGTAAGG
GAGTTTGGGGGGGGGTATTGGGGACTTTTGGGATAGCATTGAAACTGTAAATGAGGGAAATACCTAATAAAAAATTAAAAAAAATACAAANACCATTGAT
TTACAGAAGAAGATAGTGAATGNTCTCTCTATGCAAACTCCTTTTCTTAGTTTATCTAAAAGTATCTGGACCACTATCCTGTACATTCATAAAATTTAAA

(grep -v NNNN ensures that you don't display lines for the 3 million Ns at starts of chromosomes where the sequence is unclear...).

This counts the number of lines with introduced Ns (won't be 20 million as there may be several Ns per line):

grep -v NNNN GRCm39_CAST_EiJ_N-masked.fa | grep N | wc -l
11459761

Once you have confirmed that this is working, reading up on the STAR behaviour would be the next thing. As a quick check you could as a small subset of your data and use Bowtie2 or something similar on the data if it is STAR itself.

LEt me know how you get on.

NikoBlank commented 4 days ago

To confirm, the expectation if a read doesn't overlap an N is that it simply doesn't get counted by any step of the process? Because the report also says non-N: 0...

I'm fairly certain the N-masking has worked as expected; to pull an excerpt from the log,

Processing chromosome chr12 (for strain CAST_EiJ)
Reading SNPs from file /absolute/path/to/snpSplitMm39/SNPs_CAST_EiJ/chrchr12.txt
Writing modified chromosome (N-masking)
Writing N-masked output to: /absolute/path/to/snpSplitMm39/CAST_EiJ_N-masked/chrchr12.N-masked.fa
957482 SNPs total for chromosome chr12
957482 positions on chromosome chr12 were changed to 'N'

and looking at the chromosome 12 FASTA file,

$ grep -v "NNNN" CAST_EiJ_N-masked/chrchr12.N-masked.fa | grep "N" | wc -l
534663

so that at least eliminates point 1 as a possible error. For point 3, I don't want to be the fool that brazenly says "I definitely haven't done this" and then discovers there's a typo they missed, so here are the commands for STAR indexing & alignment and the files in snpSplitMm39/CAST_EiJ_N-masked:

STAR --runThreadN 4 --runMode genomeGenerate --genomeDir /absolute/path/to/snpSplitMm39/CAST_EiJ_N-masked/ --genomeFastaFiles /absolute/path/to/snpSplitMm39/CAST_EiJ_N-masked/*.fa --sjdbGTFfile /absolute/path/to/snpSplitMm39/CAST_EiJ_N-masked/../../mm39/annotations/gencode.vM36.basic.annotation.gtf --sjdbOverhang 100

---

STAR --runThreadN 4 --genomeDir /absolute/path/to/snpSplitMm39/CAST_EiJ_N-masked/ --readFilesIn /absolute/path/to/project/directory/trimmedReads/E10.5_BxC_embryo_7_r1_val_1.fq /absolute/path/to/project/directory/trimmedReads/E10.5_BxC_embryo_7_r2_val_2.fq --outSAMtype BAM Unsorted --outFileNamePrefix /absolute/path/to/project/directory/snpSplitAlignedFiles/E10.5_BxC_embryo_7_ --outSAMattrRGline ID:job1 PU:uk SM:E10.5_BxC_embryo_7 PL:illumina LB:uk

---

$ ls snpSplitMm39/CAST_EiJ_N-masked/
chrchr10.N-masked.fa                   chrchr1_GL456221v1_random.N-masked.fa  chrchr5.N-masked.fa                    chrchrUn_GL456372v1.N-masked.fa  chrchrUn_JH584304v1.N-masked.fa        exonGeTrInfo.tab
chrchr11.N-masked.fa                   chrchr1_GL456239v1_random.N-masked.fa  chrchr6.N-masked.fa                    chrchrUn_GL456378v1.N-masked.fa  chrchrUn_MU069435v1.N-masked.fa        exonInfo.tab
chrchr12.N-masked.fa                   chrchr1_MU069434v1_random.N-masked.fa  chrchr7_GL456219v1_random.N-masked.fa  chrchrUn_GL456379v1.N-masked.fa  chrchrX_GL456233v2_random.N-masked.fa  geneInfo.tab
chrchr13.N-masked.fa                   chrchr1.N-masked.fa                    chrchr7.N-masked.fa                    chrchrUn_GL456381v1.N-masked.fa  chrchrX.N-masked.fa                    Genome
chrchr14.N-masked.fa                   chrchr2.N-masked.fa                    chrchr8.N-masked.fa                    chrchrUn_GL456382v1.N-masked.fa  chrchrY_JH584300v1_random.N-masked.fa  genomeParameters.txt
chrchr15.N-masked.fa                   chrchr3.N-masked.fa                    chrchr9.N-masked.fa                    chrchrUn_GL456383v1.N-masked.fa  chrchrY_JH584301v1_random.N-masked.fa  SA
chrchr16.N-masked.fa                   chrchr4_JH584295v1_random.N-masked.fa  chrchrM.N-masked.fa                    chrchrUn_GL456385v1.N-masked.fa  chrchrY_JH584302v1_random.N-masked.fa  SAindex
chrchr17.N-masked.fa                   chrchr4.N-masked.fa                    chrchrUn_GL456359v1.N-masked.fa        chrchrUn_GL456387v1.N-masked.fa  chrchrY_JH584303v1_random.N-masked.fa  sjdbInfo.txt
chrchr18.N-masked.fa                   chrchr5_GL456354v1_random.N-masked.fa  chrchrUn_GL456360v1.N-masked.fa        chrchrUn_GL456389v1.N-masked.fa  chrchrY.N-masked.fa                    sjdbList.fromGTF.out.tab
chrchr19.N-masked.fa                   chrchr5_JH584296v1_random.N-masked.fa  chrchrUn_GL456366v1.N-masked.fa        chrchrUn_GL456390v1.N-masked.fa  chrLength.txt                          sjdbList.out.tab
chrchr1_GL456210v1_random.N-masked.fa  chrchr5_JH584297v1_random.N-masked.fa  chrchrUn_GL456367v1.N-masked.fa        chrchrUn_GL456392v1.N-masked.fa  chrNameLength.txt                      transcriptInfo.tab
chrchr1_GL456211v1_random.N-masked.fa  chrchr5_JH584298v1_random.N-masked.fa  chrchrUn_GL456368v1.N-masked.fa        chrchrUn_GL456394v1.N-masked.fa  chrName.txt
chrchr1_GL456212v1_random.N-masked.fa  chrchr5_JH584299v1_random.N-masked.fa  chrchrUn_GL456370v1.N-masked.fa        chrchrUn_GL456396v1.N-masked.fa  chrStart.txt

I've also checked, chrName.txt has all of the chromosomes in chr# format (and thus so do all of the FASTA files):

chr10
chr11
chr12
chr13
chr14
chr15
chr16
...

All of which I guess leaves point 2 as a possibility, unless there's something else we've missed. I don't see anything in the STAR documentation that says it skips mapping to ambiguous nucleotides, nor do I see any settings that would allow you to alter such behavior. What's the expected behavior of STAR, such that SNPsplit can ID reads spanning an N?

I'll also try doing an alignment with Bowtie2, see if that output works as expected.

NikoBlank commented 4 days ago

Okay, just finished running through the pipeline with Bowtie2 and it seems to have worked properly:

Allele-tagging report
=====================
Processed 79097426 read alignments in total
Reads were unaligned and hence skipped: 16770036 (21.20%)
48709264 reads were unassignable (61.58%)
6271474 reads were specific for genome 1 (7.93%)
6792763 reads were specific for genome 2 (8.59%)
381508 reads did not contain one of the expected bases at known SNP positions (0.48%)
553889 contained conflicting allele-specific SNPs (0.70%)

SNP coverage report
===================
SNP annotation file:    /ix1/mmann/chrisPai/snpSplitMm39/all_SNPs_CAST_EiJ_GRCm39.txt.gz
SNPs stored in total:   20459869
N-containing reads:     13955456
non-N:                  48327756
total:                  79097426
Reads had a deletion of the N-masked position (and were thus dropped):  44178 (0.06%)
Of which had multiple deletions of N-masked positions within the same read:     94

Of valid N containing reads,
N was present in the list of known SNPs:        22726041 (98.77%)
N was not present in the list of SNPs:          281952 (1.23%)

I would rather not do Yet Another Alignment for all of my FASTQ files if I can help it, though, especially since I don't know why it works with Bowtie2 but not STAR (also just, I don't really like working with Bowtie2). If you have ideas for things I should check or try, let me know.

FelixKrueger commented 4 days ago

Now that's interesting... At least it shows that all the hard work you put in at the start of the workflow was all fine!

Here is what I had noted down for STAR alignments (which I believe was confirmed by a group in Oxford some years ago):

https://felixkrueger.github.io/SNPsplit/SNPsplit/specific_comments/#star

Can you just take a few reads (e.g. 1M) of one sample and try running the alignments using --outSAMattributes NH HI NM MD --outSAMtype BAM Unsorted?

I appreciate that you don't really want to go through the alignments again, but I am afraid as it stands the BAM files don't appear to have information on the N-masked positions :( You could potentially run some sort of pileup on the data and evaluate the output at the high confidence position, but this will probably also take some time...

If this is RNA-seq data then making it work with STAR, or using HISAT2 are probably your best options, but if you are no friend of Bowtie2 then I am sure this will extend to HISAT2....

NikoBlank commented 4 days ago

Ohhhhh okay, there's the issue! I missed the specific comments page while reading the docs, so I didn't see that I needed to add an --outSAMattributes argument. I'll redo my alignments and make sure to include the MD tag this time. I'll report back when it's done to confirm, but I suspect this has solved the mystery.

NikoBlank commented 4 days ago

Confirmed, it's now working properly:

Allele-tagging report
=====================
Processed 64080820 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
Reads were hard-clipped (CIGAR: H) and skipped: 0
52807231 reads were unassignable (82.41%)
5161771 reads were specific for genome 1 (8.06%)
6008636 reads were specific for genome 2 (9.38%)
101627 reads did not contain one of the expected bases at known SNP positions (0.16%)
103182 contained conflicting allele-specific SNPs (0.16%)

SNP coverage report
===================
SNP annotation file:    /ix1/mmann/chrisPai/snpSplitMm39/all_SNPs_CAST_EiJ_GRCm39.txt.gz
SNPs stored in total:   20459869
N-containing reads:     11355946
non-N:                  52705604
total:                  64080820
Reads had a deletion of the N-masked position (and were thus called Conflicting):       19270 (0.03%)
Of which had multiple deletions of N-masked positions within the same read:     4 (0.00%)

Of valid N containing reads,
N was present in the list of known SNPs:        17962722 (99.61%)
N was not present in the list of SNPs:          69850 (0.39%)

Thanks for the assistance in troubleshooting!

FelixKrueger commented 4 days ago

Excellent, glad it works now!