Zero segregation for scRNAseq

GangLiTarheel commented 1 year ago

Hi I am trying to run SNPsplit on F1 hybrid mice single cell RNA-seq data. For our scRNAseq data, one read only has barcode information, one read has real sequence information used for alignment. See some example bam info:

LIG-P2-D06_RT-P4-C09,PCR-PA-01A,TTCACGAC,E13.5_L3-04    16      chr1    3620730     255     99M     *  0    0       AGGGCTCGATTCGTGGAAAGATAATGTGTGAATTTGGTTTAGTCATGGAATACTTTGGTTTCTCCATCTATGGTAATTGAAAGTTTAGCTGGGTATAGT :FFFFFF,FFFFFFFFFFFFFFFF,FFFFFF:F::FFFFFFFFFFFFFFF:FFF,FFFFFFFFFFFFF:FFFFFFFFFFFFFFF:F:FFFFF:FFFFFF     NH:i:1  HI:i:1  AS:i:96 nM:i:0

LIG-P3-C06_RT-P4-H04,PCR-PA-01A,TGTACGAG,E13.5_L4-04    16      chr1    3620730     255     98M1S   *  0    0       AGGGCTCGATTCGTGGAAAGATAATGTGTGAATTTGGTTTAGTCATGGAATACTTTGGTTTCTCCATCTATGGTAATTGAAAGTTTAGCTGGGTATAGN FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF#     NH:i:1  HI:i:1  AS:i:95 nM:i:0

I think I should use single-end model with SNPsplit. SNPsplit \ --snp_file "${snp}" \ -o "${outdir}" \ --single_end \ --conflicting However, it seems that it fails to assign any reads/detect N-containing reads. May I ask if I have to adjust other parameters to run SNPsplit for this kind of data?

Allele-tagging report
=====================
Processed 8828174 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
Reads were hard-clipped (CIGAR: H) and skipped: 0
0 reads were unassignable (0.00%)
0 reads were specific for genome 1 (0.00%)
0 reads were specific for genome 2 (0.00%)
0 reads did not contain one of the expected bases at known SNP positions (0.00%)
0 contained conflicting allele-specific SNPs (0.00%)

SNP coverage report
===================
SNP annotation file:    all_SNPs_CAST_EiJ_GRCm38.chr.txt.gz
SNPs stored in total:   20668547
N-containing reads: 0
non-N:          0
total:          8828174
Reads had a deletion of the N-masked position (and were thus called Conflicting):   0 (0.00%)
Of which had multiple deletions of N-masked positions within the same read: 0 (0.00%)

Of valid N containing reads,
N was present in the list of known SNPs:    0 (N/A%)
N was not present in the list of SNPs:      0 (N/A%)

FelixKrueger commented 1 year ago

Hi @GangLiTarheel

SNPsplit is intended to be used with alignments to an N-masked genome, and you do not seem to have aligned the data to such a genome:

N-containing reads: 0

You need to start with this step: http://felixkrueger.github.io/SNPsplit/genome_prep/genome_preparation/

then subsequent steps should also work.

GangLiTarheel commented 1 year ago

Hi Felix, Thank you for your quick responses! I just double-checked that I am aligning to an N-masked genome.

The same Nmasked .fa file works well with SNPsplit on the pair-end ATAC-seq data but with the bowtie2 aligner.
I also tried to use the ensemble mouse genome, instead of the GENCODE version, and regenerate the STAR index file, but I got the same results with 0 segregation.

Could the STAR parameter affect the SNPsplit? The only parameter might make a difference is： --outSAMstrandField intronMotif: This option specifies the output format for the strandness of the alignment. In this case, it will output the intron motif.

My current STAR aligning process: STAR --runThreadN $core --outSAMstrandField intronMotif --genomeDir $index --readFilesCommand zcat --readFilesIn $input_folder/$sample*gz --outFileNamePrefix $STAR_output_folder/$sample --genomeLoad LoadAndKeep --outReadsUnmapped Fastx

FelixKrueger commented 1 year ago

Sorry for the slow response, I am now back...

Regarding the alignment options for STAR, here are some comments about it: http://felixkrueger.github.io/SNPsplit/SNPsplit/specific_comments/#star. Specifically, it mentions that you need to instruct STAR to output the MD:Z: field (e.g. with: --outSAMattributes NH HI NM MD), which I could not see in the example you posted above. I wonder if that is the reason?

Did you use the SNPsplit genome preparation to write out N-masked sequences, and did you use those N-masked.fa files for the indexing process? There should be reports (a SNP filtering (e.g. JF1_MsJ_SNP_filtering_report.txt) and a genome preparation report (e.g. JF1_MsJ_genome_preparation_report.txt) where you should see if everything worked well):

SNP position summary for strain JF1_MsJ (based on genome build GRCm39)
===========================================================================

Positions read in total:    83212162

21416624    SNP were homozygous. Of these:
20013481    SNP were homozygous and passed high confidence filters and were thus included into the JF1_MsJ genome

Not included into JF1_MsJ genome:
58142404    had the same sequence as the reference
0       had no clearly defined alternative base
3653134     Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
1403143     were homozygous but the filtering call was low confidence

Printed a single list of all SNPs to >all_SNPs_JF1_MsJ_GRCm39.txt.gz<...

1561209 positions on chromosome 1 were changed to 'N'
1122032 positions on chromosome 10 were changed to 'N'
959608 positions on chromosome 11 were changed to 'N'
851377 positions on chromosome 12 were changed to 'N'
946731 positions on chromosome 13 were changed to 'N'
840234 positions on chromosome 14 were changed to 'N'
904815 positions on chromosome 15 were changed to 'N'
822652 positions on chromosome 16 were changed to 'N'
753564 positions on chromosome 17 were changed to 'N'
790122 positions on chromosome 18 were changed to 'N'
525431 positions on chromosome 19 were changed to 'N'
1374275 positions on chromosome 2 were changed to 'N'
1322050 positions on chromosome 3 were changed to 'N'
1207025 positions on chromosome 4 were changed to 'N'
1270735 positions on chromosome 5 were changed to 'N'
1181995 positions on chromosome 6 were changed to 'N'
1105527 positions on chromosome 7 were changed to 'N'
681900 positions on chromosome 8 were changed to 'N'
984742 positions on chromosome 9 were changed to 'N'
807457 positions on chromosome X were changed to 'N'

Summary
20013481 Ns were newly introduced into the N-masked genome for strain JF1_MsJ in total

GangLiTarheel commented 1 year ago

Hi Felix, you are absolutely right about the MD string, now it's working! For me to make it work:

use ensemble mouse genome rather than GENCODE version;
add --outSAMattributes NH HI NM MD You can close it now! Thank you for your help!

FelixKrueger commented 1 year ago

Excellent! Good luck!

FelixKrueger / SNPsplit

Zero segregation for scRNAseq #77