Why all my reads are unassigned

skelviper commented 3 years ago

Hi! SNPsplit seems like a perfect tool for allele specific research, nice work!

Here is my question: I hope to look at RNA expression at individual chromosome, but although my reads are processed, no read were assigned.

my code:

#get SNPsplit format snp from sanger vcf
SNPsplit_genome_preparation --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf --reference_genome ../raw_data/ --strain CAST_EiJ --strain2 C57BL_6NJ

#generate star index
STAR --runMode genomeGenerate --runThreadN 80 --genomeDir ./CAST_B6_star_index --genomeFastaFiles CAST_B6/CAST_EiJ_C57BL_6NJ_dual_hybrid.based_on_GRCm38_full_sequence/chr*.fa

#map RNA reads to ref genome
STAR --runThreadN 20 --genomeDir ./CAST_B6_star_index --readFilesIn RNA_all/umibycell.E752003.rna.R1.fq --outFileNamePrefix starAlign/star. --outSAMtype BAM Unsorted --alignEndsType EndToEnd --outSAMattributes NH HI NM MD --outReadsUnmapped Fastx

#use SNP_split
SNPsplit --snp_file ./CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt ./starAlign/star.Aligned.out.bam -o ./res

SNPsplit output:

Created output directory ./res/!

Output will be written into the directory: /shareb/zliu/test/RNAsplit/res/
Testing if input file './starAlign/star.sort.bam' looks like a Bisulfite-Seq file
Treating file(s) as single-end data (as extracted from @PG line)

Treating file(s) as single-end data (as extracted from @PG line)

Treating file(s) as single-end data (as extracted from @PG line)

Now starting to process file <<< './starAlign/star.sort.bam' >>>

Input file:                                     'star.sort.bam'
Writing SNPplit-tag report to:                  'star.sort.SNPsplit_report.txt'
Writing allele-flagged output file to:          'star.sort.allele_flagged.bam'

Summary of parameters for SNPsplit-tag:
========================================
SNPsplit infile:                ./starAlign/star.sort.bam
SNP annotation file:            ./CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt
Output directory:               >/shareb/zliu/test/RNAsplit/res/<
Parent directory:               >/shareb/zliu/test/RNAsplit<
Samtools path:                  /share/home/zliu/miniconda3/envs/chip/bin/samtools
Output format:                  BAM (default)
Input format:                   Single-End

File specified as single-end BAM file
Storing SNP positions provided in './CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt'
Stored 20635313 positions in total

Reading from sorted mapping file './starAlign/star.sort.bam'

Allele-tagging report
=====================
Processed 195953 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
Reads were hard-clipped (CIGAR: H) and skipped: 0
195953 reads were unassignable (100.00%)
0 reads were specific for genome 1 (0.00%)
0 reads were specific for genome 2 (0.00%)
0 reads did not contain one of the expected bases at known SNP positions (0.00%)
0 contained conflicting allele-specific SNPs (0.00%)

SNP coverage report
===================
SNP annotation file:    ./CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt
SNPs stored in total:   20635313
N-containing reads:     0
non-N:                  195953
total:                  195953
Reads had a deletion of the N-masked position (and were thus called Conflicting):       0 (0.00%)
Of which had multiple deletions of N-masked positions within the same read:     0 (0.00%)

Of valid N containing reads,
N was present in the list of known SNPs:        0 (N/A%)
N was not present in the list of SNPs:          0 (N/A%)

Finished allele-tagging for file <<< './starAlign/star.sort.bam' >>>

Summary of parameters for SNPsplit-sort:
========================================
SNPsplit tagged infile:         star.sort.allele_flagged.bam
Output directory:               >/shareb/zliu/test/RNAsplit/res/<
Parent directory:               >/shareb/zliu/test/RNAsplit<
Samtools path:                  /share/home/zliu/miniconda3/envs/chip/bin/samtools
Output format:                  BAM (default)
Input format:                   Single-End

Now processing input file <<< star.sort.allele_flagged.bam >>>

Writing YAML report to star.sort.SNPsplit_sort.yaml

Input file:                                             'star.sort.allele_flagged.bam'
Writing SNPsplit-sort report to:                        'star.sort.SNPsplit_sort.txt'
Writing unassigned reads to:                            'star.sort.unassigned.bam'
Writing genome 1-specific reads to:                     'star.sort.genome1.bam'
Writing genome 2-specific reads to:                     'star.sort.genome2.bam'

![image-20210703130704301](https://user-images.githubusercontent.com/16440619/124344013-8c7d4f80-dc02-11eb-8ed9-805019504a40.png)
![image-20210703130819568](https://user-images.githubusercontent.com/16440619/124344016-8e471300-dc02-11eb-9905-49ef8ba24d28.png)

Allele-specific single-end sorting report
=========================================
Read alignments processed in total:             195953
Reads were unassignable:                        195953 (100.00%)
Reads were specific for genome 1:               0 (0.00%)
Reads were specific for genome 2:               0 (0.00%)
Reads contained conflicting SNP information:    0 (0.00)

Sorting finished successfully

SNPsplit processing finished... [Allele-tagging + tag2sort]

I have checked my bam and snp file, they seems right. And I'm confident about my mouse genotype. Here is an example from igv:

And here are sample bam file and it's index: example.zip

Any ideas why? Thank you very much for taking the time to help！

FelixKrueger commented 3 years ago

Hi @skelviper

I think I know what is going on. The short answer is that you seem to have used the wrong genome, not the N-masked one but instead a full_sequence genome (which does not have any N-s introduced):

CAST_EiJ_C57BL_6NJ_dual_hybrid.based_on_GRCm38_full_sequence

The reference genome is already C57BL_6 of some description, so I would probably not go for a dual hybrid genome, but simply make a B6/CAST hybrid genome like so:

SNPsplit_genome_preparation --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf --reference_genome ../raw_data/ --strain CAST_EiJ

The genome you will want to index is then in the folder CAST_EiJ_N-masked (stay clear of the folder called full_sequence).

I just went ahead and extracted the FastQ sequences from the BAM file you provided, and ran it through a quick HISAT2 pipeline against a CAST/B6 (single-) hybrid genome over here. The data is indeed nicely allele-specific:

Input file:                                             'test_CAST_EiJ_N-masked_GRCm38_hisat2.allele_flagged.bam'
Writing SNPsplit-sort report to:                        'test_CAST_EiJ_N-masked_GRCm38_hisat2.SNPsplit_sort.txt'
Writing unassigned reads to:                            'test_CAST_EiJ_N-masked_GRCm38_hisat2.unassigned.bam'
Writing genome 1-specific reads to:                     'test_CAST_EiJ_N-masked_GRCm38_hisat2.genome1.bam'
Writing genome 2-specific reads to:                     'test_CAST_EiJ_N-masked_GRCm38_hisat2.genome2.bam'

Allele-specific single-end sorting report
=========================================
Read alignments processed in total:             171432
Reads were unassignable:                        83122 (48.49%)
Reads were specific for genome 1:               47520 (27.72%)
Reads were specific for genome 2:               39596 (23.10%)
Reads contained conflicting SNP information:    1194 (0.70)

As a quick note, if you are using STAR in end-to-end mode, you will probably want to trim the input FastQ file first for a better mapping efficiency. Also, since v0.4.0 SNPsplit should also support soft-clipped reads, so you might want to use that instead.

skelviper commented 3 years ago

Hi @skelviper

I think I know what is going on. The short answer is that you seem to have used the wrong genome, not the N-masked one but instead a full_sequence genome (which does not have any N-s introduced):
CAST_EiJ_C57BL_6NJ_dual_hybrid.based_on_GRCm38_full_sequence
The reference genome is already C57BL_6 of some description, so I would probably not go for a dual hybrid genome, but simply make a B6/CAST hybrid genome like so:
SNPsplit_genome_preparation --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf --reference_genome ../raw_data/ --strain CAST_EiJ
The genome you will want to index is then in the folder CAST_EiJ_N-masked (stay clear of the folder called full_sequence).

I just went ahead and extracted the FastQ sequences from the BAM file you provided, and ran it through a quick HISAT2 pipeline against a CAST/B6 (single-) hybrid genome over here. The data is indeed nicely allele-specific:
Input file:                                             'test_CAST_EiJ_N-masked_GRCm38_hisat2.allele_flagged.bam'
Writing SNPsplit-sort report to:                        'test_CAST_EiJ_N-masked_GRCm38_hisat2.SNPsplit_sort.txt'
Writing unassigned reads to:                            'test_CAST_EiJ_N-masked_GRCm38_hisat2.unassigned.bam'
Writing genome 1-specific reads to:                     'test_CAST_EiJ_N-masked_GRCm38_hisat2.genome1.bam'
Writing genome 2-specific reads to:                     'test_CAST_EiJ_N-masked_GRCm38_hisat2.genome2.bam'

Allele-specific single-end sorting report
=========================================
Read alignments processed in total:             171432
Reads were unassignable:                        83122 (48.49%)
Reads were specific for genome 1:               47520 (27.72%)
Reads were specific for genome 2:               39596 (23.10%)
Reads contained conflicting SNP information:    1194 (0.70)
As a quick note, if you are using STAR in end-to-end mode, you will probably want to trim the input FastQ file first for a better mapping efficiency. Also, since v0.4.0 SNPsplit should also support soft-clipped reads, so you might want to use that instead.

Thank you very much! I will test it and if there are other problems I will reopen this issue.

FelixKrueger / SNPsplit

Why all my reads are unassigned #49