Closed: skelviper closed this issue 3 years ago
Hi @skelviper
I think I know what is going on. The short answer is that you seem to have used the wrong genome, not the N-masked one but instead a full_sequence genome (which does not have any N-s introduced):
CAST_EiJ_C57BL_6NJ_dual_hybrid.based_on_GRCm38_full_sequence
The reference genome is already C57BL_6 of some description, so I would probably not go for a dual hybrid genome, but simply make a B6/CAST hybrid genome like so:
SNPsplit_genome_preparation --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf --reference_genome ../raw_data/ --strain CAST_EiJ
The genome you will want to index is then in the folder CAST_EiJ_N-masked (stay clear of the folder called full_sequence).
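If you want to double-check which of the two folders ended up in an index, one option is to look at the bases at known SNP positions: in the N-masked genome they should be N, whereas in the full_sequence genome they are ordinary bases. Here is a minimal Python sketch of that idea. The toy sequences and positions are made up for illustration, the helper name `fraction_masked` is hypothetical, and reading the real FASTA and SNP files is left out:

```python
def fraction_masked(sequence, snp_positions):
    """Return the fraction of 1-based SNP positions that carry an 'N'."""
    if not snp_positions:
        raise ValueError("need at least one SNP position")
    masked = sum(1 for pos in snp_positions if sequence[pos - 1].upper() == "N")
    return masked / len(snp_positions)


# Toy data, invented for illustration only (not taken from GRCm38).
snps = [3, 7]
n_masked = "ACNTGANCGT"       # SNP positions replaced by N
full_sequence = "ACGTGATCGT"  # SNP positions carry a regular base instead

print("N-masked:     ", fraction_masked(n_masked, snps))
print("full_sequence:", fraction_masked(full_sequence, snps))
```

A value close to 1 would suggest the N-masked genome, a value close to 0 the full_sequence one.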
I just went ahead and extracted the FastQ sequences from the BAM file you provided, and ran it through a quick HISAT2 pipeline against a CAST/B6 (single-) hybrid genome over here. The data is indeed nicely allele-specific:
Input file: 'test_CAST_EiJ_N-masked_GRCm38_hisat2.allele_flagged.bam'
Writing SNPsplit-sort report to: 'test_CAST_EiJ_N-masked_GRCm38_hisat2.SNPsplit_sort.txt'
Writing unassigned reads to: 'test_CAST_EiJ_N-masked_GRCm38_hisat2.unassigned.bam'
Writing genome 1-specific reads to: 'test_CAST_EiJ_N-masked_GRCm38_hisat2.genome1.bam'
Writing genome 2-specific reads to: 'test_CAST_EiJ_N-masked_GRCm38_hisat2.genome2.bam'
Allele-specific single-end sorting report
=========================================
Read alignments processed in total: 171432
Reads were unassignable: 83122 (48.49%)
Reads were specific for genome 1: 47520 (27.72%)
Reads were specific for genome 2: 39596 (23.10%)
Reads contained conflicting SNP information: 1194 (0.70)
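As a rough sanity check on a report like the one above, the four categories should add up to the total number of alignments. The following Python sketch parses the report text pasted in as a string and recomputes the percentages. The helper name `parse_counts` is hypothetical, and the parsing assumes the "label: count" layout shown above rather than any documented report format:

```python
import re

# Report lines copied from the SNPsplit-sort output above.
report = """\
Read alignments processed in total: 171432
Reads were unassignable: 83122 (48.49%)
Reads were specific for genome 1: 47520 (27.72%)
Reads were specific for genome 2: 39596 (23.10%)
Reads contained conflicting SNP information: 1194 (0.70)
"""


def parse_counts(text):
    """Map each report label to its integer read count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"(.+?):\s*(\d+)", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_counts(report)
total = counts.pop("Read alignments processed in total")

# The remaining categories should account for every processed alignment.
assert sum(counts.values()) == total

for label, n in counts.items():
    print(f"{label}: {100 * n / total:.2f}%")

g1 = counts["Reads were specific for genome 1"]
g2 = counts["Reads were specific for genome 2"]
print(f"Genome 1 share of assigned reads: {100 * g1 / (g1 + g2):.1f}%")
```

If nearly all reads land in the unassignable category, that would point back to a genome or SNP file mismatch rather than to the data.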
As a quick note, if you are using STAR in end-to-end mode, you will probably want to trim the input FastQ file first for a better mapping efficiency. Alternatively, SNPsplit should support soft-clipped reads since v0.4.0, so you might want to use soft-clipped alignments instead.
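To see whether an existing alignment actually contains soft-clipped reads, you can inspect the CIGAR strings (column 6 of the SAM records). This is a small Python sketch on made-up CIGAR strings. The helper name `has_soft_clip` is hypothetical, and pulling the CIGARs out of a real BAM file (for example with samtools view) is not shown:

```python
import re


def has_soft_clip(cigar):
    """Return True if a SAM CIGAR string contains a soft-clip (S) operation."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return any(op == "S" for _, op in operations)


# Invented example CIGARs: full match, clipped start, spliced read, clipped end.
cigars = ["50M", "5S45M", "20M1000N30M", "48M2S"]

for cigar in cigars:
    print(cigar, has_soft_clip(cigar))

clipped = sum(has_soft_clip(c) for c in cigars)
print(f"{clipped} of {len(cigars)} alignments are soft-clipped")
```

An end-to-end alignment should not show any S operations, so a non-zero count would indicate local (soft-clipping) alignment.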
Thank you very much! I will test it and if there are other problems I will reopen this issue.
Hi! SNPsplit seems like a perfect tool for allele-specific research, nice work!
Here is my question: I hope to look at RNA expression at the level of individual chromosomes, but although my reads are processed, no reads were assigned.
my code:
SNPsplit output:
I have checked my BAM and SNP files, and they seem right. I'm also confident about my mouse genotype. Here is an example from IGV:
And here is a sample BAM file and its index: example.zip
Any ideas why? Thank you very much for taking the time to help!