FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
52 stars 20 forks source link

set "--dual_hybrid" did not mask reference #26

Closed skytguuu closed 4 years ago

skytguuu commented 4 years ago

Hi,

Recently, I found when I want to set two strains of mouse background (use "--dual_hybrid"), it had failed to mask the reference genome. Followed is my command: ######## ./SNPsplit_genome_preparation_v2VCF --vcf_file ~/Annotation/SNP/mouse/mm9/mgp.v2.snps.annot.reformat.vcf --reference_genome ~/Annotation/SNP/mouse/mm9/mm9 --strain C57BL6NJ --strain2 PWKPhJ --dual_hybrid --genome_build mm9 --nmasking ######## I didn't know what's the problem? Could you give me some advises?

Thanks, Best, Garen

FelixKrueger commented 4 years ago

Hi Garen,

SNPsplit was designed for the current genome build (mm10 or GRCm38) and not the old mm9 build. Couldn't you simply use the current genome for your studies?

In any case, it would however be helpful to post the actual problems or errors you are running into (what do you see on screen?) so we could try to tackle the issue. I am currently travelling in the Rocky Mountains and will probably not be able to reply again before the beginning of September I am afraid.

All the best, Felix

skytguuu commented 4 years ago

Hi Felix,

I hope you have a good holiday. Because other data I used are all mapped with mm9, to make the consistence, I need mm9 mapping. Actually, when I used SNPsplit to prepare masked reference genome, it worked well. But when I try to SNPsplit the mapped reads, it also worked well and no error reports. However, the reads it split were 0%. And lastly, I found the reason that it fail was because of non-masked reference genome.The command and report are followed: ######### SNPsplit --snp_file C57BL6NJ_PWKPhJ_SNPs_in_common.mm9.txt cell8_rep1.bam --paired ######### Reading from sorted mapping file '/home/dell/ncbi/public/sra/RNA_clean/process/cell8_rep1.sortedByName.bam' Processed 1000000 lines so far Processed 2000000 lines so far Processed 3000000 lines so far Processed 4000000 lines so far Processed 5000000 lines so far Processed 6000000 lines so far Processed 7000000 lines so far Processed 8000000 lines so far Processed 9000000 lines so far Processed 10000000 lines so far Processed 11000000 lines so far Processed 12000000 lines so far Processed 13000000 lines so far Processed 14000000 lines so far Processed 15000000 lines so far Processed 16000000 lines so far Processed 17000000 lines so far Processed 18000000 lines so far Processed 19000000 lines so far

Allele-tagging report

Processed 19609142 read alignments in total Reads were unaligned and hence skipped: 4387388 (22.37%) 15192337 reads were unassignable (77.48%) 447 reads were specific for genome 1 (0.00%) 631 reads were specific for genome 2 (0.00%) 7554527 reads did not contain one of the expected bases at known SNP positions (38.53%) 28339 contained conflicting allele-specific SNPs (0.14%)

SNP coverage report

SNP annotation file: /home/dell/Annotation/SNP/mouse/mm9/test/C57BL6NJ_PWKPhJ_SNPs_in_common.mm9.txt SNPs stored in total: 7813 N-containing reads: 7555606 non-N: 7637810 total: 19609142 Reads had a deletion of the N-masked position (and were thus called Conflicting): 28338 (0.14%) Of which had multiple deletions of N-masked positions within the same read: 4 (0.00%)

Of valid N containing reads, N was present in the list of known SNPs: 1105 (0.01%) N was not present in the list of SNPs: 15374611 (99.99%)

Finished allele-tagging for file <<< '/home/dell/ncbi/public/sra/RNA_clean/process/cell8_rep1.sortedByName.bam' >>>

Summary of parameters for SNPsplit-sort:

SNPsplit tagged infile: cell8_rep1.allele_flagged.bam Output directory: >< Parent directory: >/home/dell/Annotation/SNP/mouse/mm9/test< Samtools path: /home/dell/softwares/anaconda2/bin/samtools Output format: BAM (default) Input format: Paired-end

Input file: 'cell8_rep1.allele_flagged.bam' Writing SNPsplit-sort report to: 'cell8_rep1.SNPsplit_sort.txt' Writing unassigned reads to: 'cell8_rep1.unassigned.bam' Writing genome 1-specific reads to: 'cell8_rep1.genome1.bam' Writing genome 2-specific reads to: 'cell8_rep1.genome2.bam' Processed 1000000 lines so far Processed 2000000 lines so far Processed 3000000 lines so far Processed 4000000 lines so far Processed 5000000 lines so far Processed 6000000 lines so far Processed 7000000 lines so far Processed 8000000 lines so far Processed 9000000 lines so far

Allele-specific paired-end sorting report

Read pairs/singletons processed in total: 9162458 thereof were read pairs: 6059296 thereof were singletons: 3103162 Reads were unassignable (not overlapping SNPs): 9133396 (99.68%) thereof were read pairs: 6036055 thereof were singletons: 3097341 Reads were specific for genome 1: 422 (0.00%) thereof were read pairs: 310 thereof were singletons: 112 Reads were specific for genome 2: 596 (0.01%) thereof were read pairs: 466 thereof were singletons: 130 Reads contained conflicting SNP information: 28044 (0.31%) thereof were read pairs: 22465 thereof were singletons: 5579

Sorting finished successfully

Appending sorting report to tagging report... SNPsplit processing finished... ############################# Hence, I am confused of this problem? Could you give me a hand?

Thank you ! Best, Garen

FelixKrueger commented 4 years ago

Hi Garen,

It looks like you are feeding the wrong annotation file to SNPsplit. You will want to use the SNP positions that are in different between the two strains, the file should have a name like all_PWK_FVB_SNPs_C57BL6_NJ_reference.based_on_mm9.txt or the like. By the way, I think the Black6 NJ sub-strain is so similar to the reference sequence that it is probably not worth creating a dual hybrid with it. I am just running a test of C57BL6_NJ on its own to confirm this. Let me know how you get on.

Cheers, Felix

skytguuu commented 4 years ago

Hi Felix, Thanks for your help. I tried to use "all_SNPs_PWKPhJ_GRCm38.txt.gz", but the result has less reads to split. And you are right, Black6 is similar to the reference genome. So, I used only one SNP genome to solve it. But I am just curious about the "dual_hybrid" model, why it failed?

Thanks, Best, Garen

FelixKrueger commented 4 years ago

Hi Garen,

I have run the dual genome preparation here as a test. Here are the results for C57BL_6NJ:

SNP position summary for strain C57BL_6NJ (based on mouse genome build GRCm38)
===========================================================================

Positions read in total:        78772544

21791   SNP were homozygous. Of these:
14507   SNP were homozygous and passed high confidence filters and were thus included into the C57BL_6NJ genome

and PWK_PhJ:

SNP position summary for strain PWK_PhJ (based on mouse genome build GRCm38)
===========================================================================

Positions read in total:        78772544

20986995        SNP were homozygous. Of these:
20315523        SNP were homozygous and passed high confidence filters and were thus included into the PWK_PhJ genome

This confirms that C57BL_6NJ really only has some 14507 SNPs that differ compared to Black6 reference genome, whereas PWK has some 20.3M SNPs.

Out of 14507 SNPs, 7214 are shared between C57BL_6NJ and PWK_PhJ (C57BL_6NJ_PWK_PhJ_SNPs_in_common.GRCm38.txt), so these cannot be used to discriminate between the two strains. 20280499 were used to discriminate the two genomes (all_PWK_PhJ_SNPs_C57BL_6NJ_reference.based_on_GRCm38.txt)

Here is a more detailed breakdown:

Looked at positions from new Reference strain [C57BL_6NJ]:              14507
Compared positions from new SNP strain [PWK_PhJ]:               20315523
======================================================
SNPs were the same in Ref and SNP genome (not written out):     7214
SNPs were present in both Ref and SNP genome but had a different sequence:      84
SNPs were low confidence in one strain and thus ignored:        35019
SNPs were unique to Ref [C57BL_6NJ]:                            3068
SNPs were unique to SNP [PWK_PhJ]:                              20277347

I have just sampled 5 million reads from both C57BL_6NJ_full_sequence and PWK_PhJ_full_sequence, and confirmed with reStrainingOrder (https://github.com/FelixKrueger/reStrainingOrder) that C57BL_6NJ or PWK_PhJ come out as the top hits for the mouse strain screening process (with >99.5%).

So all in all, I can't see a reason why the dual strain approach would not work as expected. Maybe you could send me a few reads of your sample and I run it over here to what the allelic split is? I could provide you with an FTP server to put some FastQ files on (or just send some 500K reads as email attachment).

Best, Felix

skytguuu commented 4 years ago

Dear Felix, I am appreciated for your carefully check. Actually, I used the program in the file of "outdated_VCF_versions". Because the vcf file I downloaded is annotated with mm9. So, I think the problem may be the version? Am I right?

Thanks. Best, Garen

FelixKrueger commented 4 years ago

Hi Garen,

I just ran the same process for the outdated genome, and even though this is quite old the stats looked very similar:

Looked at positions from new Reference strain [C57BL6NJ]:               16373
Compared positions from new SNP strain [PWKPhJ]:                20180022
======================================================
SNPs were the same in Ref and SNP genome (not written out):     7813
SNPs were present in both Ref and SNP genome but had a different sequence:      79
SNPs were low confidence in one strain and thus ignored:        22591
SNPs were unique to Ref [C57BL6NJ]:                             3846
SNPs were unique to SNP [PWKPhJ]:                               20154174

The command line was: SNPsplit/outdated_VCF_versions/SNPsplit_genome_preparation_v2VCF --vcf mgp.v2.snps.annot.reformat.vcf.gz --strain C57BL6NJ --strain2 PWKPhJ --reference /bi/scratch/Genomes/Mouse/NCBIM37/

skytguuu commented 4 years ago

Dear Felix,

Thanks for your help. When I tried to use mgp.v2.snps.annot.reformat.vcf.gz file, it reported an error that: "[FATAL ERROR] Please ensure that the same version of the genome is used for both VCF annotations and reference genome (FastA files). Exiting...". So, I checked the reference genome and found that the chromosome name is "chr*" in the VCF, however, the fasta of mm9 genome I downloaded from UCSC genome browser is only the number. After, I revised the name of chromosome, I got the same result with yours.

The final split reads result is as followed:

Allele-specific paired-end sorting report

Read pairs/singletons processed in total: 5745097 thereof were read pairs: 3789788 thereof were singletons: 1955309 Reads were unassignable (not overlapping SNPs): 4126376 (71.82%) thereof were read pairs: 2598761 thereof were singletons: 1527615 Reads were specific for genome 1: 823625 (14.34%) thereof were read pairs: 615296 thereof were singletons: 208329 Reads were specific for genome 2: 752886 (13.10%) thereof were read pairs: 542518 thereof were singletons: 210368 Reads contained conflicting SNP information: 42210 (0.73%) thereof were read pairs: 33213 thereof were singletons: 8997

I am not sure those 10% split reads are good or not?

Thanks, Best, Garen

FelixKrueger commented 4 years ago

The percentage of reads that can be assigned allele-specifically depends on several factors: 1) Read length: Broadly speaking, the longer your reads the higher the chance of hitting a SNP. 2) The read type: Paired-end reads typically have a higher chance to be assigned allele-specifically than single-end reads. 3) The library type: Certain library types are easier to assign allele-specifically than others. WGS should best, followed by ChIP/ATAC-seq, RNA-seq is less good as it focusses on transcribed regions only (which have fewer SNPs). Worse still are bisulfite libraries as SNP positions involving C/Ts can only be used in special circumstances.

If you are looking closely, then your "10% split reads" are nearly 28% (added together), so I would say this is a good start!

I think the highest we have seen is ~25-30% for each genome for 2x125bp paired-end ChIP-seq libraries. Does this help?

FelixKrueger commented 4 years ago

Oh and I forgot, I think with the latest genome, the latest annotation, and the most up-to-date version of the SNPsplit genome preparation you might be able to squeeze out another percent or two...

skytguuu commented 4 years ago

Dear Felix,

I am appreciated for your clear explanation. It makes me sense.

Thanks, Best, Garen

FelixKrueger commented 4 years ago

Excellent. Best of luck!¬