Output genome files contain UA flags

GuidoBarzaghi commented 4 years ago

Hello!

First of all thanks for your great work with this tool, it's helping a lot my analyses! I was hoping you could help me figure out what am I doing wrong while using SNPsplit: for some reason many of the reads in the output genome1 and genome2 files are flagged as unassigned (UA). Is that suppose to be the case?

Here is my command line:

SNPsplit INPUT_FILE \ --snp_file SNP_FILE \ --paired \ --bisulfite \ --conflicting

The INPUT_FILE is the output from a bismark alignment, while the SNP_FILE is the all_STRAIN2_SNPs_STRAIN1_reference.based_on_GRCm38.txt generated during a SNPsplit_genome_preparation run.

Those are the percentages resulting from this run: Reads were unassignable (not overlapping SNPs): 49377695 (48.29%) Reads were specific for genome 1: 28090395 (27.47%) Reads were specific for genome 2: 23789145 (23.26%) Reads contained conflicting SNP information: 998916 (0.98%)

and in fact my genome1.bam file contains more than double 28090395 reads.

Is that anything you encountered before?

Thanks a lot for your support!

FelixKrueger commented 4 years ago

Hi Guido,

I am hoping that the explanation for this is really straight forward: in paired-end mode, reads are not considered individually but as a read pair (read pairs follow each other in the Bismark output file, so one line will be Read 1, the next Read 2, and so on).

A read pair can be specific to genome 1 if the R1/R2 tags are either G1/G1, G1/UA or UA/G1 , respectively. In a single-end scenario, this should indeed not occur.

So all in all, it might all be fine, wouldn't you agree?

GuidoBarzaghi commented 4 years ago

Oh I see, that makes perfect sense.

Thanks a lot!

FelixKrueger / SNPsplit

Output genome files contain UA flags #32