Question about allele_flagged.bam and genome1.bam/genome2.bam files

xixueyu96 commented 5 months ago

Hi Felix,

Thank you for your work on this project. I have a question regarding the different BAM files generated by SNPsplit, specifically the allele_flagged.bam file and the genome1.bam/genome2.bam files.

Could you please clarify the following:

What are the main differences between the allele_flagged.bam file and the genome1.bam/genome2.bam files?
Are the reads in allele_flagged.bam essentially a combination of the reads from genome1.bam and genome2.bam?

Additionally, I have noticed that there are some reads in allele_flagged.bam that appear to have the XX tag with the value G1 or G2, but these reads do not seem to be present in either genome1.bam or genome2.bam. Could you explain why this is the case?

Thanks,

Sherry

FelixKrueger commented 5 months ago

1) The allele_flagged file is more or less equivalent to the BAM used as input, but it carries an additional tag that states whether a read can be assigned to a specific allele, is unassignable, or is conflicting (see here). The genome1/genome2 files are the outcome of the sorting process (see more here).

2) If you have paired-end reads where read1 appears to be specific for genome 1, but read2 is specific for genome 2, the read pair is classified as 'conflicting' and is not written out by default.
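
For readers who want to check this on their own data, here is a minimal Python sketch of the pair-level logic described above. It is not SNPsplit's actual code (SNPsplit is written in Perl), and the helper names `allele_tag` and `classify_pair` are hypothetical. It assumes the flag is stored as an `XX:Z:` tag with the values G1, G2, UA (unassignable) or CF (conflicting), and that a pair goes to a genome when at least one mate is specific for that genome and the other mate does not contradict it.

```python
import re

# Assumed tag layout: XX:Z:<value>, where value is G1, G2, UA or CF.
XX_TAG = re.compile(r"\tXX:Z:(G1|G2|UA|CF)(?:\t|$)")


def allele_tag(sam_line):
    """Return the XX:Z value of one SAM text record, or None if it has no such tag."""
    match = XX_TAG.search(sam_line.rstrip("\n"))
    return match.group(1) if match else None


def classify_pair(tag1, tag2):
    """Classify a read pair from the per-read tags of its two mates.

    Assumed rules (a sketch, not the tool's source):
      - any CF mate, or one G1 mate plus one G2 mate -> "conflicting"
      - otherwise at least one G1 mate               -> "genome1"
      - otherwise at least one G2 mate               -> "genome2"
      - everything else                              -> "unassigned"
    """
    tags = {tag1, tag2}
    if "CF" in tags or {"G1", "G2"} <= tags:
        return "conflicting"
    if "G1" in tags:
        return "genome1"
    if "G2" in tags:
        return "genome2"
    return "unassigned"


# Two made-up mates of one pair: each read carries its own tag in the flagged file.
read1 = "pairA\t99\tchr1\t100\t40\t4M\t=\t200\t104\tACGT\tIIII\tXX:Z:G1"
read2 = "pairA\t147\tchr1\t200\t40\t4M\t=\t100\t-104\tTTGA\tIIII\tXX:Z:G2"

pair_class = classify_pair(allele_tag(read1), allele_tag(read2))
print(pair_class)
```

Under these assumptions, both mates keep their individual G1 or G2 tag in allele_flagged.bam, while the pair as a whole is classified as conflicting and so appears in neither genome1.bam nor genome2.bam.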

xixueyu96 commented 5 months ago

Hi Felix,

Thank you so much for your prompt and detailed response to my question. I realized that I had some misunderstandings regarding the CF tag. I initially thought that in the case of paired-end reads where the two reads are marked as coming from different parents, both reads would carry the CF tag. I now understand that these read pairs are filtered out when the reads are divided into genome1.bam and genome2.bam.

Thanks again for your help. I will close this issue now.

Sherry

FelixKrueger / SNPsplit

Question about allele_flagged.bam and genome1.bam/genome2.bam files #82