FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
51 stars 19 forks source link

Genome preparation:129S1 (paternal) mice × B6D2F1/J (F1 of C57BL6NJ × DBA2J, maternal) mice #71

Closed lelesama closed 1 year ago

lelesama commented 1 year ago

Hello,

I have embryos from 129S1 (paternal) mice × B6D2F1/J (F1 of C57BL6NJ × DBA2J, maternal) mice. How can I make the genome preparation? Is SNPsplit_genome_preparation --vcf mgp.v5.merged.snps_all.dbSNP142.vcf.gz --reference /.../Genomes/Mouse/GRCm38/--strain DBA2J --strain2 129S1 ok ?

FelixKrueger commented 1 year ago

Hmm, this is a tricky situation as the splitting logic really only works if you have a clear two-genome hybrid.

I remember having had such a case in the past, but it required a multistep process that was somewhat convoluted (see this paper: https://pubmed.ncbi.nlm.nih.gov/31665063/)

I'll paste here the relevant text from the materials and methods, I hope it helps giving you some inspiration:

In brief, sequencing reads were mapped to a Mus musculus (GRCm38)-derived genome, where SNPs between hybrid strains (C57BL6 and CAST/EiJ, or FvB and CAST/EiJ, or C57BL6 and PWK) had been masked by the ambiguity nucleobase N (N-masked genome). Aligned reads were then sorted into one of three BAM files: C57BL6 (genome 1), CAST/EiJ (genome 2), or unassigned. The females carrying conditional Dnmt3a/Dnmt3b double knockout (matDKO) were predominantly C57BL6; however, there was approximately 15% of 129 alleles remaining in the strain. Therefore, these data were run through a unique pipeline to allow for a complete mapping of the maternal allele. All datasets generated from matDKO/CAST embryos were first aligned to a B6/CAST N-masked genome, as above, but with the difference that all SNPs which were in common between 129 and CAST (~ 2 million) had been excluded. The data was then split against C57BL6 and CAST/EiJ, as above. FastQ reads of the unassigned fraction of reads were then recovered from the original FastQ files, and in the second step, these reads were then aligned to a 129S1/CAST genome (generated with SNPsplit_genome_preparation). Alignments were then SNPplit between 129 and CAST/EiJ, and the 129-specific reads were then combined with the C57BL6-specific reads (from step 1) to comprise a complete maternal allelic set. Raw sequencing reads and allelically mapped BAM files have been deposited into the Gene Expression Omnibus database (GSE124216

lelesama commented 1 year ago

Thank you for reply, as the paper mentioned above, FastQ reads of the unassigned fraction of reads were then recovered from the original FastQ files using the code but with error : bedtools bamtofastq -i trimmed_bismark_bt2.unassigned.name.bam -fq rerun.fq.gz

Could you give me a hint about recovering the unassigned fraction of reads in bam to fastq ?

FelixKrueger commented 1 year ago

I think you've got two options here:

  1. You could write a script that goes through the unassigned BAM file, and stores all of the readIDs in a data structure. Next, the script could run a single pass through your FastQ file or files (if paired-end), and also extracts the readID. If the readID was present in data structure, you can write the read out into a new FastQ file.

  2. you could potentially also use one of the many bam-to-fastq tools out there to achieve something similar, e.g. with

samtools fastq unassigned.bam > unassigned.fastq

I am not quite sure how this would handle paired-end data, but it might be worth a shot.

ddepierre commented 2 months ago

Hi @FelixKrueger and @lelesama,

I am currently facing the exact same issue, with a maternal strain being the F1 of B6 x CBA and a paternal PWK.

Can I skip the realignment step if my alignment reference genome already N-masks all the SNP possibility between my 3 potential strains of origin? Then I could directly feed SNPsplit for a second run with the unassigned.bam file?

Alternaltively, can we trick SNPsplit by modifying the SNP_table.txt.gz by adding an extra line for SNP that differ between B6 and CBA in my case? like:

   ID      Chr   Position  SNP value    Ref/SNP
18819008     5  48794752      1       C/T 
40491905    11  63643453      1       A/G 
44326884    12  96627819      1       T/A 
44326884    12  96627819      1       G/A 

With the SNP id 44326884 being either T/A when it is B6/PWK or G/A when it is CBA/PWK.

Will SNPsplit assigned the reads correctly to genome1 if it has T or G ?

Thanks! David

FelixKrueger commented 1 month ago

First of all, apologies for the late reply. If your alignment was performed against a genome that had all positions N-mkased which are different between the three different genomes, you can probably make something work with the sorting logic.

I am afraid in its current form SNPsplit won't let itself get tricked into handling a 3 way split though. In the case you describe where you've got several lines lines for the same position, the last one will over-write previous ones...

ddepierre commented 1 month ago

Hi Felix, Thanks for your answer, I generated a SNP vcf file combining PWK and CBA and removing SNP when they were not able to properly distinguish between my mat and pat genome and it seems to work. Basically I filtered out case2 and kept case1 in the following example:

--------1---2---
B6------A---A--- (mat)
CBA-----A---T--- (mat)
PWK-----G---T--- (pat)

It reduced conflicting reads from 150000 to 100000 reads and increased properly assigned reads to maternal genome from about 13.5M to 14.8M reads for example.

FelixKrueger commented 1 month ago

Great to hear you improved the results. It will still be good to keep in mind that this isn't totally ideal, but hopefully it can help you find answers to your questions!