Closed Ulixmanna closed 10 months ago
Sorry for the late reply, I was travelling all last week. On top of that, the issue is marked as completed; does that mean you found the answers in the meantime?
Just generally: SNPsplit is designed to use the VCF files of the Mouse Genomes Project, which tend to have high-confidence homozygous SNP calls for some 50 strains (compared to the reference strain C57BL/6). So the genotype calls are typically 0/0 (same as reference), or 1/1 or 2/2 for homozygous variants.
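To make the genotype notation concrete, here is a minimal Python sketch that sorts a GT string into the categories mentioned above. It is not part of SNPsplit; the function name `classify_gt` and the category labels are made up for illustration, and it assumes diploid calls.

```python
import re

def classify_gt(gt: str) -> str:
    """Classify a diploid VCF GT value such as '0/0', '1/1' or '0|1'.

    Hypothetical helper for illustration only; not SNPsplit code.
    """
    # GT alleles are separated by '/' (unphased) or '|' (phased).
    alleles = re.split(r"[/|]", gt)
    if len(alleles) != 2 or any(not a.isdigit() for a in alleles):
        return "missing"   # e.g. './.' or a non-diploid call
    first, second = alleles
    if first != second:
        return "het"       # e.g. 0/1 or 1/2
    return "hom_ref" if first == "0" else "hom_alt"

for gt in ("0/0", "1/1", "2/2", "0/1"):
    print(gt, classify_gt(gt))
```

Only the `hom_ref` and `hom_alt` categories correspond to the kind of calls the Mouse Genomes Project VCF files typically contain.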
In your case, you would have to arrive at a situation where you use ovaries from breed A, and perform variant calling against sperm from breed B. The VCF file then needs to be in the same format as the one from the Mouse Genomes Project (e.g. only contain homozygous variants of breed B against breed A); alternatively, if you find heterozygous calls (e.g. 0/1 or 1/0) you would have to change the logic within the SNPsplit genome preparation. There are a number of closed issues where people have tried a similar approach for non-mouse species, but I am afraid it is a little fiddly, as it is not the original intended use case for SNPsplit.
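As a rough illustration of "only homozygous variants of breed B against breed A", the following Python sketch checks a single VCF data line. It is not SNPsplit code: the helper names and the column indices `col_a` and `col_b` (zero-based, so the first sample is column 9) are assumptions, and a real VCF would need further checks (indels, multi-allelic sites, missing fields).

```python
def gt_of(fields, sample_col):
    """Return the GT value of one sample column, or None if unavailable."""
    fmt = fields[8].split(":")             # FORMAT column, e.g. GT:DP
    if "GT" not in fmt:
        return None
    values = fields[sample_col].split(":")
    i = fmt.index("GT")
    # Treat phased and unphased calls the same way.
    return values[i].replace("|", "/") if i < len(values) else None

def is_informative(line, col_a, col_b):
    """True if sample A is 0/0 and sample B is a homozygous variant."""
    fields = line.rstrip("\n").split("\t")
    gt_a = gt_of(fields, col_a)
    gt_b = gt_of(fields, col_b)
    if gt_a != "0/0" or gt_b is None:
        return False
    x, _, y = gt_b.partition("/")
    return x == y and x.isdigit() and x != "0"

# Hypothetical example line: breed A in column 9, breed B in column 10.
example = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:20\t1/1:15"
print(is_informative(example, 9, 10))
```

Lines for which this returns False (for example a 0/1 call in breed B) are the ones that would need either removing or different handling in the genome preparation.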
If you wanted to use E. coli for normalisation in CUT&Tag, I don't think this needs any N-masking, but I haven't ever done this myself, to be honest.
Hi, I now have 6 ovaries from cattle breed A and 1 sperm sample from breed B for whole-genome resequencing, and I called SNPs using GATK (the BWA alignment uses the NCBI reference genome of breed A). The process is:
1) GATK hard filtering
2) Keep only the sites that passed the filters, keeping their alleles
3) snp.vcf
Based on the FILTER value in column 7, FI information is added in column 9 (PASS is defined as 1, FAIL is defined as 0, and the header is modified to ##FORMAT=)
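The FI step described above could be sketched roughly as follows in Python. This is an illustration only, not the exact script that was used: the function name `add_fi` is made up, it assumes tab-separated lines with at least one sample column, it appends the FI value to every sample column as well as to the FORMAT column, and it leaves the header lines untouched.

```python
def add_fi(line: str) -> str:
    """Append an FI field to one VCF data line based on the FILTER column.

    Hypothetical helper: FI is 1 when FILTER is PASS, otherwise 0.
    Header lines (starting with '#') are returned unchanged.
    """
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:                    # no FORMAT/sample columns
        return line
    flag = "1" if fields[6] == "PASS" else "0"   # column 7 is FILTER
    fields[8] += ":FI"                           # column 9 is FORMAT
    fields[9:] = [sample + ":" + flag for sample in fields[9:]]
    return "\t".join(fields) + "\n"

# Hypothetical example line with two samples.
print(add_fi("1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:20\t1/1:15\n"), end="")
```

Because every site in snp.vcf already passed filtering, this gives FI = 1 for all samples at all sites.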
My aim is to perform allele-specific expression analysis on RNA-seq data of hybrid embryos produced from breed A ovaries and breed B sperm, but I encountered two problems while constructing the N-masked genome:
Problem 1: When running
SNPsplit_genome_preparation --reference_genome reference --vcf snp.vcf --strain ovary1 --strain2 sperm --dual_hybrid
should --strain be set to any one of the 6 ovaries of cattle breed A (which is what I'm doing now), or should it be set to all 6 ovary samples to reflect breed A's variation more completely (and if so, how exactly should this be done)?
Problem 2: What exactly are the two indicators (GT and FI) that SNPsplit uses to judge high-confidence SNPs? All SNP loci in my filtered snp.vcf file already have FI = 1, and yet more than 400,000 loci are still filtered out as low confidence.
My VCF file: