Phased human genomes - Githubissues

LRizzardi1 commented 6 years ago

I was wondering how SNPsplit would handle phased human genomes. I didn't see anything in the documentation that seemed to take it into account but I could've missed it. Thanks!

FelixKrueger commented 6 years ago

Hi Lindsay,

If you have phased SNP information for human data, SNPsplit should work in pretty much the same way. There are currently two ways of getting to the point where you can align the data to the N-masked genome, use SNPsplit, and then carry on with your downstream analyses:

You could modify the SNPsplit_genome_preparation script to work with your VCF file. This would probably require changes in a few places, but I have done this for phased human data myself before. If the chromosomes are not named in the same way as they are in the mouse genomes VCF files, I believe one main part was that you need to change the chromosomes it is using to:

    # HUMAN GENOME 
    @chroms = (1..22,'X','Y','MT');

I would be happy to get this to work for you if you could supply a copy of your VCF file (because they all look different...). This option would have the advantage that it will generate an N-masked genome as well as the SNP file which is required later on for the SNPsplit processing itself.
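Before editing the @chroms line, it can help to check how the chromosomes are actually named in your own VCF file (for example "1" versus "chr1"). The following is only a rough Python sketch for that check, not part of SNPsplit; the function names `open_vcf`, `vcf_chromosomes` and `unexpected_chromosomes` are made up for illustration, and it assumes a standard tab-separated VCF with the chromosome in the first column.

```python
import gzip
from typing import Iterable, List

# Chromosome names matching the @chroms line above: 1..22, X, Y, MT.
EXPECTED_CHROMS = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def open_vcf(path: str):
    """Open a plain or gzip-compressed VCF file as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def vcf_chromosomes(lines: Iterable[str]) -> List[str]:
    """Return the chromosome names in order of first appearance.

    Header lines (starting with '#') and blank lines are skipped.
    """
    seen = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t", 1)[0]
        seen.setdefault(chrom, None)  # dicts keep insertion order
    return list(seen)


def unexpected_chromosomes(found: Iterable[str], expected=EXPECTED_CHROMS) -> List[str]:
    """Return the names found in the VCF that are not in the expected list."""
    expected_set = set(expected)
    return [chrom for chrom in found if chrom not in expected_set]
```

With a real file you would call something like `unexpected_chromosomes(vcf_chromosomes(open_vcf("my_variants.vcf.gz")))` (the file name is a placeholder). Anything it returns is a name that the @chroms list would not pick up as written.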

The other option is that you prepare the N-masked genome as well as the SNP file for SNPsplit in some other way yourself (and I am afraid you would be on your own there).
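For anyone going down that second route: N-masking itself just means that the base at every SNP position in the reference sequence is replaced by N. A minimal Python sketch of the idea is below. The function name `n_mask` is hypothetical, positions are assumed to be 1-based as in a VCF file, and a real genome would of course need to be processed chromosome by chromosome from FASTA files.

```python
from typing import Iterable


def n_mask(sequence: str, snp_positions: Iterable[int]) -> str:
    """Return a copy of `sequence` with the given 1-based positions set to 'N'."""
    bases = list(sequence)
    for pos in snp_positions:
        if not 1 <= pos <= len(bases):
            raise ValueError(
                f"SNP position {pos} is outside the sequence (length {len(bases)})"
            )
        bases[pos - 1] = "N"  # convert the 1-based position to a 0-based index
    return "".join(bases)
```

For example, `n_mask("ACGTACGT", [2, 5])` gives `"ANGTNCGT"`.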

Once the N-masked genome has been generated, you can:

1. index it with either Bismark, Bowtie2, HISAT2 etc.,
2. run your alignments, and then
3. use SNPsplit on the resulting BAM file(s).
4. In case of bisulfite data you would have to run deduplicate_bismark and then
5. the bismark_methylation_extractor after they have been split up by SNPsplit

Just as a comment, the information of the phased genome is preserved in a way, because the SNP given as REF will be used as Genome 1, and ALT is used as Genome 2. I hope this helps?! Felix

cloudred20 commented 5 years ago

Hi, I'm studying allele-specific methylation in human cancer cell lines and would like to prepare hg19 using SNPsplit_genome_preparation and the following VCF file: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-All.vcf.gz.

1. What are the major changes I will have to make in the SNPsplit_genome_preparation script?
2. Also, the Bismark user guide mentions that deduplication is not recommended for RRBS. But as you've commented above, is it necessary when using SNPsplit?

FelixKrueger commented 5 years ago

Hi @megha20 ,

To 1: I downloaded the VCF file you linked, and took a quick look. It would appear that this is a list of all SNPs annotated in dbSNP? Before you get started with this endeavour, I would like to make sure that a few points are clear. SNPsplit is not a generalised SNP tool that will work for all situations, but is rather meant to discriminate files if both parental genotypes are known. You could possibly still use all dbSNP SNPs (as long as they are clearly defined), and look at cancer cell lines. Since the genomes are not phased, the only thing you could look at would be allelic imbalance (in expression, ChIP-Seq binding or whatever you want to look at), but you can't truly assign reads to parental alleles.

If you wanted to go ahead with this, there is good and bad news. The good news is that you don't really have to deal with strains and so on, because you are interested in all of the SNPs. This is however also a big problem, as the file you linked has more than 320 million lines! Since this has to be held in memory, such an 'all-dbSNP' approach would consume a HUGE amount of RAM (probably more than 100GB).

You could either change the entire code that looks for high confidence SNPs in the VCF file, or write a new script that will simply write out every SNP that has a single REF and a single ALT base into a folder called SNPs_<Strain_Name>, and then use the option --skip_filtering:

--skip_filtering              This option skips reading and filtering the VCF file. This assumes that a folder named
                              'SNPs_<Strain_Name>' exists in the working directory, and that text files with SNP information
                              are contained therein in the following format:

                                          SNP-ID     Chromosome  Position    Strand   Ref/SNP
                              example:   33941939        9       68878541       1       T/G
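A "new script" of the kind described above could look roughly like the following Python sketch. It reads the VCF line by line (so the whole file is never held in memory), keeps only entries with a single REF and a single ALT base, and writes them in the five-column layout from the help text. Several details are assumptions on my part and should be checked against what SNPsplit_genome_preparation writes itself: one tab-separated file per chromosome named `chr<name>.txt`, no header line, and a strand value of 1 for every SNP. The function names are hypothetical.

```python
import os
from typing import Iterable, Optional, Tuple

BASES = {"A", "C", "G", "T"}


def vcf_line_to_snp_row(line: str) -> Optional[Tuple[str, str, str, str, str]]:
    """Turn one VCF line into (SNP-ID, Chromosome, Position, Strand, Ref/SNP).

    Returns None for header lines and for anything that is not a simple
    single-base REF to single-base ALT substitution.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return None
    chrom, pos, snp_id, ref, alt = fields[:5]
    ref, alt = ref.upper(), alt.upper()
    # Multi-allelic sites ("G,T") and indels ("AT") fail this check.
    if ref not in BASES or alt not in BASES:
        return None
    # Assumption: strand is always written as 1.
    return (snp_id, chrom, pos, "1", f"{ref}/{alt}")


def write_snp_files(vcf_lines: Iterable[str], out_dir: str) -> int:
    """Stream VCF lines into per-chromosome SNP files; return the SNP count."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    written = 0
    try:
        for line in vcf_lines:
            row = vcf_line_to_snp_row(line)
            if row is None:
                continue
            chrom = row[1]
            if chrom not in handles:
                # Assumption: one file per chromosome, named chr<name>.txt.
                path = os.path.join(out_dir, f"chr{chrom}.txt")
                handles[chrom] = open(path, "w")
            handles[chrom].write("\t".join(row) + "\n")
            written += 1
    finally:
        for handle in handles.values():
            handle.close()
    return written
```

You would point `out_dir` at a folder such as `SNPs_dbSNP` in the working directory (the strain name here is just a placeholder) and pass in an open, decompressed VCF file handle as `vcf_lines`.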

Regarding 2:

Unless you have used a UMI approach for the RRBS, it is indeed recommended not to deduplicate. SNPsplit itself doesn't really care about what you feed it with.

I hope this helps, Best, Felix

hmyh1202 commented 5 years ago


nice

FelixKrueger / SNPsplit

Phased human genomes #22