frazer-lab / i2QTL-SV-STR-analysis

Code used to process and analyze structural variants and short tandem repeat variants profiled in 719 deeply sequenced whole genomes as part of the i2QTL consortium. This data set consists of sequencing data from the iPSCORE (Frazer lab) and HipSci projects.

GenomeSTRiP stitching query #2

Open emmakwiener opened 4 years ago

emmakwiener commented 4 years ago

Hi, I am a Human Genetics PhD student in South Africa, and we are working on a project looking at CNVs in Sub-Saharan African individuals. We have run Genome STRiP on ~1000 samples and, looking at the outputs, noticed the splitting of CNVs. In the literature we found your paper and accompanying code, which thoroughly address and solve this problem, so thank you so much for this work. We ran Genome STRiP on the entire cohort and so have a single VCF output. We were wondering whether it would be possible to use just the stitching portion of your code, or whether the other scripts are prerequisites to it. Thanks so much for your assistance. Regards, Emma

djakubosky commented 4 years ago

Hi Emma, Thanks so much for your kind words! I'm happy to help find a way to adapt my stitching code to suit your application. As it stands, there are several steps you'd have to run before you could use my stitching method (https://github.com/frazer-lab/i2QTL-SV-STR-analysis/blob/master/scripts/genome_strip_stitch_v4.py): ahead of it, I extract information from the VCF, annotate the variants, and save separate tables for variant information, genotype information, and quality annotations (the LQ tag matrix). The pipeline is also sex-aware and relies on a file mapping sample ID to sex so that stitching is more accurate on the X/Y chromosomes. Notably, I ran stitching after combining variant calls from multiple cohorts, which complicated matters further because the HipSci samples did not have Y chromosome data. In my case I chose to identify sites that needed to be stitched and then genotype them using the SVGenotyper module in Genome STRiP. The output of the pipeline is therefore an info file describing the clusters of variants that need to be stitched, plus a VCF file (with no genotypes) of these sites that served as input to SVGenotyper. This portion of my code relies on files that aren't on my GitHub. Do you also plan to do a "re-genotyping" step after stitching?

If you are willing to wait until early next week, I think it could be straightforward to adapt this command-line tool so that it has fewer dependencies and is more generally useful to others in the community.

If you are interested in a simpler approach to the problem, please also visit this repo (https://github.com/frazer-lab/cardips-ipsc-eqtl) from our previous iPSC eQTL project. You can look at this notebook (https://github.com/frazer-lab/cardips-ipsc-eqtl/blob/master/notebooks/CNV%20Processing.ipynb), which describes a similar CNV processing approach for Genome STRiP. However, instead of re-genotyping sites that likely needed stitching, we took the average copy number across the CNVs that were being merged. It might be more straightforward for you to adapt the code in that notebook for your purposes if you are in a crunch for time. This method works fairly well, but it has three caveats: it does not limit the distance between the CNVs being merged; it does not handle stitching of rare CNVs as well as my newer approach, which requires correlation between non-mode copy number samples, with added stringency in very rare cases; and it does not handle the X/Y chromosomes in any special way.
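To make the idea concrete, here is a minimal Python sketch of stitching by averaging copy number. It is not taken from either repository. The `CNV` class, the function names, and the `max_gap` and `min_corr` defaults are all hypothetical placeholders, and a plain Pearson correlation over all samples stands in for the stricter non-mode-sample criterion described above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class CNV:
    """One CNV site with one copy-number value per sample (hypothetical layout)."""
    chrom: str
    start: int
    end: int
    copy_numbers: List[float]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either vector has no variance."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def should_stitch(a: CNV, b: CNV, max_gap: int, min_corr: float) -> bool:
    """Assumed rule: same chromosome, close together, correlated copy number."""
    if a.chrom != b.chrom or b.start - a.end > max_gap:
        return False
    return pearson(a.copy_numbers, b.copy_numbers) >= min_corr


def stitch(cnvs: List[CNV], max_gap: int = 30_000, min_corr: float = 0.9) -> List[CNV]:
    """Greedily chain neighbouring sites, then average copy number per sample."""
    clusters: List[List[CNV]] = []
    for cnv in sorted(cnvs, key=lambda c: (c.chrom, c.start)):
        if clusters and should_stitch(clusters[-1][-1], cnv, max_gap, min_corr):
            clusters[-1].append(cnv)
        else:
            clusters.append([cnv])
    merged = []
    for cluster in clusters:
        # zip(*...) regroups the values so each tuple holds one sample's
        # copy numbers across all sites in the cluster.
        averaged = [mean(vals) for vals in zip(*(c.copy_numbers for c in cluster))]
        merged.append(CNV(cluster[0].chrom, cluster[0].start,
                          max(c.end for c in cluster), averaged))
    return merged


# Toy data: two adjacent fragments of one deletion/duplication and one distant site.
sites = [
    CNV("1", 500_000, 501_000, [2, 1, 2, 2, 2, 2]),
    CNV("1", 1_000, 2_000, [2, 2, 1, 2, 3, 2]),
    CNV("1", 2_500, 4_000, [2, 2, 1, 2, 3, 2]),
]
stitched = stitch(sites, max_gap=10_000)
```

The distance limit and the correlation threshold are the two knobs that separate this from plain averaging. Suitable values would need to be tuned on real data, and X/Y handling is deliberately left out.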

Let me know what you think, Best, David

emmakwiener commented 4 years ago

Hi David, Thank you so much for your response and willingness to assist us with this. I haven't been able to discuss these options with my colleagues yet, and we don't have a hard deadline, so next week would be fine. My initial thought is that the simpler solution would mostly work for us, as we are focusing on more common variants rather than ultra-rare ones and are not looking at X and Y just yet, but I do feel that including the distance between CNVs is a great addition. I will look at the CNV processing notebook with my colleagues tomorrow and then get back to you. Thank you so much again for your willingness to help. Regards, Emma


-- Emma Wiener, MBBCh

PhD Student Division of Human Genetics University of Witwatersrand

emmakwiener commented 4 years ago

Hi David, I had a discussion with my supervisors (Prof Zane Lombard and Prof Scott Hazelhurst) and colleagues on the project, and we would really like to use your newer stitching code if possible. Would it be possible to set up a Zoom call next week to discuss what adaptations would need to be made? Thanks so much. Regards, Emma

djakubosky commented 4 years ago

Hi Emma, No problem, I'm happy to share what I can about postprocessing methods and to help adapt my code to be useful. I would also be willing to do a Zoom call with you and/or your colleagues to discuss further. Send me an email at djakubos@ucsd.edu and we can schedule something.

Regards, David