eturro / mmseq

Haplotype, isoform and gene level expression analysis using multi-mapping RNA-seq reads
GNU General Public License v2.0

Making sample-specific transcript FASTAs through genotyping and phasing: #7

Open olneykimberly opened 8 years ago

olneykimberly commented 8 years ago

Hello, I'd like to make a custom transcriptome, but this section appears to be missing on GitHub.

eturro commented 8 years ago

Kimberly, I'll get back to you after the holidays.

eturro commented 8 years ago

Hi Kimberly

Regarding your previous question, the haploref.rb script was meant to be used with the output of polyHap2 (http://bgx.org.uk/software/mmseq), which was written by a colleague and is no longer maintained as far as I am aware.

The best way to make a custom transcriptome now is by using a modified version of mouse_strain_transcriptome.sh.

Do you have phased genotypes (SNPs and indels) on which to base the creation of your custom transcriptome already?

olneykimberly commented 8 years ago

Hello Bryce,

Thank you for getting back to me, I greatly appreciate you helping me out.

I have phased genotypes on which to base the creation of a custom transcriptome. I’m working with the 1000Genomes phase 3 data set. Now I’m not sure that I need to create a custom transcriptome, since I am working with humans. Can I just use the ready-to-download human transcriptome that you have listed on the GitHub page?

My end goal is to use MMSEQ for detecting allelic imbalance, and I believe I need to run haploref.rb, but I am confused about the inputs required. Do you have more information, or even small samples, showing how the inputs need to be formatted to run haploref.rb, or how I can go about making them? The inputs listed are:

cdna_file: default transcript FASTA file.
gff_file: GFF file containing structure annotation for transcripts in cdna_file.
pos_file: file containing, for each transcript, chromosome and positions of SNPs.
hap_file: file containing, for each transcript, two versions, e.g. suffixed _A and _B, on separate lines with the alleles for the two haplotypes at each position listed in the pos_file (A and B respectively).

I’m also confused about the output of mmseq and where I can obtain information on allelic imbalance once I’ve run through the pipeline. I apologize for not knowing more on this subject.

Thank you again. I really do appreciate your help.

Best, Kimberly

eturro commented 8 years ago

Hi Kimberly

I provide ready-made mouse strain-specific transcriptomes (which can be used for allelic imbalance in first-generation crosses) on the GitHub page, but of course, for humans, each individual will need his/her own distinct hybrid transcriptome derived from the phased genotypes, so the human transcriptomes I provide contain the reference haplotype only and are not suitable for analysis of allelic imbalance.

I would like to provide you with a script that creates individual-specific transcriptomes based on a VCF with phased genotypes. It shouldn't take long as I just need to make some changes to the existing mouse_strain_transcriptome.sh script. I think this will be better than using the old haploref.rb script.
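To make the idea of a hybrid, individual-specific transcriptome concrete, here is a minimal Python sketch. It is not the script discussed in this thread and not what mouse_strain_transcriptome.sh does. It assumes SNPs only, positions already expressed as 1-based coordinates within the transcript sequence, and phased genotypes written as in a VCF GT field (e.g. `0|1`). The function names and the toy sequence are hypothetical; the `_A`/`_B` suffixes follow the haplotype naming mentioned earlier in the thread.

```python
def apply_phased_snps(ref_seq, snps):
    """Return (hap_a, hap_b) built from a reference transcript sequence.

    snps: iterable of (pos, ref, alt, gt), where pos is 1-based within
    ref_seq (an assumption of this sketch) and gt is a phased genotype
    such as "0|1". Indels, strand and splicing are deliberately ignored.
    """
    hap_a = list(ref_seq)
    hap_b = list(ref_seq)
    for pos, ref, alt, gt in snps:
        if "|" not in gt:
            raise ValueError(f"unphased genotype at position {pos}: {gt}")
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError(f"only SNPs are handled, got {ref}>{alt}")
        idx = pos - 1
        if ref_seq[idx] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        first, second = gt.split("|")
        # The allele left of "|" goes to haplotype A, the right one to B.
        if first == "1":
            hap_a[idx] = alt
        if second == "1":
            hap_b[idx] = alt
    return "".join(hap_a), "".join(hap_b)


def to_fasta(name, hap_a, hap_b):
    """Format the two haplotype sequences as FASTA records."""
    return f">{name}_A\n{hap_a}\n>{name}_B\n{hap_b}\n"


# Toy data, invented for illustration only.
ref = "ACGTACGT"
snps = [(2, "C", "T", "0|1"), (5, "A", "G", "1|0"), (8, "T", "C", "1|1")]
hap_a, hap_b = apply_phased_snps(ref, snps)
print(to_fasta("tx1", hap_a, hap_b))
```

A real pipeline would also have to map genomic SNP positions onto spliced transcripts and handle reverse-strand genes and indels, which is why a dedicated script is preferable to a sketch like this.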

Would you mind providing me with a small example VCF (say, a small part of one chromosome) with the phased genotypes, and also telling me which human reference build you used? I can then use this example VCF to check that my code will work correctly for you.

Afterwards I'll explain in detail how to obtain the estimates of imbalance (it's essentially what is currently written under Reference files/Ready to download:/Mus musculus on the GitHub page but I'll add some more detailed explanation).

Ernest

olneykimberly commented 8 years ago

Hello,

That would be wonderful if you could provide me with a script that will create individual-specific transcriptomes.

I did not use a human reference build directly: since I am working with the 1000Genomes data set, I used their ready-made VCF file. The 1000Genomes phase 3 variant set was produced using alignments to NCBI GRCh37. I took a subset of that VCF file for the population I am working with (CEU) using vcftools, as shown below.

$ vcftools --gzvcf kg.ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --keep geuvadis.CEU.kG.ind --remove-filtered-all --remove-indels --phased --mac 2 --max-alleles 2 --recode --recode-INFO-all --out kg.chr22.CEU.mac2

kg.ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz # taken from the 1000Genomes website
geuvadis.CEU.kG.ind # list of individuals I am working with
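
For readers unfamiliar with what such a filtered VCF contains, the following Python sketch shows roughly how one sample's phased genotype can be read from a VCF data line, keeping only biallelic, phased SNPs. It is an illustration only, not a replacement for vcftools: it checks phasing for a single sample, whereas `--phased` works at the site level, and it ignores the FILTER and allele-count criteria. The function name and the records are hypothetical toy data, not real 1000Genomes entries.

```python
def phased_snp_genotype(vcf_line, sample_index):
    """Return (chrom, pos, ref, alt, gt) for a biallelic, phased SNP, else None.

    sample_index is the 0-based index among the sample columns, which
    start at the 10th tab-separated column of a VCF data line.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    fmt_keys = fields[8].split(":")
    sample_values = fields[9 + sample_index].split(":")
    gt = sample_values[fmt_keys.index("GT")]
    if "," in alt:
        return None  # more than two alleles at this site
    if len(ref) != 1 or len(alt) != 1:
        return None  # indel or other non-SNP variant
    if "|" not in gt:
        return None  # genotype is not phased for this sample
    return chrom, int(pos), ref, alt, gt


# Toy header and records, invented for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA06984\tNA06985"
lines = [
    "22\t100\ttoy1\tA\tG\t100\tPASS\t.\tGT\t0|1\t1|1",
    "22\t200\ttoy2\tAT\tA\t100\tPASS\t.\tGT\t0|1\t0|0",   # indel
    "22\t300\ttoy3\tC\tT,G\t100\tPASS\t.\tGT\t1|2\t0|0",  # multi-allelic
    "22\t400\ttoy4\tG\tC\t100\tPASS\t.\tGT\t0/1\t0|0",    # unphased in NA06984
]
samples = header.split("\t")[9:]
idx = samples.index("NA06984")
kept = [rec for rec in (phased_snp_genotype(l, idx) for l in lines) if rec is not None]
print(kept)
```

Only the first toy record survives for NA06984; the other three are dropped as an indel, a multi-allelic site, and an unphased genotype respectively.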

I attached the first 500 lines of the chr22 vcf file that was made from the above command. Please let me know if you need more information. Thank you greatly for your help!

Best, Kimberly

eturro commented 8 years ago

Hi Kimberly

The attachment didn't make it through. Would you be able to provide a link instead, or otherwise email it directly to me? I have some time this weekend to work on the code for this and I could really do with an example file for testing.

Thanks! Ernest

olneykimberly commented 8 years ago

Hello,

I’m not sure how to email you directly. Would I send the email to notifications@github.com?

Here is the command to get the VCF file from NCBI:

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Once the download was complete, I ran the vcftools command below to keep only the individuals I wanted in my data set and to remove indels (the data is already phased):

vcftools --gzvcf kg.ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --keep geuvadis.CEU.kG.ind --remove-filtered-all --remove-indels --mac 2 --max-alleles 2 --recode --recode-INFO-all --out kg.chr22.CEU.mac2

Here is the list of the CEU population sample IDs. ( --keep geuvadis.CEU.kG.ind) NA06984 NA06985 NA06986 NA06989 NA06994 NA07037 NA07048 NA07051 NA07056 NA07347 NA07357 NA10847 NA10851 NA11829 NA11830 NA11831 NA11832 NA11840 NA11843 NA11881 NA11892 NA11893 NA11894 NA11918 NA11920 NA11930 NA11931 NA11992 NA11994 NA11995 NA12004 NA12005 NA12006 NA12043 NA12044 NA12045 NA12058 NA12144 NA12154 NA12155 NA12156 NA12234 NA12249 NA12272 NA12273 NA12275 NA12282 NA12283 NA12286 NA12287 NA12340 NA12341 NA12342 NA12347 NA12348 NA12383 NA12399 NA12400 NA12413 NA12489 NA12546 NA12716 NA12717 NA12718 NA12749 NA12750 NA12751 NA12760 NA12761 NA12762 NA12763 NA12775 NA12776 NA12777 NA12778 NA12812 NA12813 NA12814 NA12815 NA12827 NA12829 NA12830 NA12842 NA12843 NA12872 NA12873 NA12874 NA12889 NA12890

Please let me know if you have any questions. Thank you so much for your help. I’m really looking forward to using MMSEQ!

Best, Kimberly

eturro commented 8 years ago

Thanks for sending the commands. If you could email the example file to et341@cam.ac.uk as well, that would be great.

Best wishes Ernest
