Duhyadi / Deleterious-alleles-in-landraces-of-maize

Convert vcf to fasta #1

Open Duhyadi opened 5 years ago

Duhyadi commented 5 years ago

Hi, can you help me? I would be grateful.

My intention (at least for now) is to identify deleterious alleles using plant-specific software called BAD MUTATIONS, BLAST Aligned-Deleterious (BAD-M).

Introduction. It was not easy to install the dependencies. I had to familiarize myself with Python, Anaconda and Bioconda. At first I thought my questions would be about that, but fortunately I managed to move forward. BAD MUTATIONS depends on the following (installed and available in your $PATH or sys.path in Python):

 GNU Bash >= 3.2
 Python >= 2.6.x
 Biopython 1.6x 
 argparse (Python library) If using Python 2.6
 BLAST+ >= 2.2.29 
 PASTA 
 HyPhy 2.2.x 
 cURL 

To start the analysis, the following input files are required:

The FASTA input:

    >CBF3_Morex
    ATGTCTCCCAC...
    …

And the substitutions file:

    21 SNP_1
    45 SNP_2
    50 SNP_3
    100 SNP_4
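As a minimal sketch of how these two inputs could be read and sanity-checked before running BM, assuming the FASTA uses standard `>` headers and the substitutions file has two whitespace-separated columns (position, SNP ID). The function names `read_fasta` and `read_substitutions` and the toy data are hypothetical and are not part of BAD MUTATIONS:

```python
def read_fasta(lines):
    """Return {record name: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def read_substitutions(lines):
    """Return [(position, snp_id), ...] from two-column substitution lines."""
    substitutions = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        substitutions.append((int(fields[0]), fields[1]))
    return substitutions


# Toy data (hypothetical, not the real CBF3_Morex sequence).
fasta = read_fasta([">toy_gene", "ATGTCT", "CCCAC"])
subs = read_substitutions(["3 SNP_1", "9 SNP_2"])
print(fasta)
print(subs)
```

With real files, the same functions can be fed `open("input.fasta")` and `open("input.subs")` (both file names are placeholders).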

My questions are specifically about the input files:

  1. Just to try things out and confirm that the BM installation works, I thought of starting a test analysis by converting the VCF file in Data_ Arteaga_et_al_2016 to FASTA. I know the results BM returns would be wrong, since the VCF does not contain the complete sequences; that is, it does not contain the codons (triplets) that the analysis requires. I still thought the test was worth doing. However, the recommendations in the "VCF to FASTA #693" issue are not clear to me. This is the command given there:

       bcftools +setGT in.vcf -- --target-gt . --new-gt 0p

I think the ideal approach would be to impute the missing genotypes with specialized software, but I'm not sure. Any recommendations?

  2. I am also not sure how to obtain the substitutions file. Can it be obtained from the ped file?

    Thank you for your attention.
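One possible starting point for the substitutions question is the VCF itself, since each record already carries a position and an ID. The sketch below only lists them; it assumes a tab-separated VCF, and the function name `vcf_positions` and the fallback ID format are hypothetical. It also assumes that BM expects positions relative to each gene's sequence rather than genome coordinates, so the output would still need converting and should be treated as a first step only:

```python
def vcf_positions(vcf_lines):
    """Return [(chrom, pos, snp_id), ...] for every record in a VCF."""
    rows = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, snp_id = fields[0], int(fields[1]), fields[2]
        if snp_id == ".":
            # Hypothetical fallback ID for records without one.
            snp_id = f"{chrom}_{pos}"
        rows.append((chrom, pos, snp_id))
    return rows


# Toy VCF lines (hypothetical data).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t21\tSNP_1\tA\tG\t.\t.\t.",
    "1\t45\t.\tC\tT\t.\t.\t.",
]
for chrom, pos, snp_id in vcf_positions(toy_vcf):
    print(chrom, pos, snp_id, sep="\t")
```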

LauraMCE commented 5 years ago

Hi! I would like to help you, but I don't know exactly what data I should use! Please indicate data location in the issue.

LauraMCE commented 5 years ago

Hi! Here is the issue where the code for VCF to FASTA comes from

Rodolfo47 commented 5 years ago

Hi, it looks like you need the reference genome of your study species. I suggest you first download the reference genome in FASTA format and put it in your repository.

LauraMCE commented 5 years ago

Hi! Here is all the info about the +setGT plugin. I hope it will be helpful.

Rodolfo47 commented 5 years ago

Once you have all your required files, you could try running this script like this:

./vcf2fq.pl -f <input.fasta> <all-site.vcf> > <output.fastq>

I think you must install Perl first.

LauraMCE commented 5 years ago

Hi! I found another command that does what you need. Check it out:

bcftools consensus all-site.vcf.gz < input.fasta > output.fasta

FernandaDiaz12 commented 5 years ago

You can download bcftools here. Then decompress it and run:

cd bcftools-1.9

./configure --prefix=/where/to/install

make install

Martinez-Gregorio-Hector commented 5 years ago

This is the link to download the Zea mays reference genome: https://plants.ensembl.org/info/website/ftp/index.html

LauraMCE commented 5 years ago

Hi! Here is some information about indexing a FASTA reference genome with SAMtools.

FernandaDiaz12 commented 5 years ago

Hi, to use a reference genome it has to be indexed.

FernandaDiaz12 commented 5 years ago

Hi! Here are the things that we have to do BEFORE trying to convert to fasta:

bcftools consensus -f ref_geneA.fa calls.vcf.gz > consensus.fa

Melcatus commented 5 years ago

About your comment regarding indexing with BWA, this is the command:

bwa index -a bwtsw reference.fa

where -a bwtsw specifies that we want to use the indexing algorithm that is capable of handling the whole human genome.

-a sets the index algorithm ("bwtsw" for large genomes and "is" for small genomes)

necrosnake91 commented 5 years ago

Hi! As I told you last Wednesday, you can use STAR to make your index. I'm not sure whether STAR can perform this task for maize, but I leave you the general command to do it:

STAR --runThreadN <number_of_threads> \
--runMode genomeGenerate \
--genomeDir <path_to_index_directory> \
--genomeFastaFiles <path_to_FASTA_file> \
--sjdbGTFfile <path_to_GTF_file> \
--sjdbOverhang 99

Hope this tool will be helpful for your analysis.

FernandaDiaz12 commented 5 years ago

Hi, again!

Here is what I've been trying to run to index the reference genome. Even though I've been running it on the lab's server, I have struggled, since it is a very big genome.

Nevertheless...

When the indexed genome is done, I'll try to upload it here via WeTransfer.

Here is the code:

##First clone the repository Deleterious-alleles-in-landraces-of-maize
git clone https://github.com/Duhyadi/Deleterious-alleles-in-landraces-of-maize.git
cd Deleterious-alleles-in-landraces-of-maize

#Then create a directory for the reference genome
mkdir Maize
cd Maize

Download reference genome

##Download Reference genome Zea_mays.B73_RefGen_v4.dna_sm.toplevel.fa.gz
#Updated 5/30/19
# From ensembl/plants > Zea mays > Download DNA sequence (FASTA) 

wget ftp://ftp.ensemblgenomes.org/pub/plants/release-44/fasta/zea_mays/dna/Zea_mays.B73_RefGen_v4.dna_sm.toplevel.fa.gz
##An easier way to download such big files is to use axel instead of wget
wget http://wilmer.gaast.net/downloads/axel-1.0b.tar.gz
tar -zxvf axel-1.0b.tar.gz
cd axel-1.0b
./configure
make
make install
##Go back to the Maize directory so the genome is saved there
cd ..
###Then download the Reference genome 40-60% faster
axel ftp://ftp.ensemblgenomes.org/pub/plants/release-44/fasta/zea_mays/dna/Zea_mays.B73_RefGen_v4.dna_sm.toplevel.fa.gz

Download Annotation

##Download Annotation Zea_mays.B73_RefGen_v4.44.gff3.gz
#Updated 6/2/19
# From ensembl/plants > Zea mays > Gene annotation/gff3
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-44/gff3/zea_mays/Zea_mays.B73_RefGen_v4.44.gff3.gz

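##Create output directory
mkdir Star_index

Then run STAR to index the genome as Rodolfo commented here

##Go back to the repository root and decompress the downloaded files first
cd ..
gunzip Maize/Zea_mays.B73_RefGen_v4.dna_sm.toplevel.fa.gz Maize/Zea_mays.B73_RefGen_v4.44.gff3.gz
STAR --runThreadN 18 --runMode genomeGenerate --genomeDir Maize/Star_index --genomeFastaFiles Maize/Zea_mays.B73_RefGen_v4.dna_sm.toplevel.fa --sjdbGTFfile Maize/Zea_mays.B73_RefGen_v4.44.gff3 --sjdbOverhang 99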

Install bcftools

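wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
##Extract the archive before entering the directory
tar -xjf bcftools-1.9.tar.bz2
cd bcftools-1.9
##configure needs an absolute path for --prefix
./configure --prefix="$PWD/../Maize"
make
make install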

Compress vcf_file

cd .. 
cd Arteaga_et_al_2016
cd Data 
bgzip new_final_26_march.vcf

Hope it helps :)

necrosnake91 commented 5 years ago

Hi again!

Looking at the code that Fer left for you, I've realized that the annotations are in GFF3 format. As far as I know, for GFF3 you have to add the argument --sjdbGTFtagExonParentTranscript Parent alongside --sjdbGTFfile.

See you at the next class!

Duhyadi commented 5 years ago

We can try this. Thanks, everyone.

bcftools consensus -f ~/Documents/2020_1/Clase_Camille/Fernanda_genoma_indexado/Zea_mays.B73_RefGen_v4.dna_sm.toplevel.fa -s ~/Documents/2020_1/Clase_Camille/Mi_nuevo_repo/Arteaga_et_al_2016/Data/new_final_26_march.vcf.gz -o Fernanda.fa

LauraMCE commented 5 years ago

Hi! From the Samtools manual we realized that -s needs a sample name, not a file. So, you can try this:

bcftools consensus -f ~/Documents/2020_1/Clase_Camille/Fernanda_genoma_indexado/Zea_mays.B73_RefGen_v4.dna_sm.toplevel.fa ~/Documents/2020_1/Clase_Camille/Mi_nuevo_repo/Arteaga_et_al_2016/Data/new_final_26_march.vcf.gz > Fernanda.fa

Duhyadi commented 5 years ago

We managed to convert from vcf to fasta with the following steps:

  1. The reference genome was downloaded.
  2. The reference genome was indexed with STAR.
  3. The VCF file was compressed and indexed. First it was necessary to run bgzip file.vcf and then tabix file.vcf.gz.
  4. It is important to have the files in the same directory.
  5. The final command is bcftools consensus -c -f ../Data/Fernanda_genoma_indexado/Zea_mays.B73_RefGen_v4.dna_sm.toplevel.fa ../Data/new_final_26_march.vcf.gz -o Fernanda_corn.fa

    The sad part ):

    We only got a small FASTA sequence.
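To make the idea behind the consensus step concrete, here is a minimal sketch in plain Python of applying SNPs from a VCF to a reference sequence. It is not how bcftools is implemented. It assumes simple single-base SNPs, a tab-separated VCF, and chromosome names that match between the VCF and the FASTA. The function name `apply_snps` and the toy data are hypothetical:

```python
def apply_snps(reference, vcf_lines):
    """Return a copy of reference ({name: sequence}) with ALT alleles applied."""
    edited = {name: list(seq) for name, seq in reference.items()}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _snp_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if chrom not in edited:
            continue  # records on sequences missing from the FASTA are ignored
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue  # this sketch handles single-base, single-ALT SNPs only
        index = int(pos) - 1  # VCF positions are 1-based
        if not 0 <= index < len(edited[chrom]):
            raise ValueError(f"position {pos} is outside sequence {chrom}")
        # Compare case-insensitively: soft-masked genomes use lowercase bases.
        if edited[chrom][index].upper() != ref.upper():
            raise ValueError(f"REF mismatch at {chrom}:{pos}")
        edited[chrom][index] = alt
    return {name: "".join(bases) for name, bases in edited.items()}


# Toy reference and VCF lines (hypothetical data).
toy_reference = {"1": "ACGTACGT"}
toy_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t2\tsnp1\tC\tT",
    "1\t5\tsnp2\tA\tG",
]
print(apply_snps(toy_reference, toy_vcf))
```

The sketch also shows one thing worth checking on real data: a record whose chromosome name is absent from the FASTA changes nothing in the output.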

FernandaDiaz12 commented 5 years ago

Here you will find the BioStars question "Extract SNPs flanking sequences based on VCF and genome Fasta files".

Duhyadi commented 4 years ago

I followed the BioStars instructions, which use Pysam. The files used were the VCF file (which contains the SNPs) and the reference genome (FASTA). However, I could not carry out the transformation. The following instructions were not clear to me:

The error can be seen here and here.

I looked in the Pysam manual and I think that such a conversion cannot be done. I think ... I really don't know. For now I will stop trying to run the analysis with BAD MUTATIONS.
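For reference, the flanking-sequence extraction itself can be sketched without Pysam. This is not the BioStars answer. It is a plain Python illustration that assumes the chromosome sequence is already loaded as a string, which would use a lot of memory for the full maize genome, and that positions are 1-based as in a VCF. The function name `flanking` is hypothetical:

```python
def flanking(sequence, pos, flank):
    """Return (left flank, SNP base, right flank) around a 1-based position."""
    index = pos - 1  # convert the 1-based VCF position to a 0-based index
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    # Flanks are truncated at the sequence ends instead of raising an error.
    left = sequence[max(0, index - flank):index]
    right = sequence[index + 1:index + 1 + flank]
    return left, sequence[index], right


# Toy sequence (hypothetical data).
left, base, right = flanking("ACGTACGTAC", 5, 2)
print(f"{left}[{base}]{right}")
```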