Closed OmerRonen closed 3 years ago
Thank you for using py-pgatk Omer.
First, the mutations in theVCF
file have to be annotated using VEP, for instance.
Next, the gene set that was used for the annotation in the first step should be convrted from GTF
to FASTA
format using gffread
.
Thus the input to the gffread
would be:
-w: some name to write the outputfile to
-g: genome fasta file (e.g. GRCh38.fa)
the last parameter is path to the gtf file that was used to annotate the VCF
file, e.g GRCh38.102.gtf
Finally, the GTF
and the FASTA
files should be given as input to the vcf-to-proteindb
command along with the VCF
to genrate the mutated protein database.
Please let me know if this does not work for you. /Husen
@husensofteng can you put the commands needs to perform all the tasks?
Hi @husensofteng, thanks for the answer. When I run the command using the suggested files
gffread -w out.fasta -g Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz ensembl_files/Homo_sapiens.GRCh38.102.gtf.gz
I get the following error:
Error: sequence lines in a FASTA record must have the same length!
Am I doing something wrong?
Sorry for the delay to reply
It seems your FASTA
file does not follow the standard format.
Could you paste here a few lines from you FASTA
file, also what is the source of the FASTA
file?
Hi @husensofteng - thank for answering I used the fasta file you've suggested in https://github.com/bigbio/py-pgatk/issues/29#issuecomment-762933914
-g: genome fasta file (e.g. GRCh38.fa)
gffread
exprects the files to be extracted. Could you extract/unzip the files first and try again
e.g.
gunzip *.gz
@husensofteng thanks for the help, I now have another problem. I try to run
pypgatk_cli vcf-to-proteindb --vep_annotated_vcf $BASE_PATH/file_annotated.vcf --input_fasta $BASE_PATH/file.fasta --gene_annotations_gtf $BASE_PATH/ensembl_files/Homo_sapiens.GRCh38.102.gtf --output_proteindb $BASE_PATH/proteins.fa
The proteins.fa
is empty as the end of the job and the logging stops at:
2021-01-26 12:03:44,357 - INFO - Populating features table and first-order relations: 3010594 features
2021-01-26 12:03:44,416 - INFO - Creating relations(parent) index
2021-01-26 12:03:52,637 - INFO - Creating relations(child) index
2021-01-26 12:03:57,638 - INFO - Creating features(featuretype) index
2021-01-26 12:04:05,730 - INFO - Creating features (seqid, start, end) index
2021-01-26 12:04:09,039 - INFO - Creating features (seqid, start, end, strand) index
2021-01-26 12:04:12,508 - INFO - Running ANALYSE features
Do you know what the problem may be? Thanks for the help!
Dear @OmerRonen, we have made many changes in the code. Could you please give it another try using latest version and report the details if you still get any error.
First obtain the required files and unzip them (only cdna and gtf are needed):
cdna_fasta_file
wget http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
gtf_file
wget http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.gtf.gz
gunzip Homo_sapiens.GRCh38.103.gtf.gz
Next, directly run the following command to generate a protein database for the given VCF
file:
pypgatk_cli.py vcf-to-proteindb --config_file pypgatk/config/ensembl_config.yaml --annotation_field_name '' --input_fasta _Homo_sapiens.GRCh38.cdna.all.fa_ --gene_annotations_gtf Homo_sapiens.GRCh38.103.gtf --vcf _path_to_vcf_file_ --output_proteindb vcf_db.fa
P.S. this assumes that your VCF file is aligned to GRCh38.
Hi, I'm trying to use your package to translate vcd files to mutate protein sequence, I don't quite understand how to generate the Fasta file is the a chance to clarify what the arguments to the following command should be:
Thanks!