bigbio / py-pgatk

Python tools for proteogenomics analysis toolkit
Apache License 2.0
10 stars 11 forks source link

Fasta input for vcf-to-proteindb #29

Closed OmerRonen closed 3 years ago

OmerRonen commented 3 years ago

Hi, I'm trying to use your package to translate vcd files to mutate protein sequence, I don't quite understand how to generate the Fasta file is the a chance to clarify what the arguments to the following command should be:

gffread -F -w input_fasta.fa -g genome.fa gene_annotations_gtf

Thanks!

husensofteng commented 3 years ago

Thank you for using py-pgatk Omer.

First, the mutations in theVCF file have to be annotated using VEP, for instance.

Next, the gene set that was used for the annotation in the first step should be convrted from GTF to FASTA format using gffread.

Thus the input to the gffread would be: -w: some name to write the outputfile to -g: genome fasta file (e.g. GRCh38.fa) the last parameter is path to the gtf file that was used to annotate the VCF file, e.g GRCh38.102.gtf

Finally, the GTF and the FASTA files should be given as input to the vcf-to-proteindb command along with the VCF to genrate the mutated protein database.

Please let me know if this does not work for you. /Husen

ypriverol commented 3 years ago

@husensofteng can you put the commands needs to perform all the tasks?

OmerRonen commented 3 years ago

Hi @husensofteng, thanks for the answer. When I run the command using the suggested files

gffread -w out.fasta -g Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz ensembl_files/Homo_sapiens.GRCh38.102.gtf.gz

I get the following error:

Error: sequence lines in a FASTA record must have the same length!

Am I doing something wrong?

husensofteng commented 3 years ago

Sorry for the delay to reply

It seems your FASTA file does not follow the standard format. Could you paste here a few lines from you FASTA file, also what is the source of the FASTA file?

OmerRonen commented 3 years ago

Hi @husensofteng - thank for answering I used the fasta file you've suggested in https://github.com/bigbio/py-pgatk/issues/29#issuecomment-762933914

-g: genome fasta file (e.g. GRCh38.fa)

husensofteng commented 3 years ago

gffread exprects the files to be extracted. Could you extract/unzip the files first and try again

e.g. gunzip *.gz

OmerRonen commented 3 years ago

@husensofteng thanks for the help, I now have another problem. I try to run

pypgatk_cli vcf-to-proteindb --vep_annotated_vcf $BASE_PATH/file_annotated.vcf --input_fasta $BASE_PATH/file.fasta --gene_annotations_gtf  $BASE_PATH/ensembl_files/Homo_sapiens.GRCh38.102.gtf --output_proteindb $BASE_PATH/proteins.fa

The proteins.fa is empty as the end of the job and the logging stops at:

2021-01-26 12:03:44,357 - INFO - Populating features table and first-order relations: 3010594 features
2021-01-26 12:03:44,416 - INFO - Creating relations(parent) index
2021-01-26 12:03:52,637 - INFO - Creating relations(child) index
2021-01-26 12:03:57,638 - INFO - Creating features(featuretype) index
2021-01-26 12:04:05,730 - INFO - Creating features (seqid, start, end) index
2021-01-26 12:04:09,039 - INFO - Creating features (seqid, start, end, strand) index
2021-01-26 12:04:12,508 - INFO - Running ANALYSE features

Do you know what the problem may be? Thanks for the help!

husensofteng commented 3 years ago

Dear @OmerRonen, we have made many changes in the code. Could you please give it another try using latest version and report the details if you still get any error.

First obtain the required files and unzip them (only cdna and gtf are needed):

Next, directly run the following command to generate a protein database for the given VCF file:

P.S. this assumes that your VCF file is aligned to GRCh38.