chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
102 stars 37 forks source link

vcf2revcom.pl - No modifications to the fasta and gtf file #14

Closed lchallagundla closed 5 years ago

lchallagundla commented 5 years ago

Hello, I'm trying to run SNPgenie-vcf2revcom.pl on a vcf file, the corresponding reference fasta and gtf file as per the instructions on your github page. However, just running the script with the three input files throws out an error as below suggesting the input should be just the SNP report (vcf in this case) and the corresponding total number of sites in the fasta file (See Below). If I am understanding correctly you are just modifying the SNP report and not the gff or the fasta for the negative strand? Did you modify your SNPgenie script to account for the negative strand or am I missing something here. I would really appreciate it if you could provide any clarification to this.

mac0167359:SNPGenie$ perl vcf2revcom.pl ref.fa ref.gtf All2.vcf

WARNING: The SNPGenie script vcfformat1_to_revcom needs exactly 2 arguments, in this order:

(1) A '+' strand SNP report in VCF format 1 (SNPGenie descriptions).

(2) The exact length of the sequence in the FASTA file.

For example: vcfformat1_to_revcom.pl my_snp_report.vcf 248956422

mac0167359:SNPGenie $ perl vcf2revcom.pl All2.vcf 2872915 Provided seq length is 2872915

Converting All2.vcf to reverse complement format...

REVERSE COMPLEMENT VCF FILE has been written to:

VCF: All2_revcom.vcf

singing-scientist commented 5 years ago

Thanks for using SNPGenie, and apologies that I lost track of this! Those instructions refer to the script in the other repository. For the vc2revcom script at the SNPGenie page, please run as follows:

vcf2revcom.pl <file_name>.vcf <seq_length>

Thus, you will also need to know the total length of your sequence as well. Let me know if that works as expected for you, and if so, we can move on to revcom for GTF and FASTA.

Please let me know if this works! If not, please share a sample file so I may troubleshoot.

lchallagundla commented 5 years ago

I did convert the Snp report as per the instructions with the length of the fasta file. My question still relates to how are you reverse complementing the gtf and reference fasta. Thanks.

singing-scientist commented 5 years ago

Coordinates can be reverse-complemented simply as length - position + 1, and the reverse complement of FASTA files can be found by reversing the sequence and replacing A>T, C>G, G>C, and T>A. I now provide scripts to do this, which can be run as follows — apologies that I have not yet had time to document them.

gtf2revcom.pl <file_name>.gtf <seq_length> fasta2revcom.pl <seq>.fa

Let me know if those work!

lchallagundla commented 5 years ago

Great. I will try these and let you know if they work! Thanks.