chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
102 stars 37 forks source link

vcf2revcom.pl - wrong gtf output? #12

Closed NurislamSheih closed 5 years ago

NurislamSheih commented 5 years ago

Dear @cwnelson88,

Thank you a lot for your tool, it is great and really helpful. I calculated the pn/ps ratio using "+" genes from GTF file. But I also need to calculate metrics with "-" genes. I used your script vcf2revcom from Additional scripts window. And how I see, it gives reversed fasta file from end to beginning and the same with vcf file, but when I look on resulting gtf file it still the same as gtf before launching script and when I start calculation with snpgenie script, it gives the same result as with "+" genes but wrong pn/ps in that case.

Briefly, normal gtf and gtf_revcom are the same. What can I do with that? Maybe your script need something specific?

singing-scientist commented 5 years ago

Thanks very much for using this tool! Could you provide sample (e.g., smallest chromosome) FASTA/GTF/VCF files? I have long needed to write better scripts for creating the revcom files and can do that now if it will be useful. The old vcf2revcom is very limited in input format and may not give adequate error messages.

NurislamSheih commented 5 years ago

Oh, thank you!

Here is my files: https://drive.google.com/open?id=13fUkXqe6Ytx-V2hRid0aR2wCOdgx9lSm

singing-scientist commented 5 years ago

Thank you! Just to verify, you only need to convert the VCF file to reverse complement?

singing-scientist commented 5 years ago

@NurislamSheih also, please provide the length of the reference sequence (here, chr_4) so the script can calculate the revcom coordinates.

NurislamSheih commented 5 years ago

@cwnelson88 how I understand I need to convert all three files to reverse complement to calculate "-" genes. Or I wrong?

I should put the length of my reference in fasta file (like this >chr_4:14019908)? Or you mean something another?

singing-scientist commented 5 years ago

@NurislamSheih correct, all three files are needed. The reason I am confused is that the link you sent only contains the VCF file.

NurislamSheih commented 5 years ago

Wow, sorry! This link for three files https://drive.google.com/open?id=1tJNsePkNpF2Zn3ZEdHnsAO5x-neTVHlU

singing-scientist commented 5 years ago

@NurislamSheih apologies for the delay — my computer got smashed over the holidays. (I know, I know: 'the dog ate my homework'). For the VCF file, please find the script vcfformat_to_revcom.pl at the SNPGenie page. Try running it as follows:

vcfformat_to_revcom.pl <file_name>.vcf <seq_length>

In this case, it looks like your sequence length is 14019908.

Let me know if that works as expected for you, and if so, we can move on to revcom for GTF and FASTA.

NurislamSheih commented 5 years ago

Happy New Year @cwnelson88 , I hope that your computer is ok now!)

Thank you a lot, I will try your script.

NurislamSheih commented 5 years ago

@cwnelson88 , I used your script, it gives what I need. Thank you! So now, you will add options to get reverse complement gtf and fasta files?

singing-scientist commented 5 years ago

Wonderful, @NurislamSheih , I'm very glad it worked! Now please find the scripts gtf2revcom.pl and fasta2revcom.pl, to be run as follows:

gtf2revcom.pl <file_name>.gtf <seq_length> fasta2revcom.pl <seq>.fa

Please let me know if this worked and the results are accurate. If all three files have been successfully converted, you should be able to run SNPGenie for the minus strand using all 3 revcom files as input.

N.B.: I changed the name of the vcf script to 'vcf2revcom.pl' for consistency.

NurislamSheih commented 5 years ago

@cwnelson88 , sorry for delay! The script that makes reverse complement of fasta file is working perfectly, but gtf script don't make reverse complement. I tried different variants of my gtf files, and no one gives the proper result. Also script gives wrong number of products, it gust count number of lines (12137) (not proper number of products for chr_4 - 3226 genes)

Here I put different variant of gtf: https://drive.google.com/drive/folders/1tnFerD3Dldt714MoT5k3hWgEZuprPhTm?usp=sharing

singing-scientist commented 5 years ago

No problem, @NurislamSheih ! Sorry for the bugs. Please download the new version of the gtf2revcom.pl script and let me know if it works as expected. The coordinates should be start/stop with respect to the reverse (-) strand.

NurislamSheih commented 5 years ago

Dear Chase, now your script is working as I need))) Thank you a lot and sorry that I disturbed you. Now I can calculate pn/ps for "-" strand genes.

singing-scientist commented 5 years ago

Wonderful, @NurislamSheih ! I will close the thread now. If you have any future issues, please feel free to open a new thread.