Recover all sites (invariant sites) in vcf

dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets

http://ipyrad.readthedocs.io

GNU General Public License v3.0

72 stars 40 forks source link

Recover all sites (invariant sites) in vcf #479

Closed biolevol closed 4 weeks ago

biolevol commented 2 years ago

Hello, Let me first thank you for this great tool. I was wondering if there is any way I could obtain all invariant sites that are actually covered by the ddRAD reads in our final vcf file, something similar to "--include-non-variant-sites/ -all-sites" from the GATK's GenotypeGVCFs argument. Thank you very much in advance!

isaacovercast commented 2 years ago

Thanks for the positive feedback, glad you like ipyrad! As for your question, in fact we used to write out the full vcf including invariant sites by default. This file got really really huge, typically, so eventually we switched to only writing out variable sites by default, and allowing including all sites as an option (this was pre v0.9). In version 0.9 we redid the output writing code, and never reimplemented the "full vcf" output as an option, because we never used it personally, and we never heard of anybody using it, so we didn't prioritize reimplementing this.

I will leave this issue open, but honestly I don't see this getting done any time soon, unless one or the other of Deren or myself needs it for a project. It would be a medium sized job, and I don't have many spare cycles at the moment.

Out of curiosity, what is the use case you're interested in? If it seems like a general use case that more people would take advantage of then this would bump the priority a bit.

biolevol commented 2 years ago

Dear Isaac, Thank you so much for this prompt reply! We would like to merge our ddRAD obtained vcf with another vcf from WGS using aDNA. Due to the nature of the aDNA there is a plethora of missing sites and similarly with ddRAD we have only a fraction of the genome represented. Thus, using the invariant sites we were hoping to keep more sites across the genome that are found in the two datasets. Similarly, having the invariant sites would allow us to truly distinguish missing sites from invariant sites and fine-tune our tajima's D, fst and pi nucleotide variation estimations (pixy is a great tool for this kind of analysis but requires invariant sites). Thank you again for your time!

biolevol commented 2 years ago

Dear Isaac,

Could you please share with us the part of the code that would require changing/adjusting for the full vcf option and we could try to reimplement the full vcf, if that's ok with you. Thank you very much in advance.

isaacovercast commented 2 years ago

Yes, of course. In ipyrad/assemble/write_outputs.py there is the def build_vcf(data, chunksize=1000): function that begins on line 2351. On line 2261 it opens this file data.snps_database which contains only variable sites. The file you would have to open and pull data out of is data.seqs_database, as it contains the full sequence data. I imagine this will involve a substantial amount of work.

Another option is to go back to the 0.7.30 minor version, which had a substantially simpler process for writing the vcf output, and which also (I believe) still had the ability to write out the full vcf file if you passed it the V argument for the output_formats parameter.

Good luck, and let me know if you have any questions.

biolevol commented 2 years ago

Thank you very much! I will keep you posted for any updates.

isaacovercast commented 1 year ago

Here is a workaround to get the full vcf file from an ipyrad loci file (this can be useful for pixy, which I guess requires vcf files with all sites including monomorphic).

Download this file and rename it to loci2vcf.py loci2vcf.py.txt

Run the python script like this:

python loci2vcf.py wat.loci out.vcf samps.txt False

wat.loci should be replaced with the full path to your .loci file
out.vcf is the name of the output vcf file to write, you can call this whatever you want
The samps.txt file needs to be a file containing a list of all sample names in the vcf, one sample name per row.
The last argument tells whether to write the full vcf (with monomorphic sites). It should be True or False (defaults to True). True means write all sites, False means only write snps.

I tested this using the most recent .loci file from the simulated data using ipyrad v0.9.93 and python3.10. It should work for you but no promises.

isaacovercast commented 4 weeks ago

closing this issue because it seems resolved