Closed biolevol closed 4 weeks ago
Thanks for the positive feedback, glad you like ipyrad! As for your question, in fact we used to write out the full vcf including invariant sites by default. This file got really really huge, typically, so eventually we switched to only writing out variable sites by default, and allowing including all sites as an option (this was pre v0.9). In version 0.9 we redid the output writing code, and never reimplemented the "full vcf" output as an option, because we never used it personally, and we never heard of anybody using it, so we didn't prioritize reimplementing this.
I will leave this issue open, but honestly I don't see this getting done any time soon, unless one or the other of Deren or myself needs it for a project. It would be a medium sized job, and I don't have many spare cycles at the moment.
Out of curiosity, what is the use case you're interested in? If it seems like a general use case that more people would take advantage of then this would bump the priority a bit.
Dear Isaac, Thank you so much for this prompt reply! We would like to merge our ddRAD obtained vcf with another vcf from WGS using aDNA. Due to the nature of the aDNA there is a plethora of missing sites and similarly with ddRAD we have only a fraction of the genome represented. Thus, using the invariant sites we were hoping to keep more sites across the genome that are found in the two datasets. Similarly, having the invariant sites would allow us to truly distinguish missing sites from invariant sites and fine-tune our tajima's D, fst and pi nucleotide variation estimations (pixy is a great tool for this kind of analysis but requires invariant sites). Thank you again for your time!
Dear Isaac,
Could you please share with us the part of the code that would require changing/adjusting for the full vcf option and we could try to reimplement the full vcf, if that's ok with you. Thank you very much in advance.
Yes, of course. In ipyrad/assemble/write_outputs.py there is the def build_vcf(data, chunksize=1000):
function that begins on line 2351. On line 2261 it opens this file data.snps_database
which contains only variable sites. The file you would have to open and pull data out of is data.seqs_database
, as it contains the full sequence data. I imagine this will involve a substantial amount of work.
Another option is to go back to the 0.7.30 minor version, which had a substantially simpler process for writing the vcf output, and which also (I believe) still had the ability to write out the full vcf file if you passed it the V
argument for the output_formats parameter.
Good luck, and let me know if you have any questions.
Thank you very much! I will keep you posted for any updates.
Here is a workaround to get the full vcf file from an ipyrad loci file (this can be useful for pixy, which I guess requires vcf files with all sites including monomorphic).
Download this file and rename it to loci2vcf.py
loci2vcf.py.txt
Run the python script like this:
python loci2vcf.py wat.loci out.vcf samps.txt False
.loci
filesamps.txt
file needs to be a file containing a list of all sample names in the vcf, one sample name per row.True
or False
(defaults to True). True means write all sites, False means only write snps.I tested this using the most recent .loci file from the simulated data using ipyrad v0.9.93 and python3.10. It should work for you but no promises.
closing this issue because it seems resolved
Hello, Let me first thank you for this great tool. I was wondering if there is any way I could obtain all invariant sites that are actually covered by the ddRAD reads in our final vcf file, something similar to "--include-non-variant-sites/ -all-sites" from the GATK's GenotypeGVCFs argument. Thank you very much in advance!