dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Problem with [min_samples_locus] in reference analysis #398

Closed: edgardomortiz closed this issue 4 years ago

edgardomortiz commented 4 years ago

Hello, I performed a ddRAD reference analysis with v.0.9.42. Setting min_samples_locus=4 produces loci with at least 2 samples, setting it to 8 produces loci with at least 6 samples, and so on: the minimum is always 2 lower than requested. To get loci with at least 4 samples I therefore used min_samples_locus=6. Is the _stats.txt file correct despite this behavior?

Edgardo

isaacovercast commented 4 years ago

Hi Edgardo, how are you determining sample depth for loci, if not from the stats file? Which file are you looking at: the .loci file or the vcf? -isaac

(Sent by email in reply to Edgardo M. Ortiz's message of Wed, Feb 26, 2020, 4:19 PM.)

edgardomortiz commented 4 years ago

Hi Isaac, I examined the .vcf, .alleles, and .loci files. I haven't checked the other formats so far.
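A check like the one described here can be sketched in a few lines of Python. This is only an illustration, not code from ipyrad: the function name `samples_per_locus` and the toy data are hypothetical, and it assumes the usual .loci layout in which each locus is a run of "name sequence" rows closed by a line starting with `//`, with every non-blank row before that separator counted as one sample.

```python
from collections import Counter


def samples_per_locus(lines):
    """Return the number of sample rows in each locus of a .loci file.

    Assumption: a line starting with '//' closes a locus, and every
    non-blank line before it is one sample row.
    """
    counts = []
    current = 0
    for line in lines:
        if line.startswith("//"):
            counts.append(current)
            current = 0
        elif line.strip():
            current += 1
    return counts


# Toy data standing in for a real .loci file (hypothetical content).
toy = """\
sampleA     ACGTACGT
sampleB     ACGTACGA
//                 *|0|
sampleA     TTGACCAA
sampleB     TTGACCAA
sampleC     TTGACCAT
sampleD     TTGACCAA
//                 -|1|
""".splitlines()

counts = samples_per_locus(toy)
print("fewest samples in any locus:", min(counts))
print("loci per sample count:", dict(Counter(counts)))
```

For a real file, passing an open file handle instead of `toy` works the same way, and comparing `min(counts)` against the min_samples_locus value used in the params file shows whether the filter was applied as requested.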

brpark29 commented 4 years ago

Sorry to chime in, but I've seen sites with fewer than min_samples_locus samples make their way into the vcf files.

From what I gather, these lower-coverage sites come from loci that do have the correct sample coverage, but they sit in a messy part of the locus sequence (e.g., towards the ends). I haven't looked into this thoroughly; I just filter these sites out with vcftools.
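The same kind of site-level filter can also be sketched in plain Python, which may help when checking what is being dropped. This is a minimal sketch under stated assumptions, not the vcftools procedure used above: the function name `filter_vcf_by_sample_count` and the toy records are hypothetical, the file is assumed to be tab-delimited with GT as the first FORMAT subfield, and any genotype containing `.` is treated as missing.

```python
def filter_vcf_by_sample_count(lines, min_samples):
    """Yield header lines plus the sites genotyped in >= min_samples samples."""
    for line in lines:
        if line.startswith("#"):
            # Keep all header lines unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        genotypes = fields[9:]  # sample columns follow the 9 fixed columns
        # Assumption: GT is the first colon-separated subfield, and a
        # genotype containing '.' (e.g. './.') counts as missing.
        called = sum(1 for g in genotypes if "." not in g.split(":")[0])
        if called >= min_samples:
            yield line


# Toy records standing in for a real VCF (hypothetical content).
toy_vcf = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3\ts4",
    "chr1\t10\t.\tA\tG\t13\tPASS\tNS=4\tGT\t0/0\t0/1\t1/1\t0/0",
    "chr1\t95\t.\tC\tT\t13\tPASS\tNS=2\tGT\t0/0\t./.\t./.\t0/1",
]

kept = list(filter_vcf_by_sample_count(toy_vcf, min_samples=4))
print(len(kept) - 2, "site(s) kept")
```

Counting called genotypes per site, rather than trusting a per-locus threshold, is what catches sites near the ends of a locus where some samples have no data.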

edgardomortiz commented 4 years ago

@brpark29 Yes, I have observed that as well. I think this -2 difference is more systematic, though, especially when looking at the .loci and .alleles files. It may be related to this bit of code: https://github.com/dereneaton/ipyrad/blob/958b5d73e489a00bbb8f31cf576a0780223dcd1c/ipyrad/assemble/write_outputs.py#L644-L647 Perhaps the solution is: self.minsamp += 1 ?

isaacovercast commented 4 years ago

@edgardomortiz That's exactly the problem. I fixed it in 11e34c3 and will push a new tag so the bioconda package gets updated. Thanks for reporting AND figuring it out!