dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Refseq mapping - consens_se.py paralog filtering? #37

Closed: isaacovercast closed this issue 5 years ago

isaacovercast commented 8 years ago

consens_se.nfilter3() filters out loci whose number of alleles exceeds the ploidy, to protect against aligning paralogs. Do we want to continue this practice even when the aligned reads are known to be orthologous? If the answer is 'no', this introduces a bunch of questions. If the answer is 'yes', you can just close this issue, nbd.
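For anyone following along, here is a rough sketch of what an "alleles > ploidy" filter amounts to. It is not the actual code in consens_se.nfilter3(): the function names, the `min_count` threshold, and the assumption that each cluster is a list of equal-length aligned reads are all made up for illustration.

```python
from collections import Counter


def count_alleles(reads, min_count=2):
    """Count distinct haplotypes in one aligned cluster (hypothetical helper).

    reads: list of equal-length aligned read strings (an assumption).
    min_count: haplotypes seen fewer times than this are treated as
    likely sequencing errors and ignored (an assumed threshold).
    """
    if not reads:
        return 0
    # Columns where more than one base is observed.
    variable = [
        i for i in range(len(reads[0]))
        if len({read[i] for read in reads}) > 1
    ]
    if not variable:
        return 1  # every read is identical: a single allele
    # A haplotype is the string of bases at the variable columns.
    haplotypes = Counter("".join(read[i] for i in variable) for read in reads)
    return sum(1 for n in haplotypes.values() if n >= min_count)


def passes_ploidy_filter(reads, ploidy=2, min_count=2):
    """Return True if the cluster has no more alleles than the ploidy."""
    return count_alleles(reads, min_count) <= ploidy
```

With ploidy=2, a cluster carrying three well-supported haplotypes is rejected, and this holds whether the reads were clustered de novo or mapped to a reference.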

isaacovercast commented 8 years ago

Results of applying stringent paralog filtering to real lizard data: strict filtering of paralogs only affects ~2% of reads. This is ddRAD data from relatively deeply diverged lizards. YMMV, but it doesn't seem worth losing a ton of sleep over.

dereneaton commented 8 years ago

I've been thinking about this as well. Paralogs are likely a bigger problem in some organisms than others. The ploidy filter (consens_se.nfilter3()) is particularly useful in filtering out repetitive elements. Some thoughts:

1) We could flag paralog clusters instead of removing them. This way they would still be included in the across-sample clustering at step6 for denovo data, or in the mapped clusters when refmapping. In that case they could be excluded in step7 instead of step5. This consolidates all paralog-related filtering so that it occurs only at step7, using both max_shared_heterozygosity and ploidy...

2) However, I like ploidy being a filter that applies to individuals rather than across-sample clusters (loci), since it is possible you would want to apply a haploid filter to some individuals and a diploid filter to others in some analyses.
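To make the two options concrete, here is a minimal sketch of flagging with a per-sample ploidy and deferring the exclusion to a later step. Nothing here is ipyrad's API: the `Consensus` record, `flag_paralogs`, `loci_to_exclude`, and the `ploidy_by_sample` mapping are hypothetical names used only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Consensus:
    """One sample's consensus call at one locus (hypothetical record)."""
    sample: str
    locus: str
    nalleles: int
    paralog_flag: bool = False


def flag_paralogs(consensuses, ploidy_by_sample, default_ploidy=2):
    """Flag, rather than drop, calls with more alleles than the sample's ploidy.

    This stands in for the step5-style check; flagged calls stay in the data.
    """
    for cons in consensuses:
        ploidy = ploidy_by_sample.get(cons.sample, default_ploidy)
        cons.paralog_flag = cons.nalleles > ploidy
    return consensuses


def loci_to_exclude(consensuses):
    """Collect loci with any flagged sample, for a later step7-style filter."""
    return {cons.locus for cons in consensuses if cons.paralog_flag}
```

Because the ploidy is looked up per sample, a haploid filter can apply to some individuals and a diploid filter to others, while the decision to drop a locus is still made in one place.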

isaacovercast commented 8 years ago

Looking at consens_se again, I can see the paralog filtering is useful here. ploidy=2 will effectively filter out reference-mapped loci with multi-allelic sites, which we could market as a "feature" not a "bug" ;p

Considering that all downstream analyses I can think of want you to toss out multi-allelic sites anyway, and considering that the proportion of multi-allelic refmapped loci is likely relatively small, I think it's probably best to just leave the filtering how it is, for now.

isaacovercast commented 5 years ago

Old news.