Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0
46 stars 11 forks source link

PCAngsd detects one fewer sites than it's given #64

Open jtoy7 opened 1 year ago

jtoy7 commented 1 year ago

Hi. I'm running PCAngsd on a beagle file that has 6,711,262 sites. For some reason, whenever I run it, it only detects 6,711,261 sites. This becomes a larger problem when I trying to analyze the selection scan output files and identify sites. Because the sites_save function only outputs an unlabeled boolean, I have no idea which site is missing and therefore cannot match the 1s and 0s to corresponding site names. In otherwords, I don't know how to match a .sites list of length 6,711,261 to a list of the original 6,711,262 sites. Any idea why this is happening?

Rosemeis commented 1 year ago

Hi,

How was the beagle file created?

Best, Jonas

jtoy7 commented 1 year ago

Hi Jonas. The beagle file was originally created as an output of an ANGSD run. The only change I made to it was filtering for a list of unlinked SNPs after LD pruning with ngsld. The command I ran to do this was:

zgrep -Fw -f $BASEDIR/results/ngsld/ALL_maxd_1sd_unlinked_combined.id.beagleformat $BASEDIR/results/gl_calling/ALL_maxd_1sd.beagle.gz | gzip > $BASEDIR/results/ngsld/ALL_maxd_1sd_LDpruned.beagle.gz

where: ALL_maxd_1sd_unlinked_combined.id.beagleformat = list of LD-pruned SNPs ALL_maxd_1sd.beagle.gz = original ANGSD output beagle file

Rosemeis commented 1 year ago

I think the problem is that you do not include the header of the original ANGSD beagle file when filtering. Therefore when reading the beagle file in pcangsd, the first line is removed.

Hope that makes sense otherwise let me know.

Best, Jonas

jtoy7 commented 1 year ago

I actually just realized this myself. I was working off of examples that explicitly removed headers from the beagle file so I did the same, but I see that your example files have headers. Did past versions require headers to be removed? Thanks for the help. I will run it again with headers and check back in.