Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0
46 stars 11 forks source link

dropping samples function? #31

Open devonorourke opened 4 years ago

devonorourke commented 4 years ago

Hi, Thanks for developing and supporting a great program. I am running V0.982 and noticed there was an option called filter that I supposed might drop a list of samples set up in an associated file. Perhaps that's not the intention. After successfully running the pcangsd.py script with my full set of samples, there were three particular samples that appeared as outliers. I wanted to re-run the program after dropping these three samples and see if any of the resulting admixture and covariance data might change (and I thought that the filter function might be the way to do that). I created a file called "droplist" and ran this command:

python $PCANGSD -beagle $BEAGLEFILE -threads 8 -o LUvcfSnps_drop1 -admix -inbreed 1 -filter $DROPFILE

where the $DROPFILE file contained this information:

tcap006
tcap027
tcap188

These three strings represent three of the sample names in the first line of the beagle file.

Alas, the particular error message I received in this case was as follows:

PCAngsd 0.982
Using 8 thread(s)

Traceback (most recent call last):
  File "/projects/foster_lab/pkgs/pcangsd/pcangsd.py", line 122, in <module>
    filterI = np.load(args.filter).astype(bool)
  File "/packages/python/anaconda/latest/lib/python2.7/site-packages/numpy/lib/npyio.py", line 416, in load
    "Failed to interpret file %s as a pickle" % repr(file))
IOError: Failed to interpret file '/scratch/dro49/myluwork/popgen/angsd/mylu/pcAngsd/dropfile1' as a pickle

Thanks for any information you can provide. Devon

Rosemeis commented 4 years ago

Hi Devon,

Sorry that the description of the feature is a bit cryptic. But you will need to provide a integer numpy array, where individuals you want to keep is set as 1 and 0 otherwise. You will need to use the numpy.save command in python to save the array before giving it to PCAngsd. I will try to make it easier in the next update.

Best, Jonas

devonorourke commented 4 years ago

Interesting. Just to clarify, if I have a .beagle file with 100 samples, it would be 303 columns (marker, allele1, allele 2, ind0, ind0, ind0 ... ind100, ind100, ind100), right? Would the resulting integer numpy array you're talking about then be just 300 columns long?

I ended up just using zcat my.beagle.gz | cut -f ... and spliced out the columns I didn't want. Not ideal and super prone to silly mistakes that I tend to make, but the program ran with that modified file just fine.

As a related feature request, would it be possible for the resulting .cov and .npy files to have row names that correspond to the sample inputs? Perhaps the output matrices don't allow for that kind of specificity.

Thanks anyhow!

Rosemeis commented 4 years ago

Yes exactly, it would have to be a integer numpy array of shape 300.

Feel free to reach out again! :-)

/Jonas