Closed kellybarr closed 4 years ago
can you send the full command line?
ngsDist --verbose 1 --geno BUOW.Plate1-3.rmrelFinal.baq.maf.glf4.glf --probs --n_sites 2568136 --n_ind 170 --labels BUOW.Plate1-3.rmrelFinal.ind.bamlist --n_threads 8 --out all_birds
can you send me some sample input files (maybe just 10-20 sites)?
How did you generate the GLF? With angsd? If so, what command did you use?
angsd -bam "$j" -P 8 -ref BUOW.fasta \
  -uniqueOnly 1 -remove_bads 1 -baq 1 -C 50 \
  -trim 0 -maxDepth 500 -doGlf 4 \
  -minMapQ 20 -minQ 30 -minInd 40 -doCounts 1 \
  -GL 1 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 \
  -SNP_pval 1e-6 -doGeno 32 -doPost 1 -minMaf 0.05 -out "$outfile".maf.glf4
You are generating a file with 10 GLs per site, but ngsDist needs only 3 as input; try using -doGlf 3.
Hi, just to clarify: are you saying that -doGlf 3 is the only viable -doGlf option when using ANGSD to create the input for this program? I am confused because ngsLD takes files made with -doGlf 2 (e.g. https://github.com/fgvieira/ngsLD/issues/1), and the descriptions of the input for ngsLD and ngsDist are almost identical and do not mention any reason why -doGlf 2 should work in one and not the other.
ngsDist:
Input data: As input, ngsDist accepts either genotypes, genotype likelihoods (GL) or genotype posterior probabilities (GP). Genotypes must be input as gzipped TSV with one row per site and one column per individual n_sites.n_ind and genotypes coded as [-1, 0, 1, 2]. The file can have a header and an arbitrary number of columns preceding the actual data (that will all be ignored), much like the Beagle file format (link). As for GL and GP, ngsDist accepts both gzipped TSV and binary formats, but with 3 columns per individual 3.n_sites.n_ind and, in the case of binary, the GL/GP coded as doubles.
ngsLD:
As input, ngsLD accepts either genotypes, genotype likelihoods (GL) or genotype posterior probabilities (GP). Genotypes must be input as gzipped TSV with one row per site and one column per individual n_sites.n_ind and genotypes coded as [-1, 0, 1, 2]. As for GL and GP, ngsLD accepts both gzipped TSV and binary formats, but with 3 columns per individual 3.n_sites.n_ind and, in the case of binary, the GL/GP coded as doubles.
Both ngsLD and ngsDist accept as input files generated with angsd -doGlf option 2 or 3, but not 4 (which is the one you used).
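For the binary case, the input description quoted above suggests a quick sanity check: with 3 values per individual per site stored as doubles, an uncompressed file should hold 3 x n_sites x n_ind x 8 bytes. The sketch below assumes 8-byte doubles and a file with no header; the helper names are made up for illustration and are not part of ngsDist or ANGSD.

```python
import os

BYTES_PER_DOUBLE = 8  # assumption: each GL/GP value is an 8-byte double


def expected_glf_bytes(n_sites, n_ind, gl_per_ind=3):
    """Expected size of an uncompressed, header-less binary GL/GP file."""
    return n_sites * n_ind * gl_per_ind * BYTES_PER_DOUBLE


def check_glf_size(path, n_sites, n_ind):
    """Return (actual_bytes, expected_bytes) so a mismatch is easy to report."""
    return os.path.getsize(path), expected_glf_bytes(n_sites, n_ind)


if __name__ == "__main__":
    # Values taken from the ngsDist command line earlier in the thread.
    print(expected_glf_bytes(2568136, 170))
```

If the actual size differs from the expected one, the file probably does not contain 3 values per individual per site, or the --n_sites / --n_ind values do not match it.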
Thanks. Actually it was a different poster using option 4. I had been trying with option 2 (beagle file) and it was telling me that my geno file had the wrong number of sites (even though the number of sites was correct according to the number of lines in the unzipped beagle file), so I wondered if it just couldn't take the beagle format. I still have not gotten it to work with a beagle file (produced by ANGSD's -doGlf 2) because of this error about the wrong number of sites, but that's really a separate issue.
When you counted the lines of the beagle file, did you subtract the header? If so, can you then send me the beagle file you are using?
Ahhh, counting the header was the problem for me. Thanks!!
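For anyone hitting the same error, the value to pass as --n_sites is the number of data rows, not the raw line count. A minimal sketch of that count, assuming a gzipped Beagle file with a single header line (the function name is hypothetical):

```python
import gzip


def count_beagle_sites(path, has_header=True):
    """Count data rows in a gzipped Beagle file, leaving out the header line."""
    with gzip.open(path, "rt") as fh:
        # Ignore blank lines, e.g. a trailing newline at the end of the file.
        n_lines = sum(1 for line in fh if line.strip())
    if has_header and n_lines > 0:
        return n_lines - 1
    return n_lines
```

Calling count_beagle_sites on the file passed to --geno gives the number to use for --n_sites.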
No problem... :smile: I'll close it now, but feel free to reopen if the problem persists.
Today I observed that I get this weird output, only nan, when I run ngsDist with --tot_sites 1, as discussed in this issue. However, when I run the same code without the --tot_sites option, I get normal output. I can see that is not what the original poster was doing, but decided to comment here in case it's helpful to know this, since the effect was the same -- a matrix of nan's.
Thanks Teresa for reporting the issue.
Indeed there was an issue that has now been corrected. Just specify --evol_model 0 to get the raw p-distance (by default ngsDist outputs a log-transformed p-distance).
Thanks,
Finally ran the program successfully by closely following the example. It ran with the following output:
==> Analysis will be run in 14365 combinations
==> GZIP input file (never BINARY)
==> Reading labels
==> Reading genotype data
==> Analyzing full dataset...
==> Freeing memory...
Done!
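The number of combinations in the log is a useful check that the labels and --n_ind were read as intended. Assuming every unordered pair of individuals is compared once, 170 individuals (the --n_ind value in the command line quoted in this thread) give exactly the reported count:

```python
from math import comb


def n_pairs(n_ind):
    """Number of unordered pairs among n_ind individuals: n * (n - 1) / 2."""
    return comb(n_ind, 2)


print(n_pairs(170))  # 14365
```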
Then the resulting file looks like this:
sample_name 0.0000000000 -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
That's just the first lines, but they all look like this. Anyone have any clue why this is happening?
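To see how much of the matrix is affected, the output can be scanned for nan entries. The sketch below assumes each data row is a label followed by whitespace-separated distances, as in the line shown above; any line that does not fit that shape (for example, a leading count line) is skipped. The function name is hypothetical.

```python
import math


def rows_with_nan(lines):
    """Return (label, nan_count) for every matrix row that contains a nan."""
    bad = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or a single-value line; not a matrix row
        label, values = fields[0], fields[1:]
        try:
            numbers = [float(v) for v in values]  # float() accepts "nan" and "-nan"
        except ValueError:
            continue  # row does not match the assumed layout
        n_nan = sum(math.isnan(x) for x in numbers)
        if n_nan:
            bad.append((label, n_nan))
    return bad
```

Passing the lines of the --out file, for instance via open("all_birds"), lists the affected rows; if every row is reported, the problem is more likely in the input or the options than in individual samples.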