fgvieira / ngsDist

Estimation of pairwise distances under a probabilistic framework
GNU General Public License v3.0

Weird output, only nan #11

Closed kellybarr closed 4 years ago

kellybarr commented 4 years ago

Finally ran the program successfully by closely following the example. It ran with the following output:

==> Analysis will be run in 14365 combinations
==> GZIP input file (never BINARY)
==> Reading labels
==> Reading genotype data
==> Analyzing full dataset...
==> Freeing memory...
Done!

Then the resulting file looks like this:

sample_name 0.0000000000 -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan

That's just the first lines, but they all look like this. Anyone have any clue why this is happening?

fgvieira commented 4 years ago

can you send the full command line?

kellybarr commented 4 years ago

ngsDist --verbose 1 --geno BUOW.Plate1-3.rmrelFinal.baq.maf.glf4.glf --probs --n_sites 2568136 --n_ind 170 --labels BUOW.Plate1-3.rmrelFinal.ind.bamlist --n_threads 8 --out all_birds
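
For readers checking their own runs: the 14365 combinations reported in the log appear to be the number of unordered pairs among the 170 individuals passed via --n_ind. A minimal sketch of that arithmetic, where `n_pairs` is a hypothetical helper name and not part of ngsDist:

```shell
# Number of unordered pairs among n individuals: n * (n - 1) / 2.
# n_pairs is a hypothetical helper, not an ngsDist command.
n_pairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

n_pairs 170   # 170 is the value given to --n_ind; prints 14365
```

If the number in the log does not match the pair count for your --n_ind value, the option was probably not parsed the way you intended.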

fgvieira commented 4 years ago

can you send me some sample input files (maybe just 10-20 sites)?

kellybarr commented 4 years ago

sample.names.zip sample.glf.zip

fgvieira commented 4 years ago

How did you generate the GLF? With angsd? If so, what command did you use?

kellybarr commented 4 years ago

angsd -bam "$j" -P 8 -ref BUOW.fasta \
    -uniqueOnly 1 -remove_bads 1 -baq 1 -C 50 \
    -trim 0 -maxDepth 500 -doGlf 4 \
    -minMapQ 20 -minQ 30 -minInd 40 -doCounts 1 \
    -GL 1 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 \
    -SNP_pval 1e-6 -doGeno 32 -doPost 1 -minMaf 0.05 -out "$outfile".maf.glf4

fgvieira commented 4 years ago

You are generating a file with 10 GLs per site, but ngsDist needs only 3 as input; try using -doGlf 3.

TeresaPegan commented 4 years ago

Hi, just to clarify, are you saying that -doGlf 3 is the only viable -doGlf option when using ANGSD to create the input for this program? I am confused because ngsLD takes files made with -doGlf 2 (e.g. https://github.com/fgvieira/ngsLD/issues/1), and the descriptions of input for ngsLD and ngsDist are almost identical and do not mention a reason why -doGlf 2 should work in one and not the other.

ngsDist:

Input data: As input, ngsDist accepts both genotypes, genotype likelihoods (GL) or genotype posterior probabilities (GP). Genotypes must be input as gzipped TSV with one row per site and one column per individual (n_sites.n_ind) and genotypes coded as [-1, 0, 1, 2]. The file can have a header and an arbitrary number of columns preceding the actual data (that will all be ignored), much like the Beagle file format (link). As for GL and GP, ngsDist accepts both gzipped TSV and binary formats, but with 3 columns per individual (3.n_sites.n_ind) and, in the case of binary, the GL/GP coded as doubles.

ngsLD:

As input, ngsLD accepts both genotypes, genotype likelihoods (GL) or genotype posterior probabilities (GP). Genotypes must be input as gzipped TSV with one row per site and one column per individual (n_sites.n_ind) and genotypes coded as [-1, 0, 1, 2]. As for GL and GP, ngsLD accepts both gzipped TSV and binary formats, but with 3 columns per individual (3.n_sites.n_ind) and, in the case of binary, the GL/GP coded as doubles.

fgvieira commented 4 years ago

Both ngsLD and ngsDist accept input files generated with angsd -doGlf 2 or 3, but not 4 (which is the one you used).
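
A rough sanity check for binary input follows from the input description quoted above: with 3 values per individual per site, each coded as a double, the decompressed file should hold 3 * n_sites * n_ind doubles. The sketch below assumes 8-byte doubles and a headerless file, neither of which is confirmed in this thread; `expected_glf_bytes` and `actual_bytes` are hypothetical helper names.

```shell
# Expected size in bytes of a binary GL/GP file with 3 doubles per
# individual per site. Assumes 8-byte doubles and no header (assumptions).
expected_glf_bytes() {
    n_sites=$1
    n_ind=$2
    echo $(( 3 * 8 * n_sites * n_ind ))
}

# Size in bytes of the file contents, decompressing first if the
# file name ends in .gz.
actual_bytes() {
    case "$1" in
        *.gz) gzip -dc "$1" | wc -c | tr -d ' ' ;;
        *)    wc -c < "$1" | tr -d ' ' ;;
    esac
}

# Usage (input.glf.gz is a hypothetical file name):
#   [ "$(actual_bytes input.glf.gz)" -eq "$(expected_glf_bytes 2568136 170)" ] \
#       && echo "size matches" || echo "size mismatch"
```

A file with a different number of values per site, or a text file, would not match the expected size, which is a hint that the wrong -doGlf option was used.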

TeresaPegan commented 4 years ago

Thanks. Actually it was a different poster using option 4. I had been trying with option 2 (beagle file) and it was telling me that my geno file had the wrong number of sites (even though the number of sites was correct according to the number of lines in the unzipped beagle file), so I wondered if it just couldn't take the beagle format. I still have not gotten it to work with a beagle file (produced by ANGSD's -doGlf 2) because of this error about the wrong number of sites, but that's really a separate issue.

fgvieira commented 4 years ago

When you counted the lines of the beagle file, did you subtract the header? If so, can you then send me the beagle file you are using?

TeresaPegan commented 4 years ago

Ahhh, counting the header was the problem for me. Thanks!!
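
For anyone hitting the same error, the site count for --n_sites can be taken from the Beagle file with the header line excluded. A minimal sketch, assuming the file is gzip-compressed and has exactly one header line; `count_beagle_sites` is a hypothetical helper name:

```shell
# Count data lines (sites) in a gzipped Beagle file, skipping the
# first line, which is assumed to be the single header line.
count_beagle_sites() {
    gzip -dc "$1" | tail -n +2 | wc -l | tr -d ' '
}

# Usage (input.beagle.gz is a hypothetical file name):
#   count_beagle_sites input.beagle.gz
```

A plain line count of the unzipped file is one higher than this value, which is the off-by-one described above.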

fgvieira commented 4 years ago

No problem... :smile: I'll close it now, but feel free to reopen if the problem persists.

TeresaPegan commented 3 years ago

Today I observed that I get this weird output, only nan, when I run ngsDist with --tot_sites 1, as discussed in this issue. However, when I run the same code without the --tot_sites option, I get normal output. I can see that is not what the original poster was doing, but decided to comment here in case it's helpful to know this, since the effect was the same -- a matrix of nan's.

fgvieira commented 3 years ago

Thanks Teresa for reporting the issue.

Indeed, there was an issue that has now been corrected. Just specify --evol_model 0 to get the raw p-distance (by default, ngsDist outputs a log-transformed p-distance).

Thanks,