bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0
88 stars 18 forks source link

--prop-n flag and ambiguous bases #207

Closed jennahamlin closed 2 years ago

jennahamlin commented 2 years ago

Versions: poppunk 2.4.0 pp-sketchlib 2.0.0

Command used:

poppunk --create-db --r-files input.tsv --min-k 13 --max-k 29 --k-step 2 --sketch-size 1000000 --min-kmer-count 140 --qc-filter continue --retain-failures --plot-fit 10 --prop-n 0.8 --output minKmer140_sketch1000000 --threads 16

Output returned:

/scicomp/home-pure/ptx4/miniconda3/envs/pp_env/lib/python3.9/site-packages/PopPUNK/sketchlib.py:778: RuntimeWarning: divide b
y zero encountered in log
  args=(klist, np.log(pairwise)),
Fitting k-mer curve failed: Residuals are not finite in the initial point.
With mash input [0.    ,0.0941,0.1367,0.1483,0.1559,0.1681,0.1767,0.1859,0.1952]
Check for low quality input genomes

Output returned:

C166    Ambiguous sequence too high
C167    Ambiguous sequence too high
C168    Ambiguous sequence too high

Describe the bug Unclear to me if the flag --prop-n is dealing with the ambiguous bases request as the qc report states 'Ambiguous sequence too high'. I set --prop-n to 80% but still gives same message when I do include --prop-n and when I don't. Additionally, I don't believe most of my data (n =226 out of 251) has that many ambiguous sequences.

johnlees commented 2 years ago

Is this with read data or assemblies?

(I'd also have a look at your k-mer range and plots to make sure they look ok, it seems like you might want to increase the min-k value.)

jennahamlin commented 2 years ago

Yes, these are reads and when I create the database using assemblies, the issue of ambiguous bases goes away, all samples pass QC filters, and the plot fits look okay.

I still find it odd that so many samples would be pruned if I had used read data.

Thank you

johnlees commented 2 years ago

Seems like we probably have some issue counting Ns in reads, I wonder if it might be that we're dividing by expected sequence length rather than total bases in the reads. I'll look into it

johnlees commented 2 years ago

Indeed, the denominator in this check is wrong for reads. sketchlib v2.0.1 & poppunk 2.5.0 onwards will save whether a sample was reads to the database, and if so the --prop-n cutoff will not be used.

For lower versions, the fix is to set --prop-n to a very large value, or turn the QC off.