Poppunk ValueError (sklearn) when generating database

BioMinnie commented 3 years ago

Versions

Poppunk version 2.4.0 pp-sketchlib version 1.7.1 scikit-learn verson 0.24.2 python version 3.9.4

Command used and output returned

poppunk --create-db --output neisseria_db_v2 --r-files input_isolates.tab --threads 12

Data input example:
isolate1      /isolate1/isolate1_R1_001.fastq.gz   /isolate1/isolate1_R2_001.fastq.gz
isolate2      /isolate2/isolate2_R1_001.fastq.gz   /isolate2/isolate2_R2_001.fastq.gz
isolate3      /isolate3/isolate3_R1_001.fastq.gz   /isolate3/isolate3_R2_001.fastq.gz

Describe the bug

Program returns the below error, the qcreport.txt file is empty and the database isnt correct (ie if I use the created database to generate lineages using poppunk --fit-model, all the isolates collapse into the same cluster (despite being global, diverse isolates from many diverse lineages). Was just wondering if/how this can be fixed?


PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
        (with backend: sketchlib v1.7.1
         sketchlib: /home/mashcroft/anaconda3/envs/poppunk/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 12 threads
Mode: Building new database from input sequences
Sketching 5182 genomes using 12 thread(s)
Writing sketches to file
Calculating random match chances using Monte Carlo
Calculating distances using 12 thread(s)
Selected type isolate for distance QC is isolate3
/home/mashcroft/anaconda3/envs/poppunk/lib/python3.9/site-packages/PopPUNK/plot.py:57: RuntimeWarning: invalid value encountered in true_divide
  X /= scale
Traceback (most recent call last):
  File "/home/mashcroft/anaconda3/envs/poppunk/bin/poppunk", line 11, in <module>
    sys.exit(main())
  File "/home/mashcroft/anaconda3/envs/poppunk/lib/python3.9/site-packages/PopPUNK/__main__.py", line 337, in main
    plot_scatter(distMat,
  File "/home/mashcroft/anaconda3/envs/poppunk/lib/python3.9/site-packages/PopPUNK/plot.py", line 66, in plot_scatter
    kde.fit(X)
  File "/home/mashcroft/anaconda3/envs/poppunk/lib/python3.9/site-packages/sklearn/neighbors/_kde.py", line 165, in fit
    X = self._validate_data(X, order='C', dtype=DTYPE)
  File "/home/mashcroft/anaconda3/envs/poppunk/lib/python3.9/site-packages/sklearn/base.py", line 421, in _validate_data
    X = check_array(X, **check_params)
  File "/home/mashcroft/anaconda3/envs/poppunk/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/mashcroft/anaconda3/envs/poppunk/lib/python3.9/site-packages/sklearn/utils/validation.py", line 720, in check_array
    _assert_all_finite(array,
  File "/home/mashcroft/anaconda3/envs/poppunk/lib/python3.9/site-packages/sklearn/utils/validation.py", line 103, in _assert_all_finite
    raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

johnlees commented 3 years ago

Looks like there's an issue with the distance estimation. I wonder if it's something going wrong with the reads, but haven't seen this issue before. Is your .h5 file small enough to share with me? Alternatively, could you run:

poppunk_extract_distances.py --distances neisseria_db_v2/neisseria_db_v2.dists --output pp_dists.tmp
awk '$3 >=1 || $4 >= 1 {print $0}' pp_dists.tmp

to try and find any large distances (or in that output see if any look odd, are NaN, infinity etc)

BioMinnie commented 3 years ago

The .h5 file is 477 Mb. I ran the above and there was no output. However if I less the pp_dists.tmp file, it looks like:

isolate1    isolate1    0.0    0.0
isolate1    isolate2    0.0    0.0
isolate1    isolate3    0.0    0.0
isolate1    isolate4    0.0    0.0
.....

So it looks like it isnt generating the distances properly? I have used the reads previously to generate a phylogenetic tree (Snippy), so I dont think there is anything wrong with the read files

johnlees commented 3 years ago

Right, definitely some issue with the poppunk sketch in that case. Would you be able to sketch just the first five of your isolates, and send me the .h5 file? I'll see if I can find anything wrong.

It could be filtering out k-mers at too high of a frequency. So you could try adding --min-kmer-count 1 --exact-count to your options (again maybe just with the first five, to speed things up).

And finally, if none of this helps, are any of the reads publicly available anywhere?

BioMinnie commented 3 years ago

Below is a link to the .h5 file for just 5 isolates

(https://www.dropbox.com/s/8553g2z9enpdaqd/test_db.h5?dl=0)

I also tried running the command

poppunk --create-db --output test_db_2 --r-files test_isolates_location.tab --threads 12 --min-kmer-count 1 --exact-count

However, I got the exact same error message (as above) both times and both .h5 files were the same size

johnlees commented 3 years ago

Hi,

First I had a look at your database with:

scripts/poppunk_db_info.py --list-samples test_db.h5
PopPUNK database:       test_db.h5
Sketch version:         62027981c4bfe35935d52efabb4e3b2c62317c35
Number of samples:      5
K-mer sizes:            13,17,21,25,29
Sketch size:            9984
Contains random matches:    True
Codon phased seeds:     False

name    base_frequencies    length  missing_bases
AUSMDU00055504  A:0.240,C:0.261,G:0.264,T:0.235 36333208    64236
AUSMDU00055505  A:0.237,C:0.264,G:0.266,T:0.233 26938217    46510
AUSMDU00055506  A:0.243,C:0.259,G:0.260,T:0.239 37912012    63856
AUSMDU00055507  A:0.241,C:0.263,G:0.263,T:0.233 33916172    36518
AUSMDU00055508  A:0.238,C:0.264,G:0.265,T:0.233 25795809    58492

Which all looked fine. I also had a look at the sketches themselves with h5dump to make sure they weren't identical – this also looked ok. Then I had a look at the distances (Jaccard and core/acccessory) using poppunk_sketch directly:

poppunk_sketch --query --ref-db test_db --query-db test_db --read-k --print --jaccard
Query   Reference   13  17  21  25  29
AUSMDU00055504  AUSMDU00055505  0.0 0.13672271  0.107672274 0.10076122  0.08994391
AUSMDU00055504  AUSMDU00055506  0.0043619485    0.13861641  0.109375    0.106770836 0.095552884
AUSMDU00055504  AUSMDU00055507  0.0 0.1488614   0.119391024 0.10947516  0.10046074
AUSMDU00055504  AUSMDU00055508  0.0077014174    0.1477672   0.110276446 0.103665866 0.093549676
AUSMDU00055505  AUSMDU00055506  0.0 0.13302922  0.09915865  0.0970552860.085737176
AUSMDU00055505  AUSMDU00055507  0.0 0.1296944   0.107672274 0.09134615  0.088141024
AUSMDU00055505  AUSMDU00055508  0.0 0.1424671   0.1088742   0.10316507  0.09054487
AUSMDU00055506  AUSMDU00055507  0.0 0.1369099   0.10506811  0.09885817  0.08954327
AUSMDU00055506  AUSMDU00055508  0.0 0.13463546  0.10737179  0.09405048  0.086538464
AUSMDU00055507  AUSMDU00055508  0.0 0.13752596  0.105669074 0.0994591340.08904247

Which shows no matches at k = 13 for most genomes, which is likely why you're getting zeros in the core/accessory distances.

So, I think to fix this, run your sketch again adding --min-k 15 --k-step 3 --plot-fit 5. See if those plots look ok. You may need to increase the min-k to 17 if you are still having this issue. You could also add the exact counter argument again, this can be slightly better at removing sequencing errors.

johnlees commented 3 years ago

This section may be useful: https://poppunk.readthedocs.io/en/latest/sketching.html#choosing-the-right-k-mer-lengths

The genome length estimation is not great from reads as you can see, guessing you're expecting around 2Mb? (but does get more accurate using more k-mer lengths). This might be causing the random match chance estimation to get things wrong at the lower k range.

BioMinnie commented 3 years ago

Thank you for all your help here, indeed it was the kmer size that was the problem and when I increased it to 15, everything worked.

bacpop / PopPUNK

Poppunk ValueError (sklearn) when generating database #167