Closed: HarryHung closed this issue 7 months ago
Our error message isn't working properly here and should be giving you the list of missing samples, which we need to fix.
But ultimately this is caused by a mismatch in samples between the clusters, network and database. Can you check the number of samples in each of these inputs? You can also use `poppunk_info` to help with this; see https://poppunk.readthedocs.io/en/latest/sketching.html#viewing-information-about-a-database
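For a quick cross-check of those counts, something like the sketch below could help. This is a hypothetical diagnostic (not a PopPUNK utility); the inputs are assumed to come from `poppunk_info`, the "Network loaded" log line, and `wc -l` on the cluster CSV and `.refs` files.

```python
# Hypothetical diagnostic (not part of PopPUNK): cross-check sample counts
# gathered from poppunk_info, the "Network loaded" log line, and line counts
# of the cluster CSV and .refs files.
def check_sample_counts(db_samples, network_samples, cluster_csv_lines, refs_lines):
    counts = {
        "database": db_samples,
        "network": network_samples,
        "clusters": cluster_csv_lines - 1,  # the CSV has one header row
        "refs": refs_lines,                 # the .refs file has no header
    }
    return [f"{name}: {n} (expected {db_samples})"
            for name, n in counts.items() if n != db_samples]

# With the numbers reported later in this thread, only .refs disagrees:
print(check_sample_counts(42482, 42482, 42483, 42209))
# ['refs: 42209 (expected 42482)']
```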
Thanks for getting back to me, John.
The output of `poppunk_info` is:

```
Graph-tools OpenMP parallelisation enabled: with 1 threads
PopPUNK database: GPS_v8
Sketch version: c42cd0e22f6ef6d5c9a2900fa16367e096519170
Number of samples: 42482
K-mer sizes: 14,17,20,23,26,29
Sketch size: 9984
Contains random matches: True
Codon phased seeds: False
```
There are 42483 lines in `GPS_v8_clusters.csv` (i.e. 42482 samples plus the header).
There are 42483 lines in `GPS_v8_external_clusters.csv` (i.e. 42482 samples plus the header).
And according to the above error output of PopPUNK, it seems the network has 42482 samples as well:

```
Network loaded: 42482 samples
```
As I did not create this version of the database, I am wondering whether there is a way to deduce if this is a database issue or a general PopPUNK bug?
Thanks for checking Harry.
> As I did not create this version of the database, I am wondering is there a way to deduce whether it is a database issue, or a general PopPUNK bug?
It's hard to say, I think. Of course, in general assignment works in this version (it's in the CI). I suspect it's a database/files issue, but that's not to say that PopPUNK isn't at fault for picking up a bad default or being unhelpful with the error.
Any idea where the 42209 in the original error is coming from? How many lines does the `.refs` file in the model directory have? Have you/Steph tried using `--model-dir` to point at the original fit?
It seems you have a good hunch on the root cause.
The line count of `GPS_v8.refs` is indeed 42209.
I have checked with Steph and she ran the command (with no `--model-dir`):

```
poppunk_assign --db GPS_v6 --distances GPS_v6/GPS_v6.dists --query query.txt --external-clustering GPS_v6_external_clusters.csv --update-db --output GPS_v8
```
It seems that in this case, `--model-dir` is necessary to make the updated DB work?
Should I ask her to rerun the command as:

```
poppunk_assign --db GPS_v6 --distances GPS_v6/GPS_v6.dists --model-dir GPS_v6 --query query.txt --external-clustering GPS_v6_external_clusters.csv --update-db --output GPS_v8
```

Or is there a simpler fix?
If I understand the issue correctly, I am wondering why `--model-dir` is not a mandatory option when `--update-db` is used. Or is there a use case that requires the absence of `--model-dir`?
There seems to be an explanation at https://poppunk.readthedocs.io/en/latest/query_assignment.html#updating-the-database, but I don't think I really understand it, to be honest.
I think the command to try should be:

```
python3 PopPUNK/poppunk_assign-runner.py --db ../../sl28/PopPUNK2/GPS_v8 --query qfile.txt --external-clustering ../../sl28/PopPUNK2/GPS_v8_external_clusters.csv --model-dir ../../sl28/PopPUNK2/GPS_v6 --output output
```
Let's see if that works.
I think our expected use is that everything for the DB is in one folder and not versioned like this, so that when you point to the DB directory it has everything in it that's needed.
The easier thing to do is just copy the fit from v6 over to the v8 directory, but I appreciate which files are needed isn't clear.
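That copy could be scripted roughly as below. This is a sketch only: the `_fit.pkl`/`_fit.npz` file names are inferred from later in this thread, and as the discussion further down shows, the fit files alone may not be sufficient.

```python
import shutil
from pathlib import Path

# Sketch of the suggested fix (file names inferred from this thread; not an
# official PopPUNK command): copy the model fit from the old directory into
# the new one, renaming the version prefix so the new DB is self-contained.
def copy_fit_files(src_dir, dst_dir, src_prefix, dst_prefix,
                   suffixes=("_fit.pkl", "_fit.npz")):
    copied = []
    for suffix in suffixes:
        src = Path(src_dir) / (src_prefix + suffix)
        dst = Path(dst_dir) / (dst_prefix + suffix)
        shutil.copy2(src, dst)  # preserves metadata; overwrites if dst exists
        copied.append(dst.name)
    return copied

# Usage (paths assumed): copy_fit_files("GPS_v6", "GPS_v8", "GPS_v6", "GPS_v8")
```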
If this is indeed the issue I will try and update the documentation to be clearer.
Thank you John!
I misunderstood the usage of `--model-dir`; I mistakenly thought it was associated with the `--update-db` option, but now I can see it is asking `poppunk_assign` to use an alternative list of references.
I can confirm the following command completes successfully:

```
python3 PopPUNK/poppunk_assign-runner.py --db ../../sl28/PopPUNK2/GPS_v8 --query qfile.txt --external-clustering ../../sl28/PopPUNK2/GPS_v8_external_clusters.csv --model-dir ../../sl28/PopPUNK2/GPS_v6 --output output --threads 32
```
So if I want to have everything in a single database directory and forgo the need to use `--model-dir` when running `poppunk_assign`, which files exactly should I copy from the v6 to the v8 directory (and change v6 in the file names to v8, I suppose)? And will such a process result in the loss of information gained by updating the database via `--update-db`?
If we want to maintain a versioned database, the best way forward seems to be to copy the database directory, and in the newly created directory run the command below?

```
poppunk_assign --db GPS --query query.txt --update-db --output GPS --overwrite --external-clustering GPS_external_clusters.csv
```
On another note, I notice that apart from the files listed in https://poppunk.readthedocs.io/en/latest/query_assignment.html#downloading-a-database, there are a few extra files in the database directory that are not mentioned in the doc:
* `GPS_v8.refs.dists.npy`
* `GPS_v8.refs.dists.pkl`
* `GPS_v8_refs_graph.gt`
* `GPS_v8.refs.h5`
Are they necessary? They double the size of the database directory (there are non-`.refs` counterparts of them in the directory):

```
.
├── [583K] GPS_v8_clusters.csv
├── [6.7G] GPS_v8.dists.npy
├── [542K] GPS_v8.dists.pkl
├── [589K] GPS_v8_external_clusters.csv
├── [1.1K] GPS_v8_fit.npz
├── [  26] GPS_v8_fit.pkl
├── [ 20M] GPS_v8_graph.gt
├── [4.3G] GPS_v8.h5
├── [456K] GPS_v8.refs
├── [6.6G] GPS_v8.refs.dists.npy
├── [539K] GPS_v8.refs.dists.pkl
├── [ 20M] GPS_v8_refs_graph.gt
├── [4.2G] GPS_v8.refs.h5
└── [ 15K] GPS_v8_unword_clusters.csv
```
@johnlees
On a side note, I am wondering: when running `poppunk_assign`, does the loading of a huge database like GPS take up the majority of the run time? I am considering whether I should run `poppunk_assign` separately on each assembly to increase the stability of my pipeline, instead of the current all-samples-in-one-run approach, but I will not opt for it if it carries a hefty time penalty.
> So if I want to have everything in a single database directory and forgo the need to use `--model-dir` when running `poppunk_assign`, which files exactly should I copy (and change v6 in the file names to v8 I suppose) from v6 to v8 directory? And will such process result in the loss of information gained by updating the database via `--update-db`?
It would be the `_fit.pkl` and `_fit.npy` files.
> If we want to maintain versioned database, the best way forward seems to be copy the database directory, and in the newly created directory run the below command?
> `poppunk_assign --db GPS --query query.txt --update-db --output GPS --overwrite --external-clustering GPS_external_clusters.csv`
That's one option. You could also output to a new directory without `--overwrite`.
> On another note, I notice apart from the files listed in https://poppunk.readthedocs.io/en/latest/query_assignment.html#downloading-a-database, there are a few extra files in the database directory that are not mentioned in the doc: `GPS_v8.refs.dists.npy`, `GPS_v8.refs.dists.pkl`, `GPS_v8_refs_graph.gt`, `GPS_v8.refs.h5`
> Are they necessary? As they double the size of the database directory (there are non-`.refs` counterparts of them in the directory).
I think as long as you keep the `GPS_v8.refs` file, the rest can be removed.
> @johnlees On a side note, I am wondering when running `poppunk_assign`, does the loading of a huge database like GPS take up the majority of run time? Because I am considering whether I should run `poppunk_assign` separately on each assembly to increase the stability of my pipeline instead of the current all-samples-in-one-run approach, but will not opt for it if it has a hefty time cost penalty.
Yes, I think this will be most of the cost. Have you looked at the `--serial` option? That might be useful, as it does assignment one-by-one but only loads the database once.
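A back-of-envelope model shows why the one-job-per-sample approach is costly. The numbers below are illustrative placeholders, not PopPUNK measurements.

```python
# Illustrative cost model (made-up numbers, not measurements): if loading the
# database costs L seconds and assigning one sample costs c seconds, running
# each of n samples as a separate job pays the load cost n times over.
def total_time(n_samples, load_cost, per_sample_cost, separate_jobs):
    loads = n_samples if separate_jobs else 1
    return loads * load_cost + n_samples * per_sample_cost

# e.g. 319 samples, a hypothetical 60 s database load and 1 s per assignment:
one_by_one = total_time(319, 60.0, 1.0, separate_jobs=True)
batched = total_time(319, 60.0, 1.0, separate_jobs=False)
print(one_by_one, batched)  # 19459.0 379.0
```

With a large database the fixed load cost dominates, which is what makes `--serial` (load once, assign one-by-one) attractive over separate jobs.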
Thanks for the info!
I will try to consolidate the files into a single directory and/or remove unnecessary files in the published database for assignment, then report back to you.
It seems the full description of `--serial` is not available in the PopPUNK documentation. But if I want to run PopPUNK for each sample as an individual job in different containers (which is the preferred way in Nextflow), it seems `--serial` would not help.
I think it is fine for now; I will just run PopPUNK for all samples at once and split the results. I will only need to reconsider this approach if someone complains about the stability of PopPUNK (for now, `poppunk_assign` seems quite stable, as long as the database itself is okay).
> It would be the `_fit.pkl` and `_fit.npy` files.
@johnlees
I guess it should be `_fit.npz`, as there is no `_fit.npy`.
Based on `md5sum`, the following have the same content:
* `GPS_v6_fit.pkl` vs `GPS_v8_fit.pkl`
* `GPS_v6_fit.npz` vs `GPS_v8_fit.npz`
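The same check can be done programmatically; here is a small sketch equivalent to comparing `md5sum` output (the function name is my own, not a PopPUNK helper):

```python
import hashlib

# Programmatic equivalent of the md5sum check above: two files have the same
# content iff their MD5 digests match (read in chunks to handle large files).
def same_content(path_a, path_b, chunk_size=1 << 20):
    digests = []
    for path in (path_a, path_b):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                h.update(block)
        digests.append(h.hexdigest())
    return digests[0] == digests[1]

# Usage (paths assumed): same_content("GPS_v6_fit.pkl", "GPS_v8_fit.pkl")
```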
I tried replacing the original `GPS_v8_fit.pkl` and `GPS_v8_fit.npz` with `GPS_v6_fit.pkl` and `GPS_v6_fit.npz` (also renaming `v6` to `v8`) anyway; the same error occurs.
I notice that when I point to `GPS_v6` with `--model-dir` (while `--db` points to `GPS_v8`), the network file being loaded is different.

With `--db` pointing to `GPS_v8` (with `GPS_v8_fit.pkl` and `GPS_v8_fit.npz` replaced), but without `--model-dir`:

```
Loading network from ../poppunk_db/GPS_v8/GPS_v8_graph.gt
Network loaded: 42482 samples
```

With `--db` pointing to `GPS_v8` and `--model-dir` pointing to `GPS_v6`:

```
Loading network from ../poppunk_db/GPS_v6/GPS_v6_graph.gt
Network loaded: 42163 samples
```
This might be related to the fact that when `--model-dir` is used, `model_prefix` is replaced:

```python
if model_dir is not None:
    model_prefix = model_dir
```

Subsequently, the replaced `model_prefix` is used to read more files:

```python
ref_file_name = os.path.join(model_prefix,
                             os.path.basename(model_prefix) + file_extension_string + ".refs")
```

Therefore, simply replacing the two `_fit.pkl` and `_fit.npz` files does not seem to be enough.
I would be grateful if you could suggest what I should do next, as by replacing files one by one, I am afraid I might revert everything back to `GPS_v6`. Or, at this point, should we just re-generate v8 using the previously mentioned command with the `--overwrite` option?
I think I need to do two things here (in #279):
Sounds good to me. Thanks John!
Sorry for bothering you again @johnlees,
I am wondering whether branch i273 of https://github.com/bacpop/PopPUNK/pull/279 is ready for testing, or should I wait for it to be merged into `master` and test on the `master` branch instead?
Thanks!
Sorry, there was a difficult issue with one of the dependencies which took a while to resolve. I think i273 should be OK, but I wanted to run a final test before merging it. I won't have time to do that for a while, so I would go ahead and use that branch; it should fix your issue (and if it doesn't, then I'll know what to change further!).
Sure. I will test it out and let you know how it goes. Thanks!
Sadly, it is still not working properly.
I have noticed there is another line blocking the `--overwrite` option from functioning properly, and I attempted to fix that in PR https://github.com/bacpop/PopPUNK/pull/282
However, even with that fix in place, running

```
python3 /path/to/poppunk_assign-runner.py --db GPS_v6 --query query_for_v8_update.txt --external-clustering GPS_v6_external_clusters.csv --update-db --overwrite --output GPS_v6 --threads 32
```

still leads to the following error:
```
PopPUNK: assign
(with backend: sketchlib v2.1.1
sketchlib: /path/to/miniconda3/envs/pp_env/lib/python3.10/site-packages/pp_sketchlib.cpython-310-x86_64-linux-gnu.so)
Mode: Assigning clusters of query sequences

Graph-tools OpenMP parallelisation enabled: with 32 threads
Overwriting db: GPS_v6/GPS_v6.h5
Sketching 319 genomes using 32 thread(s)
Progress (CPU): 0 / 319
Progress (CPU): 3 / 319
Progress (CPU): 42 / 319
Progress (CPU): 69 / 319
Progress (CPU): 103 / 319
Progress (CPU): 135 / 319
Progress (CPU): 167 / 319
Progress (CPU): 201 / 319
Progress (CPU): 236 / 319
Progress (CPU): 272 / 319
Progress (CPU): 298 / 319
Progress (CPU): 319 / 319
Writing sketches to file
Loading previously refined model
Completed model loading
Missing sketch: Unable to open the group "/sketches/10050_2#1": (Symbol table) Object not found
Could not find random match chances in database, calculating assuming equal base frequencies
Traceback (most recent call last):
  File "/path/to/poppunk_assign-runner.py", line 13, in <module>
    main()
  File "/path/to/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/path/to/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/path/to/PopPUNK/assign.py", line 474, in assign_query_hdf5
    qrDistMat = queryDatabase(rNames = rNames,
  File "/path/to/PopPUNK/sketchlib.py", line 575, in queryDatabase
    distMat = pp_sketchlib.queryDatabase(ref_db_name=ref_db,
RuntimeError: Query with empty ref or query list!
```
It also seems to render the partially overwritten database unusable.
If I only remove `--overwrite` and output to another directory:

```
python3 /path/to/poppunk_assign-runner.py --db GPS_v6 --query query_for_v8_update.txt --external-clustering GPS_v6_external_clusters.csv --update-db --output GPS_v8 --threads 32
```

the same command completes without error.
But then the original issue remains: without `--overwrite` and outputting to another directory, when I use the newly created GPS_v8 as it is, without `--model-dir` pointing back to GPS_v6, it results in the following error when assigning lineages to samples:

```
ERROR: There are 42482 vertices in the network but 42209 reference names supplied; please check the '--model-dir' variable is pointing to the correct directory
```
Please let me know if you have any idea on this @johnlees. Thanks!
I think the solution should be the following:
* When using `--update-db`, we should automatically copy the required files to the new output directory, including the model.

@nickjcroucher does this seem reasonable to you?
That sounds great John, if you and Nick agree on that and push out an update, I am more than happy to test again.
@johnlees - yes, I agree, that behaviour makes sense - should we accompany the database updating with some kind of versioning as well? Or have some other way of identifying/archiving the "parent" database, and replacing it with the latest one? Just thinking about continuous updating as a backend to online systems, but I'm sure there are better thought through approaches that already exist.
@johnlees I am just wondering if you have an estimation on when the new update process will be available? Thanks for all the amazing work so far!
Realistically end of this year at best. We could talk about this (in person) if we need it urgently
Thanks for the update! I will check with my team regarding its urgency.
This was fixed in #290
**Versions**
Current master branch e555990 / 2.6.1

**Command used and output returned**

```
python3 PopPUNK/poppunk_assign-runner.py --db ../../sl28/PopPUNK2/GPS_v8 --query qfile.txt --external-clustering ../../sl28/PopPUNK2/GPS_v8_external_clusters.csv --output output --threads 32
```

**Describe the bug**
Crash after network loading