bacpop / PopPUNK

PopPUNK πŸ‘¨β€πŸŽ€ (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

IndexError: list index out of range (running current `master` branch) when running `poppunk_assign` #273

Closed HarryHung closed 7 months ago

HarryHung commented 12 months ago

Versions

Current master branch e555990 / 2.6.1

Command used and output returned python3 PopPUNK/poppunk_assign-runner.py --db ../../sl28/PopPUNK2/GPS_v8 --query qfile.txt --external-clustering ../../sl28/PopPUNK2/GPS_v8_external_clusters.csv --output output --threads 32

Describe the bug

Crash after network loading

...
Loading network from ../../sl28/PopPUNK2/GPS_v8/GPS_v8_graph.gt
Network loaded: 42482 samples
There are 42482  vertices in the network but 42209 reference names supplied; please check the '--model-dir' variable is pointing to the correct directory
Traceback (most recent call last):
  File "/lustre/scratch125/pam/teams/team284/ch31/poppunk_test/PopPUNK/poppunk_assign-runner.py", line 13, in <module>
    main()
  File "/lustre/scratch125/pam/teams/team284/ch31/poppunk_test/PopPUNK/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/lustre/scratch125/pam/teams/team284/ch31/poppunk_test/PopPUNK/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/lustre/scratch125/pam/teams/team284/ch31/poppunk_test/PopPUNK/PopPUNK/assign.py", line 603, in assign_query_hdf5
    addQueryToNetwork(dbFuncs,
  File "/lustre/scratch125/pam/teams/team284/ch31/poppunk_test/PopPUNK/PopPUNK/network.py", line 1288, in addQueryToNetwork
    G = construct_network_from_assignments(rList,
  File "/lustre/scratch125/pam/teams/team284/ch31/poppunk_test/PopPUNK/PopPUNK/network.py", line 1094, in construct_network_from_assignments
    G = construct_network_from_edge_list(rlist, qlist, connections,
  File "/lustre/scratch125/pam/teams/team284/ch31/poppunk_test/PopPUNK/PopPUNK/network.py", line 750, in construct_network_from_edge_list
    process_previous_network(previous_network = previous_network,
  File "/lustre/scratch125/pam/teams/team284/ch31/poppunk_test/PopPUNK/PopPUNK/network.py", line 650, in process_previous_network
    extra_sources, extra_targets = network_to_edges(previous_network,
  File "/lustre/scratch125/pam/teams/team284/ch31/poppunk_test/PopPUNK/PopPUNK/network.py", line 542, in network_to_edges
    source_ids = [old_id_indices[x] for x in old_source_ids]
  File "/lustre/scratch125/pam/teams/team284/ch31/poppunk_test/PopPUNK/PopPUNK/network.py", line 542, in <listcomp>
    source_ids = [old_id_indices[x] for x in old_source_ids]
IndexError: list index out of range
johnlees commented 12 months ago

Our error message isn't working properly here and should be giving you the list of missing samples, which we need to fix.

But ultimately this is being caused by a mismatch in samples between clusters, network and database. Can you check the number in each of these inputs? You can also use poppunk_info to help with this, see https://poppunk.readthedocs.io/en/latest/sketching.html#viewing-information-about-a-database
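The counts John asks for can be cross-checked with a few lines of Python. This is a standalone sketch, not part of PopPUNK: the helper names are mine, and the toy file names below merely stand in for files like GPS_v8_clusters.csv and GPS_v8.refs.

```python
# Sketch (not part of PopPUNK): cross-check the sample counts that must
# agree across a database's files before poppunk_assign will work.
import csv
import os
import tempfile

def count_samples_in_clusters(csv_path):
    """Number of non-header rows in a *_clusters.csv file."""
    with open(csv_path) as fh:
        return sum(1 for _ in csv.reader(fh)) - 1  # subtract the header row

def count_refs(refs_path):
    """Number of reference names in a .refs file (one name per line)."""
    with open(refs_path) as fh:
        return sum(1 for line in fh if line.strip())

# Toy data: 3 samples in the clusters CSV, 2 of them kept as references.
d = tempfile.mkdtemp()
clusters_csv = os.path.join(d, "toy_clusters.csv")
refs_file = os.path.join(d, "toy.refs")
with open(clusters_csv, "w") as fh:
    fh.write("Taxon,Cluster\nsample1,1\nsample2,1\nsample3,2\n")
with open(refs_file, "w") as fh:
    fh.write("sample1\nsample3\n")

print(count_samples_in_clusters(clusters_csv))  # 3
print(count_refs(refs_file))                    # 2
```

A mismatch like the 42482-vs-42209 in the error above would show up here as the clusters count disagreeing with the .refs count.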

HarryHung commented 12 months ago

Thanks for getting back to me John.

The output of poppunk_info is

Graph-tools OpenMP parallelisation enabled: with 1 threads
PopPUNK database:       GPS_v8
Sketch version:         c42cd0e22f6ef6d5c9a2900fa16367e096519170
Number of samples:      42482
K-mer sizes:            14,17,20,23,26,29
Sketch size:            9984
Contains random matches:    True
Codon phased seeds:     False

There are 42483 lines in GPS_v8_clusters.csv (i.e. 42482 samples plus the header)

There are 42483 lines in GPS_v8_external_clusters.csv (i.e. 42482 samples plus the header)

And according to the above error output of PopPUNK, it seems the network has 42482 samples as well:

Network loaded: 42482 samples

As I did not create this version of the database, I am wondering if there is a way to deduce whether this is a database issue or a general PopPUNK bug?

johnlees commented 12 months ago

Thanks for checking Harry.

As I did not create this version of the database, I am wondering if there is a way to deduce whether this is a database issue or a general PopPUNK bug?

It's hard to say, I think. Of course, in general, assignment works in this version (it's in the CI). I suspect it's a database/files issue, but that's not to say that PopPUNK isn't at fault for picking up a bad default or being unhelpful with the error.

Any idea where the 42209 in the original error is coming from? How many lines does the .refs file in the model directory have? Have you/Steph tried using --model-dir to point at the original fit?

HarryHung commented 12 months ago

It seems you have a good hunch on the root cause.

The line count of GPS_v8.refs is indeed 42209.

I have checked with Steph and she ran the command (no --model-dir)

poppunk_assign --db GPS_v6 --distances GPS_v6/GPS_v6.dists --query query.txt --external-clustering GPS_v6_external_clusters.csv --update-db --output GPS_v8

It seems that in this case, --model-dir is necessary to make the updated DB work?

Should I ask her to rerun the command as

poppunk_assign --db GPS_v6 --distances GPS_v6/GPS_v6.dists --model-dir GPS_v6 --query query.txt --external-clustering GPS_v6_external_clusters.csv --update-db --output GPS_v8

Or is there a simpler fix?

If I understand the issue correctly, I am wondering why --model-dir is not a mandatory option when --update-db is used. Or is there a use case that requires the absence of --model-dir? There seems to be an explanation at https://poppunk.readthedocs.io/en/latest/query_assignment.html#updating-the-database, but to be honest I don't think I really understand it.

johnlees commented 12 months ago

I think the command to try should be:

python3 PopPUNK/poppunk_assign-runner.py --db ../../sl28/PopPUNK2/GPS_v8 --query qfile.txt --external-clustering ../../sl28/PopPUNK2/GPS_v8_external_clusters.csv --model-dir ../../sl28/PopPUNK2/GPS_v6 --output output

Let's see if that works.

I think our expected use is that everything for the DB is in one folder and not versioned like this, so that when you point to the DB directory it has everything in it that's needed.

The easier thing to do is just copy the fit from v6 over to the v8 directory, but I appreciate that it isn't clear which files are needed.

If this is indeed the issue I will try and update the documentation to be clearer.
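Copying the fit across with a renamed prefix could be sketched as below. This is a hedged illustration, not a PopPUNK utility: which files actually constitute "the fit" is clarified further down this thread (_fit.pkl and _fit.npz), and as also reported later, copying them alone may not be sufficient.

```python
# Hedged sketch: copy every <prefix>_fit.* file from one database
# directory to another, renaming the prefix (e.g. GPS_v6 -> GPS_v8).
import glob
import os
import shutil

def copy_fit(src_dir, src_prefix, dst_dir, dst_prefix):
    """Copy <src_prefix>_fit.* from src_dir into dst_dir, renamed
    with dst_prefix. Returns the copied file names, sorted."""
    copied = []
    for path in glob.glob(os.path.join(src_dir, src_prefix + "_fit.*")):
        new_name = os.path.basename(path).replace(src_prefix, dst_prefix, 1)
        dst = os.path.join(dst_dir, new_name)
        shutil.copy2(path, dst)  # preserves timestamps/permissions
        copied.append(os.path.basename(dst))
    return sorted(copied)
```

For example, `copy_fit("GPS_v6", "GPS_v6", "GPS_v8", "GPS_v8")` would copy GPS_v6_fit.pkl and GPS_v6_fit.npz into the v8 directory as GPS_v8_fit.pkl and GPS_v8_fit.npz.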

HarryHung commented 12 months ago

Thank you John!

I misunderstood the usage of --model-dir: I mistakenly thought it was associated with the --update-db option, but now I can see it tells poppunk_assign to use an alternative list of references.

I can confirm the following command can be completed successfully

python3 PopPUNK/poppunk_assign-runner.py --db ../../sl28/PopPUNK2/GPS_v8 --query qfile.txt --external-clustering ../../sl28/PopPUNK2/GPS_v8_external_clusters.csv --model-dir ../../sl28/PopPUNK2/GPS_v6 --output output --threads 32

So if I want to have everything in a single database directory and forgo the need to use --model-dir when running poppunk_assign, which files exactly should I copy from the v6 to the v8 directory (changing v6 in the file names to v8, I suppose)? And will such a process result in the loss of information gained by updating the database via --update-db?

If we want to maintain a versioned database, the best way forward seems to be to copy the database directory, and then run the command below in the newly created directory?

poppunk_assign --db GPS --query query.txt --update-db --output GPS --overwrite --external-clustering GPS_external_clusters.csv 

On another note, I notice that apart from the files listed in https://poppunk.readthedocs.io/en/latest/query_assignment.html#downloading-a-database, there are a few extra files in the database directory that are not mentioned in the doc:

* `GPS_v8.refs.dists.npy`

* `GPS_v8.refs.dists.pkl`

* `GPS_v8_refs_graph.gt`

* `GPS_v8.refs.h5`

Are they necessary? They double the size of the database directory (their non-.refs counterparts are also in the directory)

.
β”œβ”€β”€ [583K]  GPS_v8_clusters.csv
β”œβ”€β”€ [6.7G]  GPS_v8.dists.npy
β”œβ”€β”€ [542K]  GPS_v8.dists.pkl
β”œβ”€β”€ [589K]  GPS_v8_external_clusters.csv
β”œβ”€β”€ [1.1K]  GPS_v8_fit.npz
β”œβ”€β”€ [  26]  GPS_v8_fit.pkl
β”œβ”€β”€ [ 20M]  GPS_v8_graph.gt
β”œβ”€β”€ [4.3G]  GPS_v8.h5
β”œβ”€β”€ [456K]  GPS_v8.refs
β”œβ”€β”€ [6.6G]  GPS_v8.refs.dists.npy
β”œβ”€β”€ [539K]  GPS_v8.refs.dists.pkl
β”œβ”€β”€ [ 20M]  GPS_v8_refs_graph.gt
β”œβ”€β”€ [4.2G]  GPS_v8.refs.h5
└── [ 15K]  GPS_v8_unword_clusters.csv
HarryHung commented 11 months ago

@johnlees On a side note, I am wondering: when running poppunk_assign, does the loading of a huge database like GPS take up the majority of the run time? I am considering whether to run poppunk_assign separately on each assembly to increase the stability of my pipeline, instead of the current all-samples-in-one-run approach, but will not opt for it if it carries a hefty time penalty.

johnlees commented 11 months ago

So if I want to have everything in a single database directory and forgo the need to use --model-dir when running poppunk_assign, which files exactly should I copy from the v6 to the v8 directory (changing v6 in the file names to v8, I suppose)? And will such a process result in the loss of information gained by updating the database via --update-db?

It would be the _fit.pkl and _fit.npy files.

If we want to maintain a versioned database, the best way forward seems to be to copy the database directory, and then run the command below in the newly created directory?

poppunk_assign --db GPS --query query.txt --update-db --output GPS --overwrite --external-clustering GPS_external_clusters.csv 

That's one option. You could also output to a new directory without --overwrite.

On another note, I notice that apart from the files listed in https://poppunk.readthedocs.io/en/latest/query_assignment.html#downloading-a-database, there are a few extra files in the database directory that are not mentioned in the doc:

* `GPS_v8.refs.dists.npy`

* `GPS_v8.refs.dists.pkl`

* `GPS_v8_refs_graph.gt`

* `GPS_v8.refs.h5`

Are they necessary? They double the size of the database directory (their non-.refs counterparts are also in the directory)

.
β”œβ”€β”€ [583K]  GPS_v8_clusters.csv
β”œβ”€β”€ [6.7G]  GPS_v8.dists.npy
β”œβ”€β”€ [542K]  GPS_v8.dists.pkl
β”œβ”€β”€ [589K]  GPS_v8_external_clusters.csv
β”œβ”€β”€ [1.1K]  GPS_v8_fit.npz
β”œβ”€β”€ [  26]  GPS_v8_fit.pkl
β”œβ”€β”€ [ 20M]  GPS_v8_graph.gt
β”œβ”€β”€ [4.3G]  GPS_v8.h5
β”œβ”€β”€ [456K]  GPS_v8.refs
β”œβ”€β”€ [6.6G]  GPS_v8.refs.dists.npy
β”œβ”€β”€ [539K]  GPS_v8.refs.dists.pkl
β”œβ”€β”€ [ 20M]  GPS_v8_refs_graph.gt
β”œβ”€β”€ [4.2G]  GPS_v8.refs.h5
└── [ 15K]  GPS_v8_unword_clusters.csv

I think as long as you keep the GPS_v8.refs file, the rest can be removed.

johnlees commented 11 months ago

@johnlees On a side note, I am wondering: when running poppunk_assign, does the loading of a huge database like GPS take up the majority of the run time? I am considering whether to run poppunk_assign separately on each assembly to increase the stability of my pipeline, instead of the current all-samples-in-one-run approach, but will not opt for it if it carries a hefty time penalty.

Yes, I think this will be most of the cost. Have you looked at the --serial option? That might be useful, as it does assignment one-by-one but only loads the database once.

HarryHung commented 11 months ago

Thanks for the info!

I will try to consolidate the files into a single directory and/or remove unnecessary files in the published database for assignment, then report back to you.

HarryHung commented 11 months ago

It seems the full description of --serial is not available in the PopPUNK documentation. But if I want to run PopPUNK on each sample as an individual job in different containers (the preferred way in Nextflow), it seems --serial would not help.

I think it is fine for now: I will just run PopPUNK on all samples at once and split the results. I will only need to reconsider this approach if someone complains about the stability of PopPUNK (so far, poppunk_assign seems quite stable, provided the database itself is okay).

HarryHung commented 11 months ago

It would be the _fit.pkl and _fit.npy files.

@johnlees I guess it should be _fit.npz, as there is no _fit.npy.

Based on md5sum, the following have the same content:

I tried to replace the original GPS_v8_fit.pkl and GPS_v8_fit.npz with GPS_v6_fit.pkl and GPS_v6_fit.npz (renamed from v6 to v8) anyway, but the same error occurs.

I notice that when I point to GPS_v6 with --model-dir (while --db points to GPS_v8), the network file being loaded is different:

This might be related to the fact that, when --model-dir is used, model_prefix is replaced:

    if model_dir is not None:
        model_prefix = model_dir

subsequently, the replaced model_prefix is used to read more files

        ref_file_name = os.path.join(model_prefix,
                        os.path.basename(model_prefix) + file_extension_string + ".refs")

therefore, simply replacing the two _fit.pkl and _fit.npz files does not seem to be enough.
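The two excerpts quoted above can be condensed into a standalone sketch (the function name and arguments are mine, not PopPUNK's API) showing why --db GPS_v8 alone resolves the references to GPS_v8/GPS_v8.refs, while --model-dir GPS_v6 redirects the lookup to GPS_v6/GPS_v6.refs:

```python
# Simplified reconstruction of the path logic from PopPUNK/network.py:
# the model directory, when given, replaces the database prefix used to
# locate the .refs file.
import os

def refs_file_path(db_dir, model_dir=None, file_extension_string=""):
    model_prefix = db_dir
    if model_dir is not None:
        model_prefix = model_dir  # --model-dir overrides the --db prefix
    return os.path.join(model_prefix,
                        os.path.basename(model_prefix)
                        + file_extension_string + ".refs")

print(refs_file_path("GPS_v8"))                      # GPS_v8/GPS_v8.refs
print(refs_file_path("GPS_v8", model_dir="GPS_v6"))  # GPS_v6/GPS_v6.refs
```

So the directory name itself determines which .refs file is read, which is why renaming copied fit files alone does not change the lookup.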

I would be grateful if you could suggest what I should do next, as I am afraid that by replacing files one by one, I might end up reverting everything back to GPS_v6. Or, at this point, should we just re-generate v8 using the previously mentioned command with the --overwrite option?

johnlees commented 10 months ago

I think I need to do two things here (in #279):

  1. Make sure the overwrite option works properly
  2. Check that the .refs file after update-db in the output directory is correct
HarryHung commented 10 months ago

Sounds good to me. Thanks John!

HarryHung commented 10 months ago

Sorry for bothering you again @johnlees, I am wondering whether branch i273 of https://github.com/bacpop/PopPUNK/pull/279 is ready for testing, or should I wait for it to be merged into master and test on the master branch instead?

Thanks!

johnlees commented 10 months ago

Sorry, there was a difficult issue with one of the dependencies which took a while to resolve. I think i273 should be ok, but I wanted to run a final test before merging it. I won't have time to do that for a while, so I would go ahead and use that branch; it should fix your issue (and if it doesn't, then I'll know what to change further!)

HarryHung commented 10 months ago

Sure. I will test it out and let you know how it goes. Thanks!

HarryHung commented 10 months ago

Sadly, it is still not working properly.

I have noticed there is another line blocking the --overwrite option from functioning properly, and I attempted to fix that in PR https://github.com/bacpop/PopPUNK/pull/282

However, even with that fix in place, running

python3 /path/to/poppunk_assign-runner.py --db GPS_v6 --query query_for_v8_update.txt --external-clustering GPS_v6_external_clusters.csv --update-db --overwrite --output GPS_v6 --threads 32

still leads to the following error:

PopPUNK: assign
    (with backend: sketchlib v2.1.1
     sketchlib: /path/to/miniconda3/envs/pp_env/lib/python3.10/site-packages/pp_sketchlib.cpython-310-x86_64-linux-gnu.so)
Mode: Assigning clusters of query sequences

Graph-tools OpenMP parallelisation enabled: with 32 threads
Overwriting db: GPS_v6/GPS_v6.h5
Sketching 319 genomes using 32 thread(s)

Progress (CPU): 0 / 319
Progress (CPU): 3 / 319
Progress (CPU): 42 / 319
Progress (CPU): 69 / 319
Progress (CPU): 103 / 319
Progress (CPU): 135 / 319
Progress (CPU): 167 / 319
Progress (CPU): 201 / 319
Progress (CPU): 236 / 319
Progress (CPU): 272 / 319
Progress (CPU): 298 / 319
Progress (CPU): 319 / 319
Writing sketches to file
Loading previously refined model
Completed model loading
Missing sketch: Unable to open the group "/sketches/10050_2#1": (Symbol table) Object not found
Could not find random match chances in database, calculating assuming equal base frequencies
Traceback (most recent call last):
  File "/path/to/poppunk_assign-runner.py", line 13, in <module>
    main()
  File "/path/to/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/path/to/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/path/to/PopPUNK/assign.py", line 474, in assign_query_hdf5
    qrDistMat = queryDatabase(rNames = rNames,
  File "/path/to/PopPUNK/sketchlib.py", line 575, in queryDatabase
    distMat = pp_sketchlib.queryDatabase(ref_db_name=ref_db,
RuntimeError: Query with empty ref or query list!

It also seems to render the partially overwritten database unusable.

If I only remove --overwrite and output to another directory:

python3 /path/to/poppunk_assign-runner.py --db GPS_v6 --query query_for_v8_update.txt --external-clustering GPS_v6_external_clusters.csv --update-db --output GPS_v8 --threads 32

the same command completes without error.

But then the original issue remains: when I use the newly created GPS_v8 as-is, without --model-dir pointing back to GPS_v6, assigning lineages to samples results in the following error: ERROR: There are 42482 vertices in the network but 42209 reference names supplied; please check the '--model-dir' variable is pointing to the correct directory

Please let me know if you have any idea on this @johnlees. Thanks!

johnlees commented 9 months ago

I think the solution should be the following:

@nickjcroucher does this seem reasonable to you?

HarryHung commented 9 months ago

That sounds great John, if you and Nick agree on that and push out an update, I am more than happy to test again.

nickjcroucher commented 9 months ago

@johnlees - yes, I agree, that behaviour makes sense - should we accompany the database updating with some kind of versioning as well? Or have some other way of identifying/archiving the "parent" database, and replacing it with the latest one? Just thinking about continuous updating as a backend to online systems, but I'm sure there are better thought through approaches that already exist.

HarryHung commented 8 months ago

@johnlees I am just wondering if you have an estimation on when the new update process will be available? Thanks for all the amazing work so far!

johnlees commented 8 months ago

Realistically end of this year at best. We could talk about this (in person) if we need it urgently

HarryHung commented 8 months ago

Thanks for the update! I will check with my team regarding its urgency.

johnlees commented 7 months ago

This was fixed in #290