Cytoscape includes strains from the entire database even with the --include-files option

sydelstan commented 2 months ago

Version

poppunk version: 2.5.0

Commands

poppunk_assign --db GPS_v4 --query qfile.txt --output poppunk_clusters --threads 8 --external-clustering meta.csv --update-db

poppunk_visualise --ref-db poppunk_clusters --output grapetree_X --grapetree --include-files strains.csv --external-clustering meta.csv
poppunk_visualise --ref-db poppunk_clusters --output phandango_X --phandango --include-files strains.csv --external-clustering meta.csv
poppunk_visualise --ref-db poppunk_clusters --output cytoscape_X --cytoscape --network-file poppunk_clusters_refs_graph.gt --include-files strains.csv --external-clustering meta.csv

Output

Graph-tools OpenMP parallelisation enabled: with 1 threads
PopPUNK: visualise
Loading previously refined model
Completed model loading
Reading existing tree from grapetree_Ia/grapetree_X_core_NJ.nwk
Writing grapetree output
Parsed data, now writing to CSV
Unable to write phylogeny to grapetree_X/grapetree_X_core_NJ.nwk
Done

Graph-tools OpenMP parallelisation enabled: with 1 threads
PopPUNK: visualise
Loading previously refined model
Completed model loading
Reading existing tree from phandango_Ia/phandango_X_core_NJ.tree
Writing phandango output
Parsed data, now writing to CSV
Unable to write phylogeny to phandango_Ia/phandango_X_core_NJ.tree
Done

Graph-tools OpenMP parallelisation enabled: with 1 threads
PopPUNK: visualise
Loading previously refined model
Completed model loading
Writing cytoscape output
Network loaded: 3616 samples
Parsed data, now writing to CSV

Describe the bug

I was hoping to obtain grapetree, phandango, and cytoscape visualizations that included only the strains listed in strains.csv. Without the --include-files option, the visualization contains strains from the entire database. This option works for grapetree and phandango, but not cytoscape. With the option, the cytoscape figure no longer includes the entire database (without it, it does), but it still includes thousands of samples not listed in strains.csv.

johnlees commented 2 months ago

Can you try with the latest version (v2.6.5) and confirm whether you still get this issue there? Looking at the release history I see we've made multiple changes to the visualisation code since v2.5.0

sydelstan commented 2 months ago

I ran the same lines with v2.6.5 and I ended up with even more strains included in the cytoscape network. Am I calling the correct reference database and network file? Is it appropriate to simply remove the extraneous strains from the final network or is there something weird going on with the network generation overall?

johnlees commented 2 months ago

Thanks for the report and for re-running. Looking at the code, we (try to) do this in these two places:

I can't see an obvious issue so would need to try and reproduce.

It would also help if you could let me know the number of files included in the database, the visualisation and the subset file, and also give an example of a strain that's included in the visualisation but not in the subset file.
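A minimal sketch of how those numbers could be gathered, assuming the subset file lists one sample name per line and that the node names from the visualisation have been exported to a plain list, one name per line. The function name and the example data are hypothetical, and nothing here is part of PopPUNK itself.

```python
def compare_sample_sets(network_names, subset_names):
    """Summarise the overlap between names in the network and the requested subset."""
    # Strip whitespace and drop blank lines so file handles can be passed directly
    network = {n.strip() for n in network_names if n.strip()}
    subset = {n.strip() for n in subset_names if n.strip()}
    return {
        "n_network": len(network),
        "n_subset": len(subset),
        "unexpected": sorted(network - subset),  # in the network but not requested
        "missing": sorted(subset - network),     # requested but absent from the network
    }


# Hypothetical example data standing in for the two lists
network_names = ["s1", "s2", "s3", "ref_17"]
subset_names = ["s1", "s2", "s3", "s4"]

report = compare_sample_sets(network_names, subset_names)
print(report["n_network"], report["n_subset"])
print(report["unexpected"][:5])  # a few example strains that should not be there
```

With real data, an open file handle (for example `open("strains.csv")`) can be passed in place of either list, since the function only iterates over lines.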

> Am I calling the correct reference database and network file?

Those commands look alright to me, with the possible exception that if you are running --update-db you need the full database, not just the references (I see ~3600 samples loaded in one of the messages). I can't tell for sure from the output above whether that's a problem, though. Did you use the full or reference DB?

> Is it appropriate to simply remove the extraneous strains from the final network or is there something weird going on with the network generation overall?

Yes that would be fine as that's exactly what the code is supposed to do.
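As a rough illustration of that manual clean-up, here is a minimal sketch that keeps only the wanted strains and the edges between them (an induced subgraph). It assumes the network is available as a plain list of edge pairs. The function name and the toy data are hypothetical, and this is not PopPUNK's own implementation.

```python
def induced_subgraph(edges, keep):
    """Keep only the wanted nodes and the edges with both endpoints among them."""
    keep = set(keep)
    present = {node for edge in edges for node in edge}
    nodes = sorted(present & keep)
    kept_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return nodes, kept_edges


# Hypothetical toy network: s1-s3 are wanted, db_9 is an extraneous database strain
edges = [("s1", "s2"), ("s2", "db_9"), ("db_9", "s3"), ("s1", "s3")]
keep = ["s1", "s2", "s3"]

nodes, kept_edges = induced_subgraph(edges, keep)
print(nodes)
print(kept_edges)
```

Edges between the retained strains are left untouched, but any connection that ran only through a removed strain disappears, so some retained strains may end up in separate components.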

sydelstan commented 2 months ago

> It would also help if you could let me know the number of files included in the database, the visualisation and the subset file, and also give an example of a strain that's included in the visualisation but not in the subset file.

I believe the database contains 40K strains. With v2.6.5, running the command below got me the closest:

poppunk_visualise --ref-db poppunk_clusters --output cytoscape_Ia --cytoscape --network-file poppunk_clusters.refs_graph.gt --include-files strains.csv --external-clustering meta.csv

The network has 5,090 strains, and the subset file has 1,693 strains. Sorry, I can't tell which strains from the database are included in the network. Only the IDs I provided for the strains I used to update the database carry over to cytoscape, while the database strains are given a number.

> Those commands look alright to me, with the possible exception that if you are running --update-db you need the full database, not just the references (I see ~3600 samples loaded in one of the messages). I can't tell for sure from the output above whether that's a problem, though. Did you use the full or reference DB?

For the visualization, I am using the database that is the output of this command:

poppunk_assign --db GPS_v4 --query qfile.txt --output poppunk_clusters --threads 8 --external-clustering meta.csv --update-db

I am not sure if this is considered the full or reference database.

> Yes that would be fine as that's exactly what the code is supposed to do.

Okay, if it is fine to delete the extraneous nodes/strains while preserving the network architecture, I will probably just do that then!

sydelstan commented 1 month ago

I am also having issues with assigning GPSCs: the external clustering file labels every strain as "NA", even though the GPSCs of most of these strains have been previously established. My cytoscape network also produces a lot more distinct clusters than I'd expect given how closely related the strains are, so I wonder whether these issues are related to the previous one.

bacpop / PopPUNK

Cytoscape includes strains from the entire database even with the --include-files option #309