bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

Versions of input databases sketches are different and the network file choice for Cytoscape visualization #314

Open RuwiniK opened 1 month ago

RuwiniK commented 1 month ago

Hello, I'm new to PopPUNK, but I'm very interested in using this tool in my research. I'm using poppunk_assign and poppunk_visualise from PopPUNK v2.6.5 to cluster a set of Streptococcus suis genomes (>1000) against an existing reference database (clusters) from the PopPUNK database, but I ran into two issues.

Question 1: Data quality control

When I ran the data quality control step (--run-qc) with poppunk_assign, I got the warning "Versions of input databases sketches are different, results may not be compatible", and all of the samples (in my 12-sample trial) and most of the samples (in the >1000-sample analysis) failed. I used poppunk_info to identify the sketch version of my database, poppunk_clusters (Sketch version: cd9e995172e83f99458db43ec7d127e92b62813e), and of the reference database, Ssuis_distribute_refs (Sketch version: ed862592a54de54d4b556a5e0530096c85f3a08d). How can I use the same sketch version as the reference database in my analysis?

Code: poppunk_assign --query rlist.txt --db Ssuis_distribute_refs --run-qc --max-zero-dist 1 --max-merge 3 --output Quality

Question 2: Cytoscape visualization

What is the correct input file for --network-file when using poppunk_visualise in "assign query mode"? Since there is no graph.gt file in the poppunk_clusters folder, I tried using the graph.gt file from the reference database as the input for --network-file. But the resulting .graphml network doesn't have Query nodes (which makes sense), and the Cytoscape CSV output doesn't have a "Status" column (with "Query" or "Reference"), as mentioned in the documentation example.

Code:
poppunk_assign --db Ssuis_distribute_refs --query rlist.txt --output poppunk_clusters --threads 8
poppunk_visualise --ref-db Ssuis_distribute_refs --query-db poppunk_clusters --output OUTPUT --cytoscape --network-file Ssuis_distribute_refs/Ssuis_distribute_refs.refs_graph.gt

Then I tried updating the database (--update-db, which creates two .gt files) and pointing --network-file at poppunk_clusters/poppunk_clusters_graph.gt. This also didn't create the "Status" column, although the .graphml network does contain all the Reference and Query nodes (which makes sense). One way I could work around the missing "Status" column is to create it manually. Am I missing something here, since the documentation (under Cytoscape) says "If you used assign query mode you will also have a column with ‘Query’ or ‘Reference’"?

Code:
poppunk_assign --db Ssuis_distribute_refs --query rlist.txt --output poppunk_clusters --threads 8 --update-db
poppunk_visualise --ref-db poppunk_clusters --output OUTPUT --cytoscape --network-file poppunk_clusters/poppunk_clusters_graph.gt

I greatly appreciate your help with these questions, and I apologize if they have already been asked and answered somewhere. Thank you very much.

Best regards, Ruwni

johnlees commented 1 month ago

> Versions of input databases sketches are different, results may not be compatible

This is usually not an issue, and you are using the latest version.

> all the samples (in my 12 samples trial) and most of the samples (in the >1000 samples analysis) were failed.

Are you using assemblies or reads? What QC criteria are listed as failing? You can turn QC off and run anyway if you wish, to at least see what results you get. The --serial flag may be helpful here.

For visualisation, try adding your query database too:

poppunk_visualise --ref-db Ssuis_distribute_refs --query-db poppunk_clusters --output suis_viz --cytoscape

RuwiniK commented 4 weeks ago

Hello, thank you very much for the quick response.

> Versions of input databases sketches are different, results may not be compatible
>
> This is usually not an issue, and you are using the latest version.

Would you recommend an earlier PopPUNK version for my analysis, since the existing database must have been created using an earlier version? My worry is whether version differences (in PopPUNK and sketchlib) can affect how correctly genome clusters are identified.

> all the samples (in my 12 samples trial) and most of the samples (in the >1000 samples analysis) were failed.
>
> Are you using assemblies or reads? What QC criteria are listed as failing? You can turn QC off and run anyway if you wish, to at least see what results you get. The --serial flag may be helpful here.

I'm using assemblies. I didn't get any reasons for the failures (there was no 'qcreport.txt' as mentioned in the documentation). I only got a .h5 file and the following screen output (this is for the 12-sample trial). I tried adding '--serial', but it didn't change anything.

PopPUNK: assign
(with backend: sketchlib v2.1.4
sketchlib: /usr/local/anaconda3/envs/poppunk/lib/python3.10/site-packages/pp_sketchlib.cpython-310-darwin.so)
Mode: Assigning clusters of query sequences

Graph-tools OpenMP parallelisation enabled: with 1 threads
Sketching 12 genomes using 1 thread(s)
Progress (CPU): 12 / 12
Writing sketches to file
Running QC on sketches
Loading previously refined model
Completed model loading
WARNING: versions of input databases sketches are different, results may not be compatible
Calculating distances using 1 thread(s)
Progress (CPU): 100.0%
Running QC on distance matrix
Selected type isolate for distance QC is 00-3638-4B
12 samples failed:

> For visualisation, try adding your query database too:
>
> poppunk_visualise --ref-db Ssuis_distribute_refs --query-db poppunk_clusters --output suis_viz --cytoscape

Did you mean adding the query database after updating the database (--update-db)? If you meant adding it without --update-db, then I already did that, as shown in the commands from my first comment (restated below), and didn't get Query nodes in the .graphml network or a "Status" column in the CSV output (although it does have cluster info for both Reference and Query IDs). Sorry for the confusion, and if I wasn't clear in my first comment.

Code:
poppunk_assign --db Ssuis_distribute_refs --query rlist.txt --output poppunk_clusters --threads 8
poppunk_visualise --ref-db Ssuis_distribute_refs --query-db poppunk_clusters --output OUTPUT --cytoscape --network-file Ssuis_distribute_refs/Ssuis_distribute_refs.refs_graph.gt

johnlees commented 4 weeks ago

> Would you recommend an earlier PopPunk version for my analysis since the existing database must have been created using an earlier version? I guess my worry is whether version differences (in poppunk and sketchlib) can affect correctly identifying genome clusters.

No, this won't be a problem

> I'm using assemblies. I didn't get any reasons for failing (no 'qcreport.txt' as mentioned in the documentation). I only got a .h5 file and the following screen output (this is for the 12-sample trial). I tried adding '--serial' but didn't change anything.

Based on the output, I think your samples are failing the distance quality check, which is described here: https://poppunk.bacpop.org/qc.html#qc-of-pairwise-distances

You could try increasing --max-a-dist to 1.

RuwiniK commented 3 weeks ago

Hello, I'm sorry for the late response.

> Based on the output I think your samples are failing on the distance quality check, which is described here: https://poppunk.bacpop.org/qc.html#qc-of-pairwise-distances
>
> You could try increasing --max-a-dist to 1

Thank you very much for this suggestion; it worked. The paper that published the reference database outlined that the lineages were determined solely based on core distances. This approach could be to accommodate potentially high accessory values in S. suis, so increasing the accessory distance to 1 makes sense. Thank you.

About Cytoscape: I'm not sure whether the Cytoscape question in my last comment caught your attention, since it was buried under a lengthy output and very easy to miss. So I'm repeating the question here. I would greatly appreciate your insights on this issue too.

> For visualisation, try adding your query database too:
>
> poppunk_visualise --ref-db Ssuis_distribute_refs --query-db poppunk_clusters --output suis_viz --cytoscape

Did you mean adding the query database after updating the database (--update-db)? If you meant adding it without --update-db, then I already did that, as shown in the commands from my first comment (restated below), and didn't get Query nodes in the .graphml network or a "Status" column in the CSV output (although it does have cluster info for both Reference and Query IDs). Sorry for the confusion, and if I wasn't clear in my first comment.

Code:
poppunk_assign --db Ssuis_distribute_refs --query rlist.txt --output poppunk_clusters --threads 8
poppunk_visualise --ref-db Ssuis_distribute_refs --query-db poppunk_clusters --output OUTPUT --cytoscape --network-file Ssuis_distribute_refs/Ssuis_distribute_refs.refs_graph.gt

Thank you very much.

johnlees commented 2 weeks ago

The second issue: ah, I would have expected that to work. Yes, you could try with --update-db, or just add the column manually.
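
For anyone taking the manual route, here is a rough sketch of adding the "Status" column with awk. It is not part of PopPUNK. The helper name add_status is made up for illustration, and it assumes that the query list is tab-separated with sample names in the first column (like rlist.txt), that the first column of the Cytoscape CSV holds the same sample names, and that the CSV has a header row and no quoted commas. Check those assumptions against your own files before relying on it.

```shell
#!/bin/sh
# add_status QUERY_LIST CSV
# Hypothetical helper (not a PopPUNK command): prints CSV to stdout with an
# extra "Status" column, "Query" for samples named in QUERY_LIST and
# "Reference" for everything else.
# Assumptions: QUERY_LIST is tab-separated with the sample name in column 1;
# CSV has a header row, the sample name in column 1, and no quoted commas.
add_status() {
    awk -F',' '
        # First file: remember every query sample name.
        FILENAME == ARGV[1] {
            split($0, f, "\t")
            if (f[1] != "") query[f[1]] = 1
            next
        }
        # Second file, header row: append the new column name.
        FNR == 1 { print $0 ",Status"; next }
        # Second file, data rows: label by membership in the query list.
        { print $0 "," (($1 in query) ? "Query" : "Reference") }
    ' "$1" "$2"
}

# Example call (the CSV file name here is a placeholder, use your own):
# add_status rlist.txt my_cytoscape.csv > my_cytoscape_with_status.csv
```

The output goes to a new file so the original CSV is left untouched, and the new file can then be imported into Cytoscape as the node table.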