brinkmanlab / IslandCompare

Pipeline for detecting and annotating genomic islands and relationships between the respective genomes

Not all genomes show on visualization #107

Open KeithTilley opened 7 years ago

KeithTilley commented 7 years ago

Test analyses with the three test genomes occasionally show only two of them. All genomes are present in the 'data' variable of the 'createVisualization' function in result.html and are visible in the REST framework at localhost:8000/analysis/results/X, but there is only one alignment, and only the displayed genomes appear in 'newick'. The error is believed to be in the computation of the newick result, since the visualization and the alignment both use the newick result to select which genomes to run.

KeithTilley commented 7 years ago

Roughly 30 test runs were done with the test genomes, and only three different 'newick' results were found:

"(78:8.48918,77:0.00554,7:0.00553):0.00000;↵"
"(7:0.00541,78:8.49648,77:0.00560):0.00000;↵"
"(77:200.10367,78:200.10367):0.00000;↵"

(where 7 = CP000305, 77 = AE009952, and 78 = BX936398)

Some variance is expected, as running Parsnp independently from IslandCompare also results in three different newick results. However, all independent Parsnp runs returned newick results with all three genomes present:

(BX936398:8.48898,CP000305:0.00554,AE009952:0.00553):0.00000;
(CP000305:0.00541,BX936398:8.49648,AE009952:0.00560):0.00000;
(AE009952:0.00552,BX936398:8.51916,CP000305:0.00529):0.00000;

Note that the first result in each set differs slightly in the value assigned to genome BX936398/78 (8.48918 versus 8.48898), and the second results are the same.

I am unable to discern any pattern in the occurrence of the Parsnp results. While they are limited to three arrangements, the choice appears to be random.
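A result with a missing genome, like the third IslandCompare newick above, can be detected mechanically by comparing the leaf labels of the tree against the genomes that were submitted. The following is only a minimal sketch and not IslandCompare code: the function names `newick_leaf_labels` and `missing_genomes` are hypothetical, and the regex assumes simple, unquoted Newick labels like the ones in this issue.

```python
import re


def newick_leaf_labels(newick):
    """Return the set of leaf labels in a simple (unquoted) Newick string."""
    # A leaf label directly follows '(' or ','. Internal node labels and the
    # root branch length follow ')' and are therefore not matched.
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def missing_genomes(newick, expected_labels):
    """Return the expected labels that do not appear as leaves in the tree."""
    return set(expected_labels) - newick_leaf_labels(newick)


if __name__ == "__main__":
    expected = ["7", "77", "78"]
    # Newick strings copied from the results listed in this comment.
    complete = "(78:8.48918,77:0.00554,7:0.00553):0.00000;\n"
    truncated = "(77:200.10367,78:200.10367):0.00000;\n"
    print(missing_genomes(complete, expected))
    print(missing_genomes(truncated, expected))
```

An empty set means every submitted genome made it into the tree; a non-empty set names the genomes that were dropped.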

KeithTilley commented 7 years ago

I have recreated the error independently from IslandCompare by renaming the .fna files from AE009952.fna, BX936398.fna and CP000305.fna to 77.fna, 78.fna and 7.fna and running Parsnp on these files to mimic the case that was running in IslandCompare. This makes Parsnp return any one of the three newick results as seen in the above comment, where the third result only contains two genomes.

It turns out that the newick results returned by Parsnp are affected by the file names. Running just the genomes AE009952 (77) and CP000305 (7) in IslandCompare frequently causes the Parsnp error "Parsnp requires 2 or more genomes to run, exiting", even though the celery logs show that both genomes are present in gbk_paths and gbk_metadata. I speculate that the names '7.fna' and '77.fna' could be responsible for the error. Possibly related to issue #106.

KeithTilley commented 7 years ago

The bug was caused by Parsnp and occurred when one .fna file name was made up of repetitions of another, for example 7.fna and 77.fna, or 1.fna and 111.fna. This caused some runs of Parsnp to return a newick that was missing one or more of the genomes, leading to the visualization issues. When there were only two genomes and this issue occurred, Parsnp would fail because it could not recognise at least two files.
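The problematic naming pattern described above can be checked for before Parsnp is started. The sketch below is an illustration only: `is_repetition` and `repetition_conflicts` are hypothetical helper names, and the check covers exactly the pattern reported here (one file stem being another stem repeated two or more times). Whether Parsnp is also affected by other similarities between names is not established in this issue.

```python
from itertools import combinations
from pathlib import Path


def is_repetition(short, long):
    """True if `long` is `short` repeated two or more times (e.g. '7' and '77')."""
    if not short or len(long) <= len(short) or len(long) % len(short) != 0:
        return False
    return short * (len(long) // len(short)) == long


def repetition_conflicts(paths):
    """Return (shorter, longer) pairs of file stems that match the reported pattern."""
    # Sort by length first so that each pair is ordered (shorter, longer).
    stems = sorted({Path(p).stem for p in paths}, key=lambda s: (len(s), s))
    return [(a, b) for a, b in combinations(stems, 2) if is_repetition(a, b)]


if __name__ == "__main__":
    print(repetition_conflicts(["7.fna", "77.fna", "78.fna"]))
    print(repetition_conflicts(["AE009952.fna", "BX936398.fna", "CP000305.fna"]))
```

A non-empty result could be used to rename the input files, or at least to log a warning, before the alignment runs.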

The issue was solved by making Parsnp retry when these errors occur. This works because the issue does not occur every time, as there is a degree of randomness to Parsnp's results.
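The retry approach could look roughly like the following. This is a sketch under assumptions, not the actual fix committed to IslandCompare: `run_parsnp` stands for any hypothetical callable that runs Parsnp once and returns the newick string (raising `RuntimeError` if Parsnp exits with an error), and `IncompleteTreeError`, `run_with_retry` and `max_attempts` are names invented for the example.

```python
import re


class IncompleteTreeError(RuntimeError):
    """Raised when no run produced a tree containing every expected genome."""


def run_with_retry(run_parsnp, expected_labels, max_attempts=5):
    """Call `run_parsnp` until its newick output contains all expected leaves."""
    expected = set(expected_labels)
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            newick = run_parsnp()
        except RuntimeError as exc:
            # Assumed failure mode, e.g. "Parsnp requires 2 or more genomes to run".
            last_error = exc
            continue
        # Leaf labels follow '(' or ',' in simple, unquoted Newick strings.
        leaves = set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))
        missing = expected - leaves
        if not missing:
            return newick
        last_error = IncompleteTreeError(
            f"attempt {attempt}: tree is missing {sorted(missing)}"
        )
    raise IncompleteTreeError(
        f"no complete tree after {max_attempts} attempts"
    ) from last_error
```

Bounding the number of attempts keeps a persistent failure from looping forever, and the final exception chains the last underlying error for the logs.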

innovate-invent commented 5 years ago

This may be due to https://github.com/marbl/parsnp/issues/6

innovate-invent commented 5 years ago

Looks like ParSNP dev is dead. A possible alternative to ParSNP:
https://sourceforge.net/projects/ksnp/
https://www.duo.uio.no/bitstream/handle/10852/60016/Sebastian_Soberg_Master_Thesis.pdf?sequence=5

As of Nov 2019, ParSNP has gone back into active development.

innovate-invent commented 5 years ago

Note: Modify the tool to check its output and set a failed state when the output is incomplete. The job runner should then reschedule the job.