ekg / seqwish

alignment to variation graph inducer
MIT License

Genome graph induced from wfmash+seqwish is still biased #121

Open yihangs opened 9 months ago

yihangs commented 9 months ago

Hi there,

As shown in the seqwish paper, the genome graph induced by wfmash+seqwish is unbiased, i.e., the graph does not depend on the order of the input genomes. However, a recent paper, "Comparing methods for constructing and representing human pangenome graphs", says "The first two of the three phases of the pggb pipeline (all-vs-all alignment and graph imputation) produce the same result on different runs with the same input but differences arise when the order of the input haplotypes changes" (Section: Stability). Since the first two phases of pggb are exactly wfmash and seqwish, this result seems to indicate that the induced graph is still biased. Do you know why these two papers reach contradictory conclusions?

Thanks!

ekg commented 8 months ago

Rerunning smoothxg will generate a slightly different version of the graph. This is due to the stochastic parallel sorting of the graph that happens before blocks are selected for the MSA.

Changing the order of genomes should not itself have an effect on the structure of the graph. But rerunning smoothxg will almost always produce a very slightly different result due to this inherent randomness.

This could be eliminated by running the sort algorithm in smoothxg single-threaded. Then you could check for the effect of changing the input genome ordering in the FASTA file.
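For the input-ordering check, a minimal Python sketch of producing a reordered copy of the FASTA is given here. The function names `read_fasta_records` and `reorder_fasta` are hypothetical helpers, not part of wfmash, seqwish, or smoothxg, and the sketch assumes a plain uncompressed FASTA small enough to hold in memory.

```python
import random


def read_fasta_records(text):
    """Split FASTA text into records: each is [header_line, seq_line, ...]."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])
        elif records and line.strip():
            # Sequence line belonging to the most recent header.
            records[-1].append(line)
    return records


def reorder_fasta(text, seed):
    """Return the same records in a seeded pseudo-random order.

    Only the order of the records changes; headers and sequences are
    left untouched, so any difference in the induced graph can be
    attributed to input order alone.
    """
    records = read_fasta_records(text)
    random.Random(seed).shuffle(records)
    return "\n".join("\n".join(rec) for rec in records) + "\n"
```

Each reordered file would then go through the same pipeline with identical parameters, and the resulting graphs would be compared. Using a fixed seed keeps the permutation reproducible, so a reported difference can be rerun on exactly the same input order.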

yihangs commented 8 months ago

Thanks! However, according to their paper, changing the order of genomes will affect the structure of the graph before the smoothxg step, which means this effect should come from wfmash or seqwish.

subwaystation commented 8 months ago

I was never able to observe such a situation. I was also confused to read this in their paper. It was already extensively shown in https://academic.oup.com/bioinformatics/article/39/1/btac743/6854971 that seqwish itself is deterministic and not affected by the order of the input genomes. I think their paper may have a hiccup here.
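One caveat when testing this: two GFA files can describe the same graph while differing in node IDs and line order, so a byte-level diff may report differences that are not structural. A minimal sketch of an ID-independent comparison follows, assuming GFA 1 output with `S` and `P` lines and identical path names in both files. The function `graph_signature` is a hypothetical helper, and it only compares how path sequence is partitioned into nodes; it ignores `L` lines.

```python
from collections import defaultdict


def graph_signature(gfa_text):
    """Describe each node by the path intervals it covers, not by its ID."""
    seg_len = {}
    paths = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            seg_len[fields[1]] = len(fields[2])
        elif fields[0] == "P":
            paths.append((fields[1], fields[2].split(",")))

    # node ID -> list of (path name, start, end, orientation)
    occurrences = defaultdict(list)
    for name, steps in paths:
        offset = 0
        for step in steps:
            node, orient = step[:-1], step[-1]
            length = seg_len[node]
            occurrences[node].append((name, offset, offset + length, orient))
            offset += length

    flip = {"+": "-", "-": "+"}
    signature = set()
    for occ in occurrences.values():
        occ.sort()
        # Normalize strand so a node stored as its reverse complement
        # yields the same description.
        if occ[0][3] == "-":
            occ = [(n, s, e, flip[o]) for n, s, e, o in occ]
        signature.add(tuple(occ))
    return signature
```

If `graph_signature` returns equal sets for the graphs built from two input orders, the path sequences were merged into nodes in the same way, whatever the node numbering. Unequal sets would point to a real structural difference worth reporting with the exact commands and inputs.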