eblerjana / pangenie

Pangenome-based genome inference
MIT License

Using PanGenie for non-directed acyclic graph and pair-end reads #64

Open yihangs opened 7 months ago

yihangs commented 7 months ago

Hi,

I have two questions about the usage of PanGenie:

  1. The PanGenie paper says that the constructed graph is a directed acyclic graph. Can PanGenie also be used on graphs that are not directed acyclic graphs (e.g. graphs containing inversions and complex translocations)? If not, what are the difficulties?
  2. With paired-end reads, one read of each pair is normally a forward read and the other a reverse read. So before computing the k-mers, we would have to convert the reverse read to its reverse complement and only then do the k-mer counting. However, without alignment, we don't actually know which read is the reverse read. How does PanGenie deal with this situation?

Thanks!

eblerjana commented 7 months ago
  1. In principle, PanGenie can genotype everything that is represented as a VCF with multi-allelic, non-overlapping records. See https://github.com/eblerjana/pangenie#required-input-files for details on the input files. So I guess that inversions/translocations would produce large bubble regions when expressed in VCF, which might be tricky to genotype. Inversions in particular are hard to genotype based on kmers, because the kmer set does not change for sequence inside an inversion: the inverted sequence is the reverse complement of the original, and a kmer and its reverse complement are counted as the same kmer, so only kmers spanning the breakpoints can differ. We never really evaluated the performance specifically for these events, but my expectation is that it is probably worse than for other SVs.
  2. PanGenie works in kmer space only, so we cannot distinguish between a kmer and its reverse complement. Therefore, PanGenie counts canonical kmers (jellyfish is run with the -C switch), which means a kmer and its reverse complement are treated as equivalent. See https://github.com/gmarcais/Jellyfish/tree/master/doc#counting-k-mers-in-sequencing-reads for details.
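
To make both points concrete, here is a minimal pure-Python sketch of canonical kmer counting. It is only an illustration of the idea, not how PanGenie or Jellyfish implement it; the function names (`reverse_complement`, `canonical_kmers`) and the toy sequences are made up for this example, and taking the lexicographically smaller of a kmer and its reverse complement as the canonical form is an assumption of the sketch.

```python
from collections import Counter

# Translation table for complementing A/C/G/T (toy example: no N or lowercase).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> Counter:
    """Count canonical kmers: each kmer is represented by the
    lexicographically smaller of itself and its reverse complement."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


k = 5

# Question 2: a read and its reverse complement give identical canonical
# counts, so the orientation of each mate does not need to be known.
read = "ACGTTGCAAGGCTA"
same_counts = canonical_kmers(read, k) == canonical_kmers(reverse_complement(read), k)

# Question 1: an inversion replaces the middle segment by its reverse
# complement. Kmers fully inside the segment are unchanged in canonical
# form; only kmers overlapping the two breakpoints can differ.
left, mid, right = "ACGTACGGTCA", "TTGACCGATAGGC", "CATGGATCCAA"
ref_allele = left + mid + right
inv_allele = left + reverse_complement(mid) + right

ref_only = canonical_kmers(ref_allele, k) - canonical_kmers(inv_allele, k)
inv_only = canonical_kmers(inv_allele, k) - canonical_kmers(ref_allele, k)

print("read vs. reverse complement identical:", same_counts)
print("kmers specific to reference allele:", sum(ref_only.values()))
print("kmers specific to inverted allele:", sum(inv_only.values()))
```

However long the inverted segment is, at most 2 * (k - 1) kmers per allele can overlap a breakpoint, so that is all the signal a kmer-based approach has to separate the two alleles.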