fpusan / SuperPang

Non-redundant pangenome assemblies from multiple genomes or bins
BSD 3-Clause "New" or "Revised" License

The pangenome size is larger than the total size of the input genomes #6

Closed yi1873 closed 2 years ago

yi1873 commented 2 years ago

When generating the pangenome of the HIV-2 species, I got a 421,130 bp pangenome, which is larger than the combined size of all input genomes. How should I interpret this? The command I ran was:

SuperPang.py --fasta 11709/genome/*.fa --output-dir 11709/pangenome --force-overwrite -t 20 --assume-complete -b 0.95 -i 0.95 -k 301

fpusan commented 2 years ago

Are you measuring pangenome size by looking at the graph.fastg or the NBPs.fasta file? The graph.fastg has a higher length since it includes overlaps at the sides of each NBP.

yi1873 commented 2 years ago

How can I control the overlap length at the sides of each NBP?

fpusan commented 2 years ago

You don't need to. In the graph.fastg file, the overlap size is the kmer length. But to measure the pangenome size you should just count the number of bps in the NBPs.fasta file, which does not contain overlaps.
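For reference, the counting step described above can be sketched in a few lines of Python. This is only an illustration, assuming an uncompressed FASTA file. The helper names `count_bases` and `count_bases_in_file` are hypothetical and are not part of SuperPang.

```python
def count_bases(lines):
    """Sum the lengths of all sequence lines, skipping FASTA headers."""
    total = 0
    for line in lines:
        line = line.strip()
        # Header lines start with '>' and blank lines carry no bases.
        if not line or line.startswith(">"):
            continue
        total += len(line)
    return total


def count_bases_in_file(path):
    """Count bases in a plain-text FASTA file (hypothetical helper)."""
    with open(path) as handle:
        return count_bases(handle)
```

Calling `count_bases_in_file` on NBPs.fasta, and on each input genome, gives totals that can be compared directly.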

yi1873 commented 2 years ago

The pangenome size I got was calculated by counting the number of bps in assembly.fasta.

fpusan commented 2 years ago

That should have been ok too. Any chance you can share your input genomes with me?

yi1873 commented 2 years ago

The HIV-2 genomes in GenBank were selected from https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt using:

cat assembly_summary_genbank.txt | awk -F '\t' '$7=="11709"{print $1}'
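The same selection can be sketched in Python for anyone who prefers it to awk. It mirrors the filter above: keep rows whose seventh tab-separated column equals 11709 and print the first column. The function name `accessions_for_taxid` is hypothetical, and the sketch assumes the file has already been downloaded locally.

```python
def accessions_for_taxid(lines, taxid="11709"):
    """Yield column 1 for rows whose 7th tab-separated column equals taxid."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Rows with fewer than 7 columns (e.g. comment lines) cannot match.
        if len(fields) >= 7 and fields[6] == taxid:
            yield fields[0]


def print_accessions(path, taxid="11709"):
    """Print matching accessions from a local copy of the summary file."""
    with open(path) as handle:
        for accession in accessions_for_taxid(handle, taxid):
            print(accession)
```

For example, `print_accessions("assembly_summary_genbank.txt")` would list the accessions that the awk command prints.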

fpusan commented 2 years ago

I cannot reproduce this using the latest version of SuperPang. I got 39 HIV genomes using your command. In total, they contained 391049 bases. The assembly.fasta file had 355128 bases, which is less than the input total.