Closed yi1873 closed 2 years ago
Are you measuring pangenome size by looking at the graph.fastg or the NBPs.fasta file? The graph.fastg has a higher length since it includes overlaps at the sides of each NBP.
How can I control the overlap length at the sides of each NBP?
You don't need to. In the graph.fastg file, the overlap size is the kmer length. But for measuring the pangenome size you should just count the number of bps in the NBPs.fasta file, which does not contain overlaps.
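Counting the bps in a FASTA file can be sketched with a small shell function like the one below. This is not part of SuperPang; the function name count_bases is hypothetical, and it assumes plain, uncompressed FASTA where every line not starting with ">" is sequence.

```shell
# Hypothetical helper (not a SuperPang command): sum the sequence length
# over one or more FASTA files, skipping header lines that start with ">".
count_bases() {
    # Strip spaces, tabs and carriage returns so Windows line endings
    # do not inflate the count; "total + 0" prints 0 for empty input.
    awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) } END { print total + 0 }' "$@"
}
```

For example, count_bases 11709/pangenome/NBPs.fasta would report the pangenome size and count_bases 11709/genome/*.fa the combined size of the inputs. The paths come from the command in this thread; that NBPs.fasta sits directly inside the output directory is an assumption.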
The pangenome size I got was calculated by counting the number of bps in assembly.fasta.
That should have been ok too. Any chance you can share your input genomes with me?
The HIV-2 genomes were selected from the GenBank assembly summary at https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt using:
cat assembly_summary_genbank.txt | awk -F '\t' '$7=="11709"{print $1}'
I cannot reproduce this using the latest version of SuperPang.
I got 39 HIV genomes using your command.
In total, they contained 391049 bases. The assembly.fasta file had 355128 bases, which is less.
When generating the pangenome of the HIV-2 species, I got a 421,130 bp pangenome, which is larger than the combined size of all the input genomes. How should I interpret this?
SuperPang.py --fasta 11709/genome/*.fa --output-dir 11709/pangenome --force-overwrite -t 20 --assume-complete -b 0.95 -i 0.95 -k 301