Constructing pan-genome

Tonitsk8264 commented 11 months ago

Dear Developer,

I am currently using Minigraph-Cactus to build a wheat pan-genome from 24 wheat genomes, using sequences from specific regions on the same chromosome. However, in the resulting GFA file, I find that only some of the wheat genomes are included as paths, not all of them.

I suspect this may be due to the high level of divergence between the samples. Although I set the minIdentity parameter to 0.5 in the cactus_progressive_config.xml configuration file, this did not produce the results I expected. I would therefore like to ask which parameters in the configuration file I should modify to better handle the divergence between samples, so that the chromosome sequences from all of the wheat genomes are correctly included and a complete pan-genome is built from all samples.

Thank you for your time and support.

Best regards.

W       Avent_RM271     0       Chr6N   0       37412338        >1>2>3>4>5>7>8>10>11>13>14>15>16>18>19>21>22>24>25>27>28>30>31>33>34>3>
W       Taest_CDCStanley        0       2A      0       33570816        >3>4>5>7>8>10>11>13>14>15>16>18>19>21>22>24>25>27>28>30>31>33>>
W       Taest_Jagger    0       2A      0       32615472        >3>4>5>7>8>10>11>13>14>15>16>18>19>21>22>24>25>27>28>30>31>33>34>36>37>
W       Taest_Mace      0       2A      0       33050520        >564491>564493>564494>564496>564497>564498>564499>564501>564502>564503>
W       Taest_Renan     0       chr2A   0       34099443        >2>3>4>5>6>8>9>11>12>14>16>17>19>20>22>23>25>26>28>29>31>32>34>35>37>3>
W       Taest_SYMattis  0       2A      0       31852674        >3>4>5>7>8>10>11>13>14>15>16>18>19>21>22>24>25>27>28>30>31>33>34>36>37
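For anyone wanting to check the same thing on their own output, here is a minimal sketch that lists which samples have W-lines (walks) in a GFA and which expected samples are missing. It assumes the GFA 1.1 W-line layout (W, sample, haplotype index, contig, start, end, walk); the function names and the tiny example input are hypothetical and are not part of Minigraph-Cactus.

```python
def _coord(value):
    """GFA allows '*' for an unknown walk coordinate; map it to None."""
    return None if value == "*" else int(value)


def walk_samples(gfa_lines):
    """Map each sample name to the (contig, start, end) of its W-lines."""
    samples = {}
    for line in gfa_lines:
        fields = line.split()
        # W-lines have 7 fields: W, sample, hap index, contig, start, end, walk
        if len(fields) < 7 or fields[0] != "W":
            continue
        sample, _hap, contig, start, end = fields[1:6]
        samples.setdefault(sample, []).append((contig, _coord(start), _coord(end)))
    return samples


def missing_samples(expected, gfa_lines):
    """Return the expected sample names that have no W-line in the GFA."""
    return sorted(set(expected) - set(walk_samples(gfa_lines)))


# Hypothetical miniature input; the walks are shortened placeholders.
example_gfa = [
    "S\t1\tACGT",
    "W\tTaest_Jagger\t0\t2A\t0\t32615472\t>1>2",
    "W\tTaest_Mace\t0\t2A\t0\t33050520\t>1>2",
]
expected_names = ["Taest_Jagger", "Taest_Mace", "Taest_Norin61"]
print(missing_samples(expected_names, example_gfa))
```

On a real file, the lines could be read with `open(path)` and passed straight to `missing_samples`, together with the sample names from the input seqfile.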

glennhickey commented 11 months ago

Minigraph doesn't work well at high divergences. Near the beginning of the log, you should be able to see the mash distances of all your genomes to the reference, and it will even give you a warning if any seem too high. Are you able to share this part of your log?

Tonitsk8264 commented 11 months ago

cactus-pangenome.log

Yes, some wheat genomes have higher mash distances from the reference. In this case, can we adjust the parameters to add these genomes to the pan-genome?

glennhickey commented 11 months ago

Yeah, there's supposed to be a warning for distances > 0.02 -- strange that it's not in your log. But anyway, 0.097 is way higher than minigraph-cactus is used to dealing with, and I don't think there are any parameters to change this.

You'd have to cut down your inputs to only genomes <0.02 from the reference, or you can make a tree (e.g., with mashtree) and properly align this data with Progressive Cactus. You can also try PGGB, which lets you map with more sensitive parameters, but if your final graph has a mutation at every position, you may struggle to use it for anything.

mash distance of Turar_G1812 (size = 24691619) to reference Avent_RM271 = 0.0974869
mash distance of Tmono_TA299 (size = 27091863) to reference Avent_RM271 = 0.0968463
mash distance of Taest_LongReachLancer (size = 25471049) to reference Avent_RM271 = 0.0968463
mash distance of Tmono_PI306540 (size = 27777195) to reference Avent_RM271 = 0.0949813
mash distance of Tduru_Svevo (size = 32440310) to reference Avent_RM271 = 0.0949813
mash distance of Taest_Aikang58 (size = 25900446) to reference Avent_RM271 = 0.0949813
mash distance of Taest_Fielder (size = 31607142) to reference Avent_RM271 = 0.0943778
mash distance of Taest_CDCLandmark (size = 26632610) to reference Avent_RM271 = 0.0937829
mash distance of Tmono_TA10622 (size = 28080119) to reference Avent_RM271 = 0.0926182
mash distance of Taest_Norin61 (size = 26012014) to reference Avent_RM271 = 0.0926182
mash distance of Taest_Kenong9204 (size = 30582808) to reference Avent_RM271 = 0.0926182
mash distance of Taest_Kariega (size = 30594703) to reference Avent_RM271 = 0.0914855
mash distance of Taest_Julius (size = 30246075) to reference Avent_RM271 = 0.0914855
mash distance of Taest_ChineseSpring (size = 29059918) to reference Avent_RM271 = 0.0914855
mash distance of Taest_ArinaLrFor (size = 27350083) to reference Avent_RM271 = 0.0914855
mash distance of Ttibe_Zang1817 (size = 29365020) to reference Avent_RM271 = 0.0898429
mash distance of Tspel_PI190962 (size = 30978064) to reference Avent_RM271 = 0.0893096
mash distance of Tdico_Zavitan (size = 30618420) to reference Avent_RM271 = 0.0857608
mash distance of Taest_Renan (size = 34099443) to reference Avent_RM271 = 0.00461146
mash distance of Taest_SYMattis (size = 31852674) to reference Avent_RM271 = 0.00157453
mash distance of Taest_Jagger (size = 32615472) to reference Avent_RM271 = 0.00128842
mash distance of Taest_CDCStanley (size = 33570816) to reference Avent_RM271 = 0.00105797
mash distance of Taest_Mace (size = 33050520) to reference Avent_RM271 = 0.0010072
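Cutting the inputs down to genomes within 0.02 of the reference can be scripted from log lines like the ones above. The following is a rough sketch only: the regular expression is written against the line format shown in this log, and the helper names and the `MAX_DIST` constant are hypothetical, not anything provided by cactus.

```python
import re

# Threshold taken from the discussion above (warning expected above 0.02).
MAX_DIST = 0.02

# Matches lines such as:
# mash distance of Taest_Mace (size = 33050520) to reference Avent_RM271 = 0.0010072
LINE_RE = re.compile(
    r"mash distance of (\S+) \(size = (\d+)\) to reference (\S+) = ([0-9.eE+-]+)"
)


def parse_mash_distances(log_text):
    """Return {genome_name: distance} for every mash-distance line found."""
    distances = {}
    for match in LINE_RE.finditer(log_text):
        name, _size, _reference, dist = match.groups()
        distances[name] = float(dist)
    return distances


def split_by_threshold(distances, max_dist=MAX_DIST):
    """Split genome names into (keep, drop) lists around max_dist."""
    keep = sorted(n for n, d in distances.items() if d < max_dist)
    drop = sorted(n for n, d in distances.items() if d >= max_dist)
    return keep, drop


# A few lines copied from the log above.
sample_log = """\
mash distance of Turar_G1812 (size = 24691619) to reference Avent_RM271 = 0.0974869
mash distance of Tdico_Zavitan (size = 30618420) to reference Avent_RM271 = 0.0857608
mash distance of Taest_Renan (size = 34099443) to reference Avent_RM271 = 0.00461146
mash distance of Taest_Mace (size = 33050520) to reference Avent_RM271 = 0.0010072
"""

keep, drop = split_by_threshold(parse_mash_distances(sample_log))
print("keep:", keep)
print("drop:", drop)
```

The `keep` list can then be used to trim the input seqfile by hand before rerunning; the genomes in `drop` would need a different approach, such as the Progressive Cactus or PGGB routes mentioned above.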

Tonitsk8264 commented 11 months ago

Thanks for your reply and suggestions!

Tonitsk8264 commented 11 months ago

Sorry to bother you again, but I have another question about pan-genome construction. I hope you can help me figure it out:

Does minigraph-cactus support incremental construction? For instance, could I first build a pan-genome 'Pn' from n sequences and then add a new sequence 'x' to extend it from 'Pn' to 'Pn+1', instead of rebuilding the pan-genome from scratch with all n+1 sequences?

ComparativeGenomicsToolkit / cactus

Constructing pan-genome #1244