A-J-F-Mackintosh / syngraph

Toolkit for evolutionary analyses of linkage groups
GNU General Public License v3.0

No rearrangements identified #10

Closed lesserof2weevils closed 1 month ago

lesserof2weevils commented 3 months ago

Hi,

Thanks for putting together this cool piece of software. I'm having an issue with the infer step: it runs without errors, but no rearrangements are identified. I'm using BUSCOs for the input markers and an inferred tree. It seems that it's only finding a median genome with 1 chromosome? The assemblies I'm testing have chromosome numbers ranging from 11 to 39. Any advice would be great!

Dataset I'm testing: https://www.biorxiv.org/content/10.1101/2024.06.25.600716v1.full

Thanks, James

[+] Creating Syngraph from file ...
[+] Show Syngraph metrics ...
[=] ====================================
[=] Taxa = 7
[=] Nodes (Markers) = 2116
[=] Nodes (Markers) shared by > 1 taxon = 2114
[=] Nodes (Markers) shared by all taxa = 1625
[=] Distinct Edges (Adjacencies) = 12709
[=] Subgraphs (connected components) = 1
[=] ====================================
[+] Starting traversal ...
[+] ========================================================================
[+] Inferring median genome for n3 using data from Dpon, Dvale, and Pcerv ...
[=] Generated 1 LMSs containing 1830 markers
[=] A total of 286 markers are not assigned to an LMS
[=] Generated 1 connected components
[=] Found a median genome with 1 chromosomes
[=] Assigned 253 of the markers not within any LMS
[=] ========================================================================
[+] Inferring median genome for n10 using data from ESF13, EFF26, and Tbico ...
[=] Generated 1 LMSs containing 2011 markers
[=] A total of 105 markers are not assigned to an LMS
[=] Generated 1 connected components
[=] Found a median genome with 1 chromosomes
[=] Assigned 82 of the markers not within any LMS
[=] ========================================================================
[+] Inferring median genome for n8 using data from Tbico, n10, and Initi ...
[=] Generated 1 LMSs containing 1893 markers
[=] A total of 223 markers are not assigned to an LMS
[=] Generated 1 connected components
[=] Found a median genome with 1 chromosomes
[=] Assigned 204 of the markers not within any LMS
[=] ========================================================================
[+] Inferring median genome for n4 using data from Initi, n8, and n3 ...
[=] Generated 1 LMSs containing 1923 markers
[=] A total of 193 markers are not assigned to an LMS
[=] Generated 1 connected components
[=] Found a median genome with 1 chromosomes
[=] Assigned 176 of the markers not within any LMS
[=] ========================================================================
[+] Inferring median genome for n2 using data from n3, n4, and Pcerv ...
[=] Generated 1 LMSs containing 2037 markers
[=] A total of 79 markers are not assigned to an LMS
[=] Generated 1 connected components
[=] Found a median genome with 1 chromosomes
[=] Assigned 63 of the markers not within any LMS
[=] ========================================================================
[+] Save Syngraph to file ...
[+] Saved Syngraph in 'scolytinae_syngraph.with_ancestors.pickle'
[*] Total runtime: 0.237s

A-J-F-Mackintosh commented 3 months ago

Hi James,

Inferring ancestral genomes with only one chromosome is possible, but it is usually a sign of an extreme rearrangement history (where syngraph can't infer anything useful), or a mistake in the input files.

The fact that only 1 LMS (linked marker set) is generated for each triplet (and no rearrangements are inferred) suggests that there could be a problem with the input files. For example, if you gave all sequences the same name within each tsv file, then this is the kind of result I would expect.
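
A quick way to check for this can be sketched in Python. It assumes the marker files are tab-delimited with four columns and that the sequence name sits in the second column; `count_sequence_names` is a hypothetical helper for illustration, not part of syngraph.

```python
import csv
from collections import Counter


def count_sequence_names(tsv_lines):
    """Count markers per sequence name (assumed to be column 2) in a marker tsv."""
    counts = Counter()
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 4:
            # Skip blank or malformed lines rather than guessing their meaning.
            continue
        counts[row[1]] += 1
    return counts


# Two rows in the assumed four-column layout, both with the same name in column 2.
example = [
    "7at33392\tDpon\t14899570\t14932685",
    "123at33392\tDpon\t6609335\t6621773",
]

counts = count_sequence_names(example)
if len(counts) == 1:
    # A single distinct name means every marker appears to sit on one sequence.
    print("warning: only one sequence name found:", next(iter(counts)))
```

To run it on a real file, pass an open file handle instead of the list, e.g. `count_sequence_names(open("genus_species.tsv", newline=""))`. A file with only one distinct name in that column would look like a single chromosome to any tool reading it.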

Could you share an example tsv file, or even just the first and last ten lines of one?

Alex

lesserof2weevils commented 3 months ago

Hi Alex,

Below is an example of the first and last ten lines of one of the .tsv files. They are all tab-delimited and were taken from the BUSCO full_table.tsv, and all sequences have different names. From a synteny analysis we identified extreme rearrangements in these species, with chromosome counts tripling.

Thanks, James

7at33392    Dpon    14899570    14932685
123at33392  Dpon    6609335 6621773
263at33392  Dpon    50257718    50284632
416at33392  Dpon    11714338    11720402
693at33392  Dpon    20000715    20012903
695at33392  Dpon    8381999 8405013
727at33392  Dpon    14550117    14564478
734at33392  Dpon    5485439 5496643
764at33392  Dpon    14207125    14224420
786at33392  Dpon    12709360    12724385
133879at33392   Dpon    457075  457521
133986at33392   Dpon    3685174 3685705
134030at33392   Dpon    19179151    19180241
134292at33392   Dpon    4850650 4851797
134501at33392   Dpon    34527726    34545461
134899at33392   Dpon    4451849 4453009
135025at33392   Dpon    19585697    19588931
135312at33392   Dpon    1735116 1735841
135764at33392   Dpon    5213116 5213812
135985at33392   Dpon    8767213 8767708
137542at33392   Dpon    11543731    11544226

A-J-F-Mackintosh commented 3 months ago

Hi,

The second column in the tsv file should contain the name of the sequence (e.g. the chromosome) that the marker is found on, not the taxon name (which is instead specified by the file name, e.g. genus_species.tsv). So your tsv file should look something more like:

7at33392    chromosome_1    14899570    14932685
123at33392  chromosome_1    6609335 6621773
...
135985at33392   chromosome_20   8767213 8767708
137542at33392   chromosome_20   11543731    11544226

I would suggest extracting these four fields (marker ID, sequence name, start, end) directly from the tsv file generated by BUSCO.
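
A minimal sketch of that extraction in Python follows. It assumes the BUSCO table is tab-delimited with comment lines starting with `#` and columns in the order BUSCO ID, status, sequence, gene start, gene end; check this against the header of your own file, since the layout can differ between BUSCO versions. Keeping only `Complete` entries is also an assumption made here, and `busco_to_marker_rows` is a hypothetical helper, not part of syngraph or BUSCO.

```python
# Assumption: only single-copy complete BUSCOs are wanted as markers.
KEEP_STATUSES = frozenset({"Complete"})


def busco_to_marker_rows(lines, keep_statuses=KEEP_STATUSES):
    """Return (busco_id, sequence, start, end) tuples from BUSCO table lines."""
    rows = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        # Missing BUSCOs have no coordinates, so they have fewer than 5 fields.
        if len(fields) < 5 or fields[1] not in keep_statuses:
            continue
        busco_id, _status, sequence, start, end = fields[:5]
        rows.append((busco_id, sequence, start, end))
    return rows


# Made-up lines in the assumed layout, for illustration only.
sample = [
    "# comment line\n",
    "7at33392\tComplete\tchromosome_1\t14899570\t14932685\t+\t1000.0\t500\n",
    "9at33392\tMissing\n",
    "11at33392\tDuplicated\tchromosome_2\t100\t200\t+\t50.0\t30\n",
]

for row in busco_to_marker_rows(sample):
    print("\t".join(row))
```

With a real table, the same function can be fed an open file handle, and the printed lines redirected to a file named after the taxon (e.g. genus_species.tsv).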

I would expect this to resolve your issue with the infer module, but I'll leave the issue open and you can let me know how it goes.

Cheers,

Alex

lesserof2weevils commented 3 months ago

Hi Alex,

Thanks for this. Yes, this was indeed the problem! It has been resolved now.