CCM output

johnomics commented 4 years ago

Hello,

Thank you for developing GALA, it's a very interesting tool. I'm confused about how CCM works - please could you help me out with this?

In the bioRxiv preprint, it seems like all of the preliminary assemblies are collapsed into one set of linkage groups ("the contig-clustering module (CCM) pools the linked nodes within different layers and those inside the same layer into different linkage groups"); the Online Methods for CCM suggest that CCM uses the raw reads to connect contigs, and "pools all connected nodes into a linkage group", when nodes can be linked across assemblies (layers).

But in actual use, the ccm tool doesn't seem to use the raw reads at all, and it outputs a separate set of scaffolds for each draft assembly. It seems to do a good job of grouping contigs within one assembly, but it doesn't group contigs from different assemblies. So I have a different number of linkage groups for each draft assembly.

(I'm giving GALA a set of 5 preliminary assemblies, and running, for example, gala/ccm comparison 5, where comparison is a folder containing the PAF files from running draft_comp.sh.)

The paper says "GALA modelled the preliminary assemblies and raw reads into 14 independent linkage groups" for C. elegans (Online Methods) - but how did you use the raw reads, and how did you identify one set of 14 linkage groups, from the separate sets of linkage groups output for each assembly? Am I missing something about running CCM?

Many thanks, John

mawad89 commented 4 years ago

Hello John; CCM uses intra-layer links to pool nodes (contigs) into different linkage groups. In a similar way, the raw reads are also partitioned into linkage groups. In actual use, the CCM tool does not use raw reads to build the linkage groups; this avoids errors caused by sample contamination or chimeric reads.
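As a rough illustration of what "pooling linked nodes into linkage groups" means, here is a minimal sketch that treats contigs as graph nodes and returns the connected components. It is not GALA's actual code: the function name `pool_linkage_groups`, the contig names, and the links are all made up for the example, and how the links are derived from the PAF files is left out entirely.

```python
from collections import defaultdict


def pool_linkage_groups(contigs, links):
    """Pool contigs into groups = connected components of the link graph.

    Hypothetical helper for illustration only; not part of GALA.
    """
    # Union-find: every contig starts as its own group representative.
    parent = {c: c for c in contigs}

    def find(x):
        # Follow parent pointers to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each link merges the groups of its two contigs.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect contigs by representative and return a stable, sorted result.
    groups = defaultdict(list)
    for c in contigs:
        groups[find(c)].append(c)
    return sorted(sorted(g) for g in groups.values())


# Made-up contigs and links (assumptions, not real GALA output).
contigs = ["ctg1", "ctg2", "ctg3", "ctg4", "ctg5", "ctg6"]
links = [("ctg1", "ctg4"), ("ctg4", "ctg2"), ("ctg3", "ctg6")]

linkage_groups = pool_linkage_groups(contigs, links)
print(linkage_groups)
```

With these made-up links, `ctg1`, `ctg2` and `ctg4` end up in one group, `ctg3` and `ctg6` in another, and the unlinked `ctg5` forms a group of its own.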

Theoretically, we should get the same number of linkage groups from each preliminary assembly. If not, you should carefully check any linkage group that is unique to a specific assembly, for example by BLASTing it against NCBI. If nothing is abnormal, a possible reason could be the filtering criteria used by GALA, such as similarity or mapping quality. In this case, the reformat module in GALA could help you to identify and resolve the problem manually.

In the case of the C. elegans assembly, 11 linkage groups can cover almost all preliminary assemblies of the sample genome. However, each preliminary assembly contains several unique contigs from bacteria. For simplicity, we used the Flye assembly to build the linkage groups, and that set our number of linkage groups to 14.

Best wishes Mohamed

johnomics commented 4 years ago

Thank you very much for clarifying, that makes sense. When you say, "In a similar way, the raw reads are also partitioned into linkage groups", are you referring to the LGAM process, aligning the reads with bwa etc? Or do you mean to treat the raw reads as if they were a preliminary assembly and run them through comp and ccm?

mawad89 commented 4 years ago

The first explanation is correct. We are referring to the LGAM process.

johnomics commented 4 years ago

OK - thank you very much, all is clear now! Best wishes, John

ganlab / GALA

CCM output #8