merge_cutup_clustering.py truncates contig name when three zeros follow period

ccgallen commented 2 years ago

Hello, I am using CONCOCT 1.0.0 and recently discovered odd behavior when the file clustering_gt1000.csv is processed with merge_cutup_clustering.py to generate clustering_gt1000_merged.csv.

Examples of lines in clustering_gt1000.csv are as follows:

NODE_1_length_2595161_cov_8.709327.37,104
NODE_1_length_2595161_cov_8.709327.38,65
NODE_1_length_2595161_cov_8.709327.39,104
NODE_114750_length_1831_cov_0.514671,38
NODE_231037_length_1147_cov_1.000980,144

Longer contigs have been split into fragments, and each fragment is assigned to a cluster (the value after the comma). For those, a ".\d+" suffix is added to the original contig name to identify the fragment (first three lines). The last two contigs were not broken into fragments and do not have an extra ".\d+" suffix. So far so good.

After processing with merge_cutup_clustering.py, each contig is assigned a single cluster. The odd part is when the name has a period followed by three or more 0s. In this case the name is clipped at the period, while the others remain as they should. Here are the results from my example:

NODE_1_length_2595161_cov_8.709327,104
NODE_114750_length_1831_cov_0.514671,38
NODE_231037_length_1147_cov_1,144 (and not NODE_231037_length_1147_cov_1.000980,144)

This is messing up downstream analysis because the contig names in the .fasta file do not match my cluster assignments. Any idea why this might be happening? I have attached the merge_cutup_clustering.py code that I have installed below.

Thanks!!

```
#!/data/ccallen/miniconda/envs/metawrap-env/bin/python
"""
With contigs cutup with cut_up_fasta.py as input, sees to that the consequtive
parts of the original contigs are merged.

prints result to stdout.

@author: alneberg
"""
from __future__ import print_function
import sys
import os
import argparse
from collections import defaultdict, Counter

def original_contig_name_special(s):
    n = s.split(".")[-1]
    try:
        int(n)
    except:
        return s, 0
    # Only small integers are likely to be
    # indicating a cutup part.
    if int(n) < 1000:
        return ".".join(s.split(".")[:-1]), int(n)
    else:
        # A large n indicates that the integer
        # was part of the original contig
        return s, 0

def main(args):
    all_seqs = {}
    all_originals = defaultdict(dict)
    first = True
    with open(args.cutup_clustering_result, 'r') as ifh:
        for line in ifh:
            if first:
                first=False
                continue
            line = line.strip()
            contig_id, cluster_id = line.split(',')
            original_contig_name, part_id = original_contig_name_special(contig_id)
            all_originals[original_contig_name][part_id] = cluster_id

    merged_contigs_stack = []

    sys.stdout.write("contig_id,cluster_id\n")
    for original_contig_id, part_ids_d in all_originals.items():
        if len(part_ids_d) > 1:
            c = Counter(part_ids_d.values())
            cluster_id = c.most_common(1)[0][0]
            c_string = [(a,b) for a, b in c.items()]
            if len(c.values()) > 1:
                sys.stderr.write("{}\t{}, chosen: {}\n".format(original_contig_id, c_string, cluster_id))
            else:
                sys.stderr.write("{}\t{}\n".format(original_contig_id, c_string))
        else:
            cluster_id = list(part_ids_d.values())[0]

        sys.stdout.write("{},{}\n".format(original_contig_id, cluster_id))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("cutup_clustering_result", help=("Input cutup clustering result."))
    args = parser.parse_args()

    main(args)
```
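For reference, the truncation can be reproduced from the `original_contig_name_special` heuristic in the attached script alone: the text after the last period is parsed as an integer, so `000980` becomes 980, which is below the 1000 threshold and is therefore treated as a fragment index. The sketch below restates that heuristic and compares it with `split_cutup_suffix_strict`, a hypothetical stricter variant (not part of CONCOCT) that rejects suffixes with leading zeros. It assumes fragment indices are written without zero padding, as in the `.37`, `.38`, `.39` examples above.

```python
import re

def original_contig_name_special(s):
    """Heuristic restated from the attached merge_cutup_clustering.py."""
    n = s.split(".")[-1]
    try:
        int(n)
    except ValueError:
        return s, 0
    # int("000980") == 980, which passes the "small integer" check.
    if int(n) < 1000:
        return ".".join(s.split(".")[:-1]), int(n)
    return s, 0

# Hypothetical pattern: 0..999 written without leading zeros.
_PART_SUFFIX = re.compile(r"0|[1-9][0-9]{0,2}")

def split_cutup_suffix_strict(s):
    """Hypothetical variant: accept the suffix as a fragment index only
    if it is a small integer with no leading zeros."""
    prefix, sep, suffix = s.rpartition(".")
    if sep and _PART_SUFFIX.fullmatch(suffix):
        return prefix, int(suffix)
    return s, 0

if __name__ == "__main__":
    for name in (
        "NODE_1_length_2595161_cov_8.709327.37",
        "NODE_114750_length_1831_cov_0.514671",
        "NODE_231037_length_1147_cov_1.000980",
    ):
        print(name)
        print("  script heuristic:", original_contig_name_special(name))
        print("  strict variant:  ", split_cutup_suffix_strict(name))
```

The strict variant only handles the leading-zero case reported here. An unsplit contig whose coverage ends in a short value without leading zeros (for example a name ending in `cov_1.5`) would still be misread by both functions, so a numeric suffix alone cannot fully distinguish fragment indices from coverage decimals.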

INFINITY1993 commented 2 years ago

This may give you a clue: https://github.com/BinPro/CONCOCT/issues/247

ccgallen commented 2 years ago

Thank you @INFINITY1993 for the tip!

merge_cutup_clustering.py truncates contig name when three zeros follow period #311