bhattlab / MGEfinder

A toolbox for identifying mobile genetic element (MGE) insertions from short-read sequencing data of bacterial isolates.
MIT License

The number of unique clusters in the "01.clusterseq" file does not match the number of unique clusters in the "03.sum_cluster" file #30

Closed xuanji2017 closed 2 years ago

xuanji2017 commented 2 years ago

Hi, thank you for making this great tool. I finally got the 03 results folder. But when I checked the number of unique clusters in "01.clusterseq.GCA_000210735.tsv", I found that it is not the same as the number of clusters in "03.summarize.GCA_000210735.clusters.tsv": for example, 1331 vs 1234. The same is true for the number of groups. In addition, the number of unique inferred_seq values in "01.clusterseq.GCA_000210735.tsv" is not the same as the number of contigs in "04.makefasta.GCA_000210735.all_seqs.fna". Do you have any explanation for this? Thanks a lot!

durrantmm commented 2 years ago

Great question! The genotyping step applies a filter to clusters, called "--filter-clusters-inferred-assembly". It removes clusters that were never identified from an assembly, meaning they were only found in the reference, which is why the later output files contain fewer clusters than "01.clusterseq". You can remove this filter if you make your own custom snakemake pipeline.
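To see which clusters were dropped between the two files, something like the following sketch may help. It uses only the Python standard library and assumes both files are tab-separated with a header row. The helper names `unique_values` and `dropped_values` are made up for illustration, and the column name `cluster` is an assumption, so check the header of your own files first.

```python
import csv


def unique_values(tsv_path, column):
    """Return the set of distinct values in one column of a tab-separated file."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[column] for row in reader}


def dropped_values(before_path, after_path, column):
    """Return values present in the earlier file but absent from the later one."""
    return unique_values(before_path, column) - unique_values(after_path, column)


# Hypothetical usage with the file names from the question
# (the column name "cluster" is an assumption, not confirmed):
#
#   missing = dropped_values(
#       "01.clusterseq.GCA_000210735.tsv",
#       "03.summarize.GCA_000210735.clusters.tsv",
#       column="cluster",
#   )
#   print(len(missing), "clusters were removed between step 01 and step 03")
```

If the explanation above is the whole story, the dropped clusters should be the ones that were only found in the reference.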