Closed miniluphy closed 8 months ago
I'm glad that you have found our tool useful! Our wiki may be useful, here are some excerpts:
The file 01.clusterseq.
.tsv a complete list of all inferred sequences for all ten files. This file includes the sample, pair_id (refers to results of the pair output files), method (sequence inference method, such as inferred_assembly_with_full_context), loc (the location of where the sequence was found, either in the reference genome, assembly, or dynamically-constructed database), the inferred_seq_length, the seqid (an identifier for each unique sequence), cluster (an identifier for the sequence cluster), group (an identifier for the cluster group), and the inferred_seq (nucleotide sequence of the element).
and
The files 04.makefasta.
.all_seqs.fna and 04.makefasta. .repr_seqs.fna include FASTA files of all identified elements found in the genotype and summarize files. The 04.makefasta. .all_seqs.fna includes all unique sequences identified (even if they differ by a single base pair), and 04.makefasta. .repr_seqs.fna includes the representative sequence for each element cluster.
So the first file lists all the sequences along with their insertion sites across all samples. Many of the sequences are identical. 04.makefasta.ref.all_seqs.txt lists the unique sequences found. So it looks like there are 4 different elements that are hopping around your Klebsiella genomes! If there is something in 01.clusterseq.ref.txt that is definitely not found in 04.makefasta.ref.all_seqs.txt, let me know and I'll take a closer look.
Thank you very much for your timely response and patient guidance; your assistance has been invaluable to me.
I have some additional questions that I hope you could shed some light on. Based on the concatenation of the “cluster”, “group”, and “seqid” from the 01.clusterseq.tsv file, I used it as the “inferred_seq” identifier. After removing duplicate and identical sequences, the original 455 sequences have been streamlined to 12 unique ones, numbered as c0g1s0, c1g2s1, c1g2s3, c1g2s4, c13g3s13, c13g3s17, c14g3s14, c50g4s50, c50g4s53, c248g6s248, c289g5s103, c289g5s289. Through mutual blastn comparisons, I confirmed that these 12 sequences exhibit at least one base pair difference. However, in the 04.makefasta.C8092.repr_seqs.fna file, I observed only four sequences (c0_g1_s0, c1_g2_s1, c1_g2_s3, c1_g2_s4), while the remaining eight were not included in the count. Could you please provide some insights into whether this is a normal outcome?
Concerning my input data, obtained through de novo assembly of the contig.fasta file from second-generation sequencing, the reference genome includes a total of five sequences, numbered 1, 2, 3, 4, and 5, including the plasmid. In the 'loc' (location) column, I encountered labels like NODE_103_xxx, representing the MGE's location in the contig.fasta. Regarding the format such as 1:xxx, 2:xxx, 3:xxx, etc., I understand it to signify the MGE's location in the reference genome. However, for expressions like "13:0-1196" and "17:0-1196," could I interpret these as the MGE being located in a dynamically constructed database?
Furthermore, I am puzzled about the meaning of “pair_id”. It represents the results of paired output files, but why do some samples have multiple pair_ids? For example, all MGEs' pair_ids for sample 1 belong to “1”, while sample 2's pair_ids are distributed across 1,2,3, and 4. Are there any direct impacts or differences in actual analysis based on this?
Lastly, and most crucially, does each record in the 01.clusterseq.tsv file represent an independent MGE? I plan to use this as the foundation for counting MGEs and would like to understand if there's a possibility of duplicate calculations in this process.
Once again, thank you for your patient assistance, and I eagerly await your response. 01.clusterseq.ref.xlsx
Apologies, but this is a lot for me to get through. It would be most helpful if you shared all the generated files with me in a single ZIP. No need to share the assemblies or the reads, just the output.
To respond to the final question, I suppose it depends on what you mean by "independent" MGE. I would settle on your own definition and then apply it to your data. Something like a blast all v. all search or a clustering sequences with mmseqs may make sense. A % identity cutoff of 90-95% is pretty standard for grouping MGEs into taxonomic groups.
Thank you very much for your prompt response. I have attached the analysis results for your reference.
I apologize for any previous lack of clarity in explaining the concept of "independent MGE," and I appreciate your understanding. In reality, my goal is to utilize the MGEfinder analysis results to count the number of mobile genetic elements (MGEs) in different bacterial strains. Based on my understanding, each record in the 01.clusterseq.ref.tsv file corresponds to one MGE. Therefore, I have used a straightforward summation approach to calculate the total count. Could you kindly confirm if this method is deemed appropriate?
Thank you again~ 03.results.zip
Thanks for the files. Could you share the command you used to run the pipeline?
I think I know the issue. The default pipelines assume many input samples and assemblies. How many samples and assemblies do you have?
MGEfinder infers the identity of sequences using several methods. By default, if it can only determine the sequence of an insertion by inferring its identity from the reference genome (and it's never found in an assembly), then it will remove insertions when it performs the genotype
step.
I think you should run a custom pipeline to increase the sensitivity, especially if you trust what you see in the 01.clusterseq* files.
First, download the snakemake files from here: https://github.com/bhattlab/MGEfinder/tree/master/mgefinder/workflow
You'll want to download the files denovo.original.Snakefile
and denovo.original.config.yml
, specifically.
Then edit the file denovo.original.config.yml
so that it looks like this:
genotype:
filter_clusters: False
Then you can rerun your workflow pointing to the new custom snakemake files.
mgefinder workflow custom --cores <number of thredas> --snakefile denovo.original.Snakefile --configfile denovo.original.config.yml myWorkdir
This should solve your problem, as long as you are confident that those are real mobile elements.
Thank you for your response. I have 18 samples and assemblies. The command I used was "mgefinder workflow denovo --cores 4 workdir/". With your guidance and the use of a custom pipeline, I have successfully generated 12 unique clusters. Thank you.
Great! Happy to help.
Greetings! Firstly, I would like to express my sincere gratitude for your dedicated efforts in developing the MGEfinder software. Your work has greatly facilitated our research endeavors.
I am currently engaged in the analysis of short-read sequencing data related to the evolutionary drug resistance in Klebsiella pneumoniae. The goal is to uncover whether there have been alterations in MGEs within the strains during their evolutionary process. Presently, I have completed the analysis for 18 strains.
However, I have encountered some queries while interpreting the analysis results. In the "04.makefasta.ref.all_seqs.fna" file generated by MGEfinder, I observed only 4 sequences, whereas in the "01.clusterseq.ref.tsv" file, I noticed the presence of 455 sequences labeled as "inferred_seq." After annotating these sequences using ISfinder, all were identified as Insertion Sequence (IS) elements.
I would like to take this opportunity to seek your guidance on whether the interpretation and handling of these results are correct. Particularly, given the occurrence of only 4 sequences in the "04.makefasta.ref.all_seqs.fna" file, is there a possibility of oversight or misunderstanding in the analysis process? Your professional guidance will play a crucial role in resolving this matter, and I sincerely appreciate your assistance once again.
Best regards
Kindly be informed that, for the ease of data upload, the file type has been modified to .txt. 04.makefasta.ref.all_seqs.txt 01.clusterseq.ref.txt