bhattlab / MGEfinder

A toolbox for identifying mobile genetic element (MGE) insertions from short-read sequencing data of bacterial isolates.
MIT License
105 stars 16 forks source link

Confusion in Interpreting the Results #49

Closed miniluphy closed 8 months ago

miniluphy commented 8 months ago

Greetings! Firstly, I would like to express my sincere gratitude for your dedicated efforts in developing the MGEfinder software. Your work has greatly facilitated our research endeavors.

I am currently engaged in the analysis of short-read sequencing data related to the evolutionary drug resistance in Klebsiella pneumoniae. The goal is to uncover whether there have been alterations in MGEs within the strains during their evolutionary process. Presently, I have completed the analysis for 18 strains.

However, I have encountered some queries while interpreting the analysis results. In the "04.makefasta.ref.all_seqs.fna" file generated by MGEfinder, I observed only 4 sequences, whereas in the "01.clusterseq.ref.tsv" file, I noticed the presence of 455 sequences labeled as "inferred_seq." After annotating these sequences using ISfinder, all were identified as Insertion Sequence (IS) elements.

I would like to take this opportunity to seek your guidance on whether the interpretation and handling of these results are correct. Particularly, given the occurrence of only 4 sequences in the "04.makefasta.ref.all_seqs.fna" file, is there a possibility of oversight or misunderstanding in the analysis process? Your professional guidance will play a crucial role in resolving this matter, and I sincerely appreciate your assistance once again.

Best regards

Kindly be informed that, for the ease of data upload, the file type has been modified to .txt. 04.makefasta.ref.all_seqs.txt 01.clusterseq.ref.txt

durrantmm commented 8 months ago

I'm glad that you have found our tool useful! Our wiki may be useful, here are some excerpts:

The file 01.clusterseq..tsv a complete list of all inferred sequences for all ten files. This file includes the sample, pair_id (refers to results of the pair output files), method (sequence inference method, such as inferred_assembly_with_full_context), loc (the location of where the sequence was found, either in the reference genome, assembly, or dynamically-constructed database), the inferred_seq_length, the seqid (an identifier for each unique sequence), cluster (an identifier for the sequence cluster), group (an identifier for the cluster group), and the inferred_seq (nucleotide sequence of the element).

and

The files 04.makefasta..all_seqs.fna and 04.makefasta..repr_seqs.fna include FASTA files of all identified elements found in the genotype and summarize files. The 04.makefasta..all_seqs.fna includes all unique sequences identified (even if they differ by a single base pair), and 04.makefasta..repr_seqs.fna includes the representative sequence for each element cluster.

So the first file lists all the sequences along with their insertion sites across all samples. Many of the sequences are identical. 04.makefasta.ref.all_seqs.txt lists the unique sequences found. So it looks like there are 4 different elements that are hopping around your Klebsiella genomes! If there is something in 01.clusterseq.ref.txt that is definitely not found in 04.makefasta.ref.all_seqs.txt, let me know and I'll take a closer look.

miniluphy commented 8 months ago

Thank you very much for your timely response and patient guidance; your assistance has been invaluable to me.

I have some additional questions that I hope you could shed some light on. Based on the concatenation of the “cluster”, “group”, and “seqid” from the 01.clusterseq.tsv file, I used it as the “inferred_seq” identifier. After removing duplicate and identical sequences, the original 455 sequences have been streamlined to 12 unique ones, numbered as c0g1s0, c1g2s1, c1g2s3, c1g2s4, c13g3s13, c13g3s17, c14g3s14, c50g4s50, c50g4s53, c248g6s248, c289g5s103, c289g5s289. Through mutual blastn comparisons, I confirmed that these 12 sequences exhibit at least one base pair difference. However, in the 04.makefasta.C8092.repr_seqs.fna file, I observed only four sequences (c0_g1_s0, c1_g2_s1, c1_g2_s3, c1_g2_s4), while the remaining eight were not included in the count. Could you please provide some insights into whether this is a normal outcome?

Concerning my input data, obtained through de novo assembly of the contig.fasta file from second-generation sequencing, the reference genome includes a total of five sequences, numbered 1, 2, 3, 4, and 5, including the plasmid. In the 'loc' (location) column, I encountered labels like NODE_103_xxx, representing the MGE's location in the contig.fasta. Regarding the format such as 1:xxx, 2:xxx, 3:xxx, etc., I understand it to signify the MGE's location in the reference genome. However, for expressions like "13:0-1196" and "17:0-1196," could I interpret these as the MGE being located in a dynamically constructed database?

Furthermore, I am puzzled about the meaning of “pair_id”. It represents the results of paired output files, but why do some samples have multiple pair_ids? For example, all MGEs' pair_ids for sample 1 belong to “1”, while sample 2's pair_ids are distributed across 1,2,3, and 4. Are there any direct impacts or differences in actual analysis based on this?

Lastly, and most crucially, does each record in the 01.clusterseq.tsv file represent an independent MGE? I plan to use this as the foundation for counting MGEs and would like to understand if there's a possibility of duplicate calculations in this process.

Once again, thank you for your patient assistance, and I eagerly await your response. 01.clusterseq.ref.xlsx

durrantmm commented 8 months ago

Apologies, but this is a lot for me to get through. It would be most helpful if you shared all the generated files with me in a single ZIP. No need to share the assemblies or the reads, just the output.

To respond to the final question, I suppose it depends on what you mean by "independent" MGE. I would settle on your own definition and then apply it to your data. Something like a blast all v. all search or a clustering sequences with mmseqs may make sense. A % identity cutoff of 90-95% is pretty standard for grouping MGEs into taxonomic groups.

miniluphy commented 8 months ago

Thank you very much for your prompt response. I have attached the analysis results for your reference.

I apologize for any previous lack of clarity in explaining the concept of "independent MGE," and I appreciate your understanding. In reality, my goal is to utilize the MGEfinder analysis results to count the number of mobile genetic elements (MGEs) in different bacterial strains. Based on my understanding, each record in the 01.clusterseq.ref.tsv file corresponds to one MGE. Therefore, I have used a straightforward summation approach to calculate the total count. Could you kindly confirm if this method is deemed appropriate?

Thank you again~ 03.results.zip

durrantmm commented 8 months ago

Thanks for the files. Could you share the command you used to run the pipeline?

durrantmm commented 8 months ago

I think I know the issue. The default pipelines assume many input samples and assemblies. How many samples and assemblies do you have?

MGEfinder infers the identity of sequences using several methods. By default, if it can only determine the sequence of an insertion by inferring its identity from the reference genome (and it's never found in an assembly), then it will remove insertions when it performs the genotype step.

I think you should run a custom pipeline to increase the sensitivity, especially if you trust what you see in the 01.clusterseq* files.

First, download the snakemake files from here: https://github.com/bhattlab/MGEfinder/tree/master/mgefinder/workflow

You'll want to download the files denovo.original.Snakefile and denovo.original.config.yml, specifically.

Then edit the file denovo.original.config.yml so that it looks like this:

genotype:
  filter_clusters: False

Then you can rerun your workflow pointing to the new custom snakemake files.

mgefinder workflow custom --cores <number of thredas> --snakefile  denovo.original.Snakefile --configfile denovo.original.config.yml myWorkdir

This should solve your problem, as long as you are confident that those are real mobile elements.

miniluphy commented 8 months ago

Thank you for your response. I have 18 samples and assemblies. The command I used was "mgefinder workflow denovo --cores 4 workdir/". With your guidance and the use of a custom pipeline, I have successfully generated 12 unique clusters. Thank you.

durrantmm commented 8 months ago

Great! Happy to help.